This article appeared in Volume 1 (2) of The Semiotic Review of Books.

From Thought To Speech

by Ronald H. Smyth

Speaking: From Intention to Articulation. By William J.M. Levelt. Cambridge, Massachusetts: The MIT Press, 1989. xvii + 499 pp. Includes index. (ISBN 0-262-12137-9)

The first volume in the new ACL-MIT Press Series in Natural-Language Processing is the only truly comprehensive treatment of the psycholinguistic literature on language production ever to appear in print. The scope of Levelt's accomplishment is not to be underestimated: with more than 600 well-chosen references, from the earliest speculative writings to the experimental work of the late 1980s, this book will serve for many years to come as the authoritative source on the elusive question of how thought is translated into speech.

Why have we had to wait so long for someone to synthesize this literature? One reason is the mistaken impression among psycholinguists that relatively little research is done on production, because of the supposed difficulty of exerting sufficient experimental control over the stimulus situation. This impression is mistaken on two counts. First, as Levelt points out in his Preface, the literature on language production is in fact enormous, but it has been the victim of academic compartmentalization. He laments the lack of cross-fertilization:

Other disciplines have asked the questions that psycholinguists have ignored. Students of conversational analysis, pragmatics, discourse semantics, artificial intelligence, syntax, phonology, speech communication, and phonetics have contributed myriad theoretical insights and empirical findings (p. xiii).

Second, the assumption that comprehension experiments are intrinsically easier to design is equally erroneous (Garrett, 1980). In a comprehension study, one typically works from linguistic theory to hypotheses about how a particular structure is processed, then presents that structure to subjects for some measurable response, such as reading time, priming, recall or recognition. In production, according to this view, one has little guarantee that manipulating the stimulus situation will yield the kinds of output one is interested in studying. Perhaps the most pessimistic account of this difference is to be found in Deese's introduction to his 1984 monograph, Thought into Speech: The Psychology of a Language:

I soon became convinced that there are no techniques of investigation nor are there any methods of analysis that would enable anyone to give an account of that transformation having the kind of validity we associate with the analytic study of processes in science. We simply do not know how it is accomplished, and despite the availability of tools of physical measurement, such as the cortical evoked potential, we are very unlikely to in the near future (pp. v-vi).

Deese apparently ignored Butterworth's (1980) cogent refutation of this sort of attitude. Butterworth shows that neither models nor research methods in comprehension have privileged scientific status. Indeed, the range of research methods for production is at least as great as for comprehension. Moreover, the representations underlying production are not the only intangibles in psycholinguistics. The products of comprehension are equally remote from direct observation, and are only indirectly inferable from the subject's overt responses, such as button presses or question answering. Levelt's book, with its fine-grained, empirically supported picture of production, is perhaps the most convincing rebuttal of this pessimistic attitude.

Speaking is in essence a model of the three processing components hypothesized to comprise the human speech production mechanism. The Conceptualizer is the thought/language interface, generating language-specific "preverbal messages" which act as input to the Formulator. The Formulator encodes preverbal messages both grammatically and phonologically, generating surface structures in the form of phonetic plans, which then act as input to the Articulator. Finally, the Articulator executes the phonetic plans as coordinated muscle movements: overt speech. The model is restricted, as a first approximation, to the spontaneous speech of normal adults; child language, paralanguage, communicative disorders, and the process of reading aloud are explicitly excluded.
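As a rough illustration of this three-stage architecture, and not of Levelt's own formalism, the flow of representations from one component to the next can be sketched in Python. Every class, function and field name below is hypothetical, the toy form lexicon is invented, and the simple lists and strings merely stand in for the far richer representations described in the book.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PreverbalMessage:
    """Hypothetical stand-in for the Conceptualizer's output."""
    concepts: List[str]


@dataclass
class PhoneticPlan:
    """Hypothetical stand-in for the Formulator's output."""
    syllables: List[str]


# Invented toy form lexicon (concept -> syllables), for illustration only.
TOY_FORMS = {"DOG": ["dog"], "BARK": ["barks"]}


def conceptualize(intention: str) -> PreverbalMessage:
    # Thought/language interface: turn an intention into a preverbal message.
    return PreverbalMessage(concepts=intention.upper().split())


def formulate(message: PreverbalMessage) -> PhoneticPlan:
    # Grammatical and phonological encoding, collapsed here into a lookup.
    syllables = [s for c in message.concepts for s in TOY_FORMS[c]]
    return PhoneticPlan(syllables=syllables)


def articulate(plan: PhoneticPlan) -> str:
    # Execution of the plan; a joined string stands in for overt speech.
    return " ".join(plan.syllables)


def speak(intention: str) -> str:
    # Each component consumes only the output of the one before it.
    return articulate(formulate(conceptualize(intention)))
```

The only point the sketch is meant to carry is the one-way hand-off: each stage sees nothing but the representation produced by its predecessor.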

This basic model is conceptually transparent, but Levelt's real accomplishment lies in his extremely rich and detailed discussion of the internal structure of each component, the kinds of representations produced at each substage, and the real-time interaction among components and subcomponents of the system. Marshalling evidence from formal linguistics and experimental psycholinguistics, he methodically displays the significant progress already made toward an understanding of what others have regarded as the last impenetrable domain of psycholinguistics. He is also careful to point out areas in which further work begs to be done. By showing us what "we" know collectively about production, Speaking makes this knowledge accessible to each of us individually.

The organization of the book follows the stages of real-time language production. Chapters 1 and 2 deal with background issues, including the basic system architecture and the role of conversational contexts. The core of the book consists of nine chapters on the nature of the representations generated by each component in turn: Chapters 3 and 4 cover preverbal message structure and generation; Chapters 5, 6 and 7 deal with the encoding of messages as lexical items and surface structures; and Chapters 8, 9, 10 and 11 describe the phonological encoding of both words and longer stretches of connected speech, as well as speech motor control. Chapter 12 deals with how speakers use their speech comprehension system to monitor their own output in order to ensure that it adheres to their conceptual, grammatical and articulatory intentions.

In Chapter 1, The Speaker as Information Processor, Levelt devotes over six pages to an analysis of the production of a single sentence uttered by a student in reply to a question from a professor. The phenomenological approach sets the stage for the more technical modelling to come in later chapters. The discussion then moves to more general issues: the autonomy of the three subprocessors, using standard arguments for modularity in language processing (e.g. Fodor, 1983); the division of duties between centrally controlled and automatic processes; and the system's ability to run its autonomous processors in parallel by means of Incremental Production. Each of these sections offers the reader an excellent introduction to a central issue in psycholinguistic theory.

Chapter 2, The Speaker as Interlocutor, confronts issues arising from the fact that speech is other-directed. While this chapter has little to say about the actual process of message or sentence production, it does specify the kinds of choices imposed by the conversational context. These include interactional variables (turn-taking, engagement and disengagement moves); Grice's (1975) conversational maxims; person, place and time deixis; and the nature of intentionality in speech, as reflected in speech act theory. In each case the coverage is finely tuned to the needs of the subsequent chapters, yet the sections are worth reading as a general introduction to each of these fields of inquiry.

Chapter 3, The Structure of Messages, is an attempt to specify the features of preverbal messages. The main constraint arises from modularity assumptions: representations in one module must serve as the only kind of input representation to the next module (which in this case is the Formulator). In other words, preverbal messages must be ready for grammatical encoding, which means that they must organize the message into categories (persons, events, states, times, directions, etc.); assign thematic roles such as Agent, Location, Recipient, or Experiencer to arguments (e.g., Jackendoff, 1972, 1983); assign perspective and information structure (minimally, topic vs. comment status; given vs. new information distribution; focus; mood and modality; aspect; and deixis); and select language-specific semantic features that happen to require morphological or syntactic specification in the particular language (e.g., Malinowski's (1920) description of classifiers in Kilivila). In addition, although the cognitive system as a whole may communicate internally using spatial, kinaesthetic, propositional and perhaps other representational systems, Levelt argues that preverbal messages (semantic representations of intended utterances) must be propositional. All of these concepts are familiar from the formal study of semantics; what is interesting is the need to conceive of their selection as a cognitive process under the control of a higher-level message plan.

Chapter 4, The Generation of Messages, introduces the notions of macroplanning and microplanning as distinct stages in the process of message encoding. Macroplanning involves the development of a semantic intention, such as informing a person of how to get from the train station to a particular part of the city. This will typically be subdivided into subgoals, such as telling her how to get to a particular landmark, from there to a subway station, and so on. As each subgoal is formulated, it is translated into a speech act intention by specifying the content and mood (declarative, interrogative, imperative). Macroplanning is thus the selection and sequencing of speech acts relating to different basic semantic properties of the message.

Microplanning is the process of translating the speech acts into propositional format, and filling in the perspective and information structure, as well as any obligatory language-specific features. Incremental processing allows these stages to be interleaved, and various "bookkeeping" tasks, such as checking the current topic, presuppositions, allowable inferences, focus, and other relevant aspects of the conversation, deploy memory and attentional resources to keep the speaker's messages on track. At this point, Levelt begins to introduce experimental work bearing on the nature of the macro- and microplanning stages, such as the cyclic nature of hesitant and fluent phases in speech, the effects of the number of objects in an array on the specificity of referents, and adherence to natural order when describing temporal sequences.

Chapter 5, Surface Structure, is a short tutorial on the general form of sentence representation in generative grammar. Some readers will recognize the influence of Bresnan's (e.g., 1982) Lexical Functional Grammar, both here and in later sections on lexical entries. Levelt presents just enough grammatical theory to prepare the reader for the subsequent 100 pages on the generation of surface structures. Basic properties of phrase structure, categorial and constituent information, the distinction between configurational and nonconfigurational languages, subcategorization, internal vs. external arguments, modification structures and specifiers are all given brief treatment, as is the assignment of prosody on the basis of mood and focus.

Chapter 6, Lexical Entries and Accessing Lemmas, details the structure of the mental lexicon, granting it a central role in speech production. Briefly, it is claimed that preverbal messages trigger lexical items into an active state, but that only part of a lexical entry, its "lemma" (semantic representation and syntactic constraints), is relevant to the Formulator's grammatical encoding. For example, a message plan involving the transfer of possession of a thing from one person to another triggers the lemma for the lexical entry "give". The lemma information imposes the morphological requirement that "give" be inflected for tense, aspect, mood, person and number; the prosodic requirement that it be assigned a particular pitch accent; and the syntactic requirement that it take three arguments with specific thematic and syntactic roles: a subject Agent, a direct object Theme, and an indirect object Goal. It also has a "lexical pointer" which directs the production mechanism to the memory address where the various forms are stored (give, gives, gave, given, giving), but form retrieval is claimed to be a separate type of processing. Whether it is a separate and subsequent stage of processing is unclear. Weak evidence comes from the tip-of-the-tongue (TOT) state: one knows the meaning of the elusive word, its part of speech, and its place in the sentence, but not its phonological identity.

Lexical access during production is exceedingly fast: at a moderate rate of 150 words per minute, a speaker with an average vocabulary selects one lemma from among 30,000 choices every 400 milliseconds! Speech errors such as blends (e.g., "stummy" for "stomach" and "tummy") and word exchanges ("the page fits the text" for "the text fits the page") further suggest that multiple lemmas are accessed in parallel. Yet it is unclear how the conceptual features of preverbal messages succeed in activating linguistically represented lemmas. Levelt reviews four models, evaluating their capacity to account for convergence on a single item: the Logogen Model (Morton, 1969, 1979), the Discrimination Network Model (Goldman, 1975), the Decision Table Model (Miller and Johnson-Laird, 1976), and the Spreading Activation Model (e.g., Dell, 1986, 1988). The chapter ends with a taxonomy of speech errors which can be explained by failures in lemma access, and with a model of the time course of lexical access during speech, based largely on timed picture naming tasks under various priming conditions.
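The 400-millisecond figure cited above for lemma selection follows directly from the stated speaking rate, on the simplifying assumption that words are evenly spaced in time:

```latex
\[
\frac{60\,000\ \text{ms per minute}}{150\ \text{words per minute}}
  = 400\ \text{ms per word}
\]
```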

Chapter 7, The Generation of Surface Structure, presents a particular model of the Grammatical Encoder which is both lexically driven and incremental (Kempen and Hoenkamp, 1982, 1987). A lemma's syntactic category initiates a dedicated "categorial" procedure. For example, the lemma of "boys" is an N, which can only be the head of a noun phrase, so the NP procedure is triggered. It then looks for modifying or specifying information, and transfers control to "functional procedures" to build determiners, plural, and so on. The functional procedures hand the resulting representation back to the categorial procedure, which places it in the leftmost slot it is allowed to occupy. Procedures for more salient lemmas will be completed first, accounting for their tendency to occur, where there is a choice, earlier in the sentence. Finally, the categorial procedure gives its output a functional destination as head or complement of some higher-order procedure. Much of the discussion is intended to be an existence proof, demonstrating the plausibility of the model by showing how it can be made to work for particular sentences. However, there are also numerous examples of speech errors that the model can account for, such as the exchange of NPs in "The child gave the cat the MOTHER", where the NP procedures receive the wrong head nouns, yet blindly assign focus and stress. Similarly, "Seymour sliced the knife with a SALAMI" shows stranding of both accent and definiteness.

The latter part of this chapter presents experimental results relating to grammatical encoding, suggesting that the "encoding rhythm" follows major semantic, rather than syntactic units; that highly topical and accessible information is usually encoded earlier and/or in grammatical roles that are higher in a syntactic importance hierarchy; and that phonological encoding has no effect on grammatical encoding, as predicted by the modularity hypothesis. For example, Bock (1986) asked subjects to describe a picture of lightning striking a church. Semantic primes like "worship" and "thunder" influenced which NP appeared in subject position ("The church is being hit by lightning" vs. "Lightning is hitting the church"). However, phonological primes such as "search" or "frightening" had no such effect.

Chapter 8, Phonetic Plans for Words and Connected Speech, covers the target representations for the output of the Formulator: phonetic plans, which serve as input to the Articulator. Like Chapter 5, this is essentially a tutorial, this time on phonological theory. Once again, Levelt presents a beautifully concise summary of the issues, which he divides, as the title suggests, into separate discussions of phonetic plans for individual words in the form lexicon, and of phonological principles governing the concatenation of words in connected speech. The problem is this:

...the phonetic plan is a rhythmic (re-)syllabification of a string of segments. Each word's segments and basic rhythm are somehow stored in the form lexicon. When these word patterns are concatenated, new patterns arise. Segments may get lost or added, particularly at word boundaries. Syllables may be formed that cross word boundaries. Word accents are shifted to create a more regular rhythm for the larger string as a whole, and so on. (p. 284)

Levelt stresses the interesting fact that, unlike surface structures, phonetic plans can be accessed consciously, as internal (subvocal) speech: the speaker can choose whether or not to execute them. Following Selkirk's (1982, 1984) analysis, he outlines principles of derivation, inflection and compounding, and proceeds to describe the skeletal, syllable, segment, metrical and intonation tiers. The system of representation for connected speech follows the same partitioning into tiers, but the processes alluded to above alter the details of the representation. For example, reduced allomorphs may be chosen for auxiliary verbs ("Dick is running" - "Dick's running"), creating a new phonological word (Dick's). Words are then grouped into intonational phrases of variable length for the application of rhythm, rate, and intonation contour values, and so on.

Chapter 9, Generating Phonetic Plans for Words, addresses the question of how the Formulator retrieves the segmental spellout of each word in the surface structure, but sets aside for Chapter 10 all discussion of resyllabification in connected speech. Levelt contrasts two accounts of phonological encoding: the slots-and-fillers approach (e.g., Shattuck-Hufnagel, 1979), and the spreading-activation approach (e.g., Dell, 1986). Under the former model, there is a complete separation between the establishment of structural slots and the later filling of these slots with units that are spelled out independently. Under the latter model, fillers are nodes at different levels of representation (lemma, morpheme, syllable, syllabic constituent, segment), and activation spreads along arcs to both lower and higher levels.

Much of the research on the nature of phonetic plans is based on everyday pathologies of speech. For example, people in the TOT state are quite good at guessing the initial and final phoneme or cluster of the elusive word, the number of syllables, and which syllable is stressed. This serves as evidence that when a lemma is accessed, form information does not necessarily follow on its heels. The use of speech error data to decide such issues is a much better-known area of investigation. One of the many topics covered here is the moveability of sublexical units such as morphemes ("I hate raining on a hitchy day"), syllables ("pussy cat" - "cassy put"), syllabic constituents ("space food" - "face spood"), segments out of syllable constituents ("feel like playing" - "peel like flaying", where the p is part of the syllabic onset), and so on. Levelt's discussion focusses on the degree to which each of the models can account for both spontaneous and laboratory-induced errors.

In Chapter 10, Generating Phonetic Plans for Connected Speech, the discussion moves to the morphological, segmental, and prosodic alterations made to word plans by the Prosody Generator, which accepts four kinds of input: surface structure and pitch-accent information for prosody adjustments; metrical spellout for metrical adjustments; attitude and emotional variables for adjustments to speech rate, pauses, volume etc.; and segmental spellout for adjustments to the structure of phonological words.

Here too, speech errors provide crucial evidence. Shifts such as "We tried it making...making it with gravy" ignore the usual constraint on word exchanges, i.e., that both elements be of the same syntactic category. Levelt suggests that these otherwise anomalous errors occur in the transition from surface structure to morphological and metrical spellout. Very fast spellout for highly frequent words results in filling the wrong address frame; because metrical structure has already been assigned at this point, the shifted word's pitch accent will remain, rather than being stranded as we saw in the earlier examples. Observations of natural speech can also serve to test prosodic hypotheses. For example, Brown, Currie and Kenworthy (1980) showed that the range of pitch movement is greater when a speaker introduces a new topic, and is reduced as the topic becomes exhausted.

In Chapter 11, Articulating, we examine the Articulator, which does not create representations, but simply executes them. Levelt deals with three issues: the nature of the Articulatory Buffer, speech articulation and acoustics, and speech motor control.

Evidence for a special temporary store for phonetic plans comes from measures of voice onset latency. Klapp, Anderson and Berrian (1973, 1976) had subjects read one- and two-syllable words with the same number of letters (e.g., "clock" and "camel"), and found the latter to take 14 msec longer than the former. This was not due to differences in initial word perception; subjects who merely said "yes" or "no" depending on the category (animal or object) assigned to them showed no such difference. Subjects who waited 3 seconds after stimulus onset for a "Go" signal likewise showed no difference, which means that the effect is one of phonological encoding, and not of utterance initiation. Further evidence shows that there are separate, measurable stages of retrieval from the buffer, unpacking, and command execution.

The contents of the sections on speech production and acoustics are likely to be familiar to most readers who choose to tackle Levelt's book, but the section on speech motor control is a welcome addition. At issue is whether there is a division of duties between invariant articulatory programmes and a context-dependent execution system which translates them into neuro-muscular activity appropriate to the current speech situation, such as compensation for food in the mouth. One argument is based on theoretical parsimony: the infinite variability of contexts of execution would require an infinite number of motor programmes, and a concomitant increase in the amount of motor planning required of the Articulator. Levelt uses this observation to dismiss two theories of motor control, and a variety of other arguments to dismiss three more. He then adopts the Coordinative Structures model (Easton, 1972), which is based on the observation that a given set of muscles can be organized into different systems according to its current task. For example, chewing involves the same muscles as speech, but coordination patterns vary immensely between the two activities.

Because there is relatively little overlap between muscle systems for consonant and vowel production, co-production of both segment types should lead to optimal production efficiency. The chapter concludes with evidence that the vocal tract, in its speech mode, is a coordinative structure set to produce strings of syllables.

Chapter 12, Self-Monitoring and Self-Repair, examines how speakers detect and correct production errors which, as Levelt demonstrates, can occur at any stage in the process. That there is selective attention during monitoring is evident from speech-error-inducing studies such as Motley (1980). Subjects see a series of biasing word pairs, such as "ball-dome, bed-deep...", then pronounce a target pair which breaks the pattern, such as "darn-bore". Non-word slips are edited out more often than real-word slips such as "barn-door", unless all biasing items are non-words such as "bep-deeb". In other studies, socially unacceptable errors ("tool-kits" - "cool tits") are less frequent than acceptable ones ("darn-bore"). Levelt rejects spreading-activation accounts of monitoring, which posit a single system for production and comprehension. In his model, the editor is the speaker's own language comprehension system, as suggested by the fact that more errors go undetected when subjects are exposed to white noise. Analysis of where an utterance is interrupted for repair, the kinds of editing expressions used (see James, 1972, 1973 for semantic differences among uh, oh, and ah), restarting methods (instant vs. retracing), prosodic signals that a repair is in progress, and on-the-fly repairs which lead to syntactic distortions (e.g., "It seems to be a good marriage, of her parents") are all given clear treatment. Of special interest is Levelt's claim that retracing to constituent boundaries, commonly held to signify that constituents are planning units, is trivial. Due to the properties of constituents in right-branching languages, 89% of the words in his own repair corpus begin constituents!

Although Speaking is a book for the specialist, it offers much of interest to the curious outsider seeking to understand the enterprise of psycholinguistic modelling. Levelt's modest admission that "this book is incomplete and theoretically wanting, even in the areas on which it focuses" (p. xiv) is no doubt true, but this has far more to do with the lack of previous efforts to organize the work on production than with any failing on his part.


Bock, K. (1986) "Meaning, sound, and syntax: Lexical priming in sentence production." Journal of Experimental Psychology: Learning, Memory, and Cognition 12, 575-586.

Bresnan, J.W. (Ed.) (1982) The mental representation of grammatical relations. Cambridge, MA: MIT Press.

Brown, G., Currie, K.L. & Kenworthy, J. (1980) Questions of intonation. London: Croom Helm.

Butterworth, B. (1980) "Introduction." In B. Butterworth (Ed.), Language production: Vol. 1. Speech and talk. London: Academic Press.

Deese, J. (1984) Thought into speech: The psychology of a language. Englewood Cliffs, NJ: Prentice-Hall.

Dell, G.S. (1986) "A spreading activation theory of retrieval in sentence production." Psychological Review 93, 283-321.

---. (1988) "The retrieval of phonological forms in production: Tests of predictions from a connectionist model." Journal of Memory and Language 27, 124-142.

Easton, T.A. (1972) "On the normal uses of reflexes." American Scientist 60, 591-599.

Fodor, J. (1983) The modularity of mind. Cambridge, MA: MIT Press.

Garrett, M. (1980) "Levels of processing in sentence production." In B. Butterworth (Ed.), Language production: Vol. 1. Speech and talk. London: Academic Press.

Goldman, N. (1975) "Conceptual generation." In R. Schank (Ed.), Conceptual information processing. Amsterdam: North-Holland.

Grice, H.P. (1975) "Logic and conversation." In P. Cole & J.L. Morgan (Eds.), Syntax and semantics: Vol. 3. Speech acts. New York: Academic Press.

Jackendoff, R. (1972) Semantic interpretation in generative grammar. Cambridge, MA: MIT Press.

---. (1983) Semantics and cognition. Cambridge, MA: MIT Press.

James, D. (1972) "Some aspects of the syntax and semantics of interjections." Papers from the Eighth Regional Meeting. Chicago Linguistic Society.

---. (1973) "Another look at, say, some grammatical constraints on, oh, interjections and hesitations." Papers from the Ninth Regional Meeting. Chicago Linguistic Society.

Kempen, G. & Hoenkamp, E. (1982) "Incremental sentence generation: Implications for the structure of a syntactic processor." In J. Horecky (Ed.), Proceedings of the Ninth International Conference on Computational Linguistics. Amsterdam: North-Holland.

---. (1987) "An incremental procedural grammar for sentence formulation." Cognitive Science 11, 201-258.

Klapp, S.T., Anderson, W.G. & Berrian, R.W. (1973) "Implicit speech in reading reconsidered." Journal of Experimental Psychology 100, 365-374.

Malinowski, B. (1920) "Classificatory particles in the language of Kiriwina." Bulletin of the School of Oriental Studies 1, 33-78.

Miller, G.A. & Johnson-Laird, P.N. (1976) Language and perception. Cambridge, MA: Harvard University Press.

Morton, J. (1969) "The interaction of information in word-recognition." Psychological Review 76, 165-178.

---. (1979) "Word recognition." In J. Morton & J. Marshall (Eds.), Psycholinguistics: Series 2.

Motley, M.T. (1980) "Verification of 'Freudian slips' and semantic prearticulatory editing via laboratory induced spoonerisms." In V.A. Fromkin (Ed.), Errors in linguistic performance: Slips of the tongue, ear, pen, and hand. New York: Academic Press.

Selkirk, E. (1982) The syntax of words (Linguistic Inquiry Monograph 7). Cambridge, MA: MIT Press.

---. (1984) Phonology and syntax: The relation between sound and structure. Cambridge, MA: MIT Press.

Shattuck-Hufnagel, S. (1979) "Speech errors as evidence for a serial order mechanism in sentence production." In W.E. Cooper & E.C.T. Walker (Eds.), Sentence processing: Psycholinguistic studies presented to Merrill Garrett. Hillsdale, NJ: Lawrence Erlbaum.

Ron Smyth is an assistant professor of Linguistics and Psychology at the University of Toronto's Scarborough Campus. His research involves theoretical issues in both linguistic and cognitive theory as they relate to language comprehension and acquisition.
