Rethinking Language Origins

Adam Kendon

Editor’s note: The following paragraphs are extracted from the concluding section of an essay that Kendon has been prompted to write by the recent book The Evolution of Language by Tecumseh Fitch (Cambridge University Press 2010). SemiotiX has published this essay in full, but here we present the paragraphs from it which summarize an approach to the problem of language origins which is not emphasised in Fitch’s book, but which Kendon believes is important for a more complete picture of the problem. It is suggested that more attention needs to be paid to the circumstances in which speakers realize utterances and to the way in which they do so as participants in occasions of interaction. The full version of Kendon’s essay can be found here.

So often, in language origins discussions, the ‘target’ to be explained in evolutionary terms is ‘language’ in the sense of the highly abstracted and idealized system that typically constitutes the object analyzed in linguistics. Although, indeed, an account of the development of such a system is needed, if the circumstances and manner of utterance production by actual human beings in concrete occasions of interaction are overlooked, important pieces of the puzzle we are trying to understand will almost certainly be missing.

Two issues of importance with respect to this are outlined in what follows. First of all, for linguistic utterances to be understood they must always be produced as embedded within occasions of interaction. It is therefore important to understand how occasions of interaction are built. After this, the point is made that whenever speakers produce utterances they always engage in a complex orchestration of vocal and visible bodily expressions, and no account of language evolution can overlook this.

“Behaviour is highly patterned and humans (and not only humans) are immersed in this patternedness from the beginning. We can gain some clues regarding this from the work of Erving Goffman (e.g. Goffman 1961, 1963, 1974), whose work is not usually mentioned in the context of language origins discussions. From this we understand how people characteristically enter together into occasions of what he has called “focused interaction”, in which they jointly agree upon what is relevant for the occasion and what may be disattended. Such jointly sustained attentional frames seem to be a fundamental feature of coherent interactions of any kind, and it is only by seeing how communicative exchanges depend upon the creation of such jointly constructed shared “micro-worlds” that we can come to see how mutual understanding is achieved.

Although it approaches the problem from a somewhat different perspective, this point is similar to the one that Tomasello (2008) has been arguing for: that shared referential understanding can only come about within a joint attentional framework. That is, for an action to have common referential significance it is necessary for the participants to somehow share an understanding that they are both attending to the same things in the same way. The achievement of this joint attentional frame may be accomplished in a variety of ways, but in fully co-present or non-mediated interactions much depends upon delicate coordination between movements and orientations of the participants. It is through such coordination that shared cooperative intentions can become manifest for the participants and so be established among them. In some of my own work of some years ago (Kendon 1985, reprinted in Kendon 1990: 239-262), for example, I described how the spatial-orientational systems that participants in focused interaction can enter into and cooperatively sustain play an important role in achieving the attentional “frame attunement” necessary for the common understanding in terms of which participants’ actions make sense. This need not be done by words or by gestures, but can be done by reciprocally sustained spatial and orientational manoeuvres. Accordingly, when it is seen that intelligible linguistic exchanges presuppose and depend upon the setting up of such joint attunements, we come to see that the very activity of uttering linguistic acts of some sort can only be understood when the setting up of interactional settings, the establishment of “participation frameworks”, is also understood. There are now a number of good descriptions of this for human interaction.

Studies of great ape interaction that take a comparable approach would be extremely useful. A few beginning steps have been made, for example in the work of Simone Pika with chimpanzees (Pika and Mitani 2009) and the work of Joanne Tanner on gorillas (Tanner 2004); see also Barbara King’s book The Dynamic Dance (King 2004). Such work will allow us to compare the organisation of occasions of interaction between species, not just the vocal and gestural signals they produce as discrete units of action (a common approach hitherto, see Call and Tomasello 2007), and this will greatly enrich our understanding of the circumstances in which the emergence of joint referentiality of actions, in whatever modality, might have been enabled. Almost nowhere in the language origins literature are issues of this sort discussed.

Another important feature of languaging, already alluded to above, is the fact that when speakers construct utterances they always do so through an orchestration of diverse semiotic resources (Goodwin 2000). Now although acknowledgement is often given to the fact that in speaking speakers also make use of “paralanguage” (intonational modulations in speech production and various kinds of kinesic accompaniments), these generally tend to be treated as auxiliary or decorative accompaniments and not as integral to the very activity by which an utterance is produced. We can, of course, write down a person’s words, presenting their utterances so that they seem to be made up only of words. However, this does not represent what was actually done when the utterance was produced. Whenever a person speaks he employs, in a completely integrated fashion, patterns of voicing and intonation, pausings and rhythmicities, which are manifested not only audibly, but kinesically as well. Always there are movements of the eyes, the eyelids, the eyebrows, the brows, as well as the mouth, and patterns of action by the head. In addition there are, from time to time, variously conspicuous hand and forearm actions or ‘gestures’ (as they are often called), as well as postural and orientational changes. All of this is produced in full orchestration with speech – complex and variable, to be sure, but always orchestrated – and must be seen as inseparable components of the utterance as the utterer produces it. Few theorists of language offer an account of this. This may be, of course, because most theorists of language hitherto have not seen it as their business to do so, since ‘language’ in most such cases is thought of as a self-contained, autonomous system that is confined to only one modality. But is this view of language anything other than a convenient abstraction? And if so, does it then constitute an appropriate target for evolutionary explanations?

This question, posed a long time ago by Bolinger in his “Thoughts on ‘yep’ and ‘nope’” published in 1946 (Bolinger 1946), has been posed again more strikingly, perhaps, as a result of the recent work on sign languages (Liddell 2003: 355-362). After William Stokoe’s demonstration in 1960 that the “visual communication system of the deaf” (as he called it) could be analysed, at least to a considerable extent, in terms of the analytic principles developed for spoken languages by Bloomfield and his followers, such as George Trager (who directly taught Stokoe), there developed a determination to demonstrate that such structuralist principles were completely adequate for the analysis of sign languages, for in this way it would be shown that sign languages were indeed languages, and not, as had been maintained for the prior eighty years or so, “mere pantomime” or “unsystematic gesture”. In doing this, the concept of ‘language’ as a self-contained system was extended to include sign languages, which meant that they also came to be conceived of as well-demarcated autonomous systems. However, because there is no tradition of writing for sign languages and so no pre-established criteria for deciding what is “in” the language and what is “outside” it, any attempt to suppose that there can or should be a separation between ‘paralinguistic’ features and ‘linguistic’ features becomes very problematic. In recent years it has become clear that central to the construction of utterances in sign languages are forms of expression such as ‘classifiers’, ‘constructed action’, or “highly iconic forms” (on this, see Cuxac and Sallandre 2007), as well as an exploitation of space that is not possible in speech, but which have much in common with various kinesic devices used by speakers (although these are not as systematic in speakers).

This proves to be an embarrassment to those who want to maintain a model of sign language that is derived from existing models in spoken language linguistics. On the other hand, this has also led others to suggest that when comparing spoken and signed language, the comparison should be with language as it is performed in speaking, for it is only the performed version of a sign language that is ever available. If this suggestion is followed, however, this means that, after all, from the point of view of how utterances are constructed, it is as essential to view what speakers do as an integrated performance in the study of spoken languages as it is in the study of sign languages.

In short, we may suggest that the ‘natural’ state of spoken language is a speech-kinesis ensemble. Presumably, this has always been the case. With the development of writing, and its ultimate emergence as an autonomous form of language with its own properties, which, nevertheless, has provided the dominant model for what ‘language’ is, at least since the end of the eighteenth century, we have ceased to see how gestures and other aspects of utterance performance are a part of “what is said.” In (relatively) recent history, as human cultures have developed to sustain ever larger units of social organisation, especially, we repeat, with the development of writing technologies, the separation and specialisation of modalities of communication have been favoured. In many glottogenetic discussions, it is the separated modality of written-down spoken language (which dominates our conception of language) that tends to be projected backwards to the earliest days of language, making it very difficult to imagine how it might have arisen.

We may suppose, however, that just as ‘languaging’ is, in fact, a poly-modalic activity today, so it must have been in its beginnings (incidentally, to take this point of view obviates the problem that “gesture first” scenarios have raised). This leaves us with the question as to why there is this poly-modality and why, in particular, when speakers speak (or signers sign, for that matter) they tend to mobilise all kinds of bodily resources beyond those that might seem necessary from a mono-modalic point of view. The model of ‘language’ as an autonomous mono-modalic system, which tends to be taken as the target in so many language origins discussions, is a system that is a product of latter-day reflections on language, greatly facilitated ever since systems of writing came to be regarded mainly as representations of spoken language. A model of language of this kind is not appropriate to apply in those primordial times when what were to become specifically human forms of communication were first emerging. ‘Language’, as it is so often conceived of in contemporary language evolution discussions, is a late differentiation from a complex and dynamic orchestration of communicative action. Furthermore, it is a continually emerging system. The ‘target’ of our evolutionary explanations perhaps should be re-formulated so that the poly-modality of utterance is taken as the starting point. If this is done then ‘language’, when it is considered in its mono-modalic form, can then be understood as a consequence of processes of specialisation and differentiation from poly-modal ensembles of action. Accounts of language origins can then be recast to become accounts of these processes of progressive specialisation and diversification, emerged and emerging systems that are shaped through an evolution that involves social interaction as much as biology.”



Bolinger, D. (1946). Some thoughts on “yep” and “nope”. American Speech, 21, 90-95.

Call, J. & Tomasello, M. (Eds.) (2007). The Gestural Communication of Apes and Monkeys. Mahwah, NJ: Lawrence Erlbaum Associates.

Cuxac, C. & Sallandre, M.-A. (2007). Iconicity and arbitrariness in French Sign Language: Highly iconic structures, degenerated iconicity and diagrammatic iconicity. In E. Pizzuto, P. Pietrandrea, & R. Simone (Eds.), Verbal and Sign Languages: Comparing Structures, Concepts, and Methodologies (pp. 13-33). Berlin: Mouton de Gruyter.

Goffman, E. (1961). Encounters. Indianapolis: Bobbs-Merrill.

Goffman, E. (1963). Behavior in Public Places. Notes on the Social Organization of Gatherings. New York: Free Press of Glencoe.

Goffman, E. (1974). Frame analysis. Cambridge, MA: Harvard University Press.

Goffman, E. (1981). Forms of talk. Philadelphia: University of Pennsylvania Press.

Goodwin, C. (2000). Action and embodiment within situated human interaction. Journal of Pragmatics, 32, 1489-1522.

Kendon, A. (1972). Some relationships between body motion and speech: An analysis of an example. In A. Siegman & B. Pope (Eds.), Studies in Dyadic Communication (pp. 177-210). Elmsford, New York: Pergamon Press.

Kendon, A. (1990). Conducting Interaction: Patterns of Behavior in Focused Encounters. Cambridge: Cambridge University Press.

Kendon, A. (2002). Historical observations on the relationship between research on sign languages and language origins theory. In D. Armstrong, M. A. Karchmer, & J. V. Van Cleve (Eds.), The Study of Signed Languages: Essays in Honor of William C. Stokoe (pp. 32-52). Washington, D.C.: Gallaudet University Press.

Kendon, A. (2004). Gesture: Visible Action as Utterance. Cambridge: Cambridge University Press.

Kendon, A. (2008). Signs for language origins? Public Journal of Semiotics, II (2): 2-29.

Kendon, A. (2009). Language’s matrix. Gesture, 9(3), 352-372.

Kendon, A. (In Press). ‘Gesture first’ or ‘Speech first’ in language origins? In D. J. Napoli & G. Mathur (Eds.), Deaf Around the World. New York: Oxford University Press.

King, B. J. (2004). The Dynamic Dance. Cambridge, MA: Harvard University Press.

Liddell, S. K. (2003). Grammar, Gesture and Meaning in American Sign Language. Cambridge: Cambridge University Press.

Pika, S. & Mitani, J. C. (2009). The directed scratch: evidence for a referential gesture in chimpanzees? In R. Botha & C. Knight (Eds.), The Prehistory of Language (pp. 166-180). Oxford: Oxford University Press.

Stokoe, W. C. (1960). Sign Language Structure: An outline of the visual communication systems of the American deaf. Studies in Linguistics Occasional Papers, 8, 1-78.

Tanner, J. E. (2004). Gestural phrases and gestural exchanges by a pair of zoo-living lowland gorillas. Gesture, 4(1), 1-24.
