What does a corpus represent? An ethnographic perspective on corpus design for Kazakh

Accepted Paper

Giorgia Troiani (UC Santa Barbara)

Paper abstract

For researchers interested in any given language, a corpus of linguistic data that represents the language well is increasingly considered an essential tool. But if no corpus of the desired type exists – for example, no corpus of everyday conversation – the question arises: “Can we build one?” Practically, the next two questions are likely to be, “What are the best practices?” and “How much does it cost?” But first, a more basic question must be answered: “What will the corpus be used for?” Constructing a spoken corpus is expensive and time-consuming, and it is easier to justify if the corpus is built to last, serving the needs of a variety of users far into the future.

The history of modern Kazakhstan is complicated, reflecting its position along the traditional trade routes and migratory pathways of Central Asia, as well as the legacy of Soviet educational policies and the forced relocation of ethnic groups. The result is a complex linguistic landscape with fully institutionalized multilingualism on a grand scale. The coexistence of multiple official languages and heritage languages gives rise to contrasting definitions of fluency, where the question, “Who is a speaker?” may translate to “Who is to be represented?” In such a context, adopting an ethnographic perspective can help make sense of the linguistic and sociocultural complexity.

In this paper, we present the concept of corpus ethnography, arguing that the most effective representation of a language is one based on naturally occurring language use, as it emerges from the intrinsic motivations of language users engaged in the pursuit of their social lives. The ethnographic perspective motivates certain design decisions for corpus construction, shaping the preferred methods for recording, transcribing, and representing language in use. We present examples drawn from our experience as contributors to the design and construction of corpora of both high- and low-resource languages, in particular the Multimodal Corpus of Spoken Kazakh Language (built with a large team from Nazarbayev University), as well as the Santa Barbara Corpus of Spoken American English and the Corpus of Sakapultek Maya Narrative and Conversation. We present recordings of Kazakh to show how prioritizing the participants’ own motivations for interaction over purely structural linguistic criteria leads to an organic representation of language in everyday life. Speech genres range from everyday conversation to genres such as ritual that might be excluded from a traditional linguistic corpus.

Panel LANG01
Theoretical and methodological issues of language data representation in Central Asia
Session 1 Friday 20 October, 2023, 13:30-15:15