Introducing the Corpus of Conversational Uyghur

Accepted Paper

Michael Fiddler (Boğaziçi University)

Paper abstract

In this presentation, I will report on the construction of a spoken corpus of unscripted casual conversations in Uyghur (Turkic; ISO 639-3 uig; Glottolog uigh1240), recorded in naturalistic settings by speakers in diaspora communities. Contributors record the conversations in their homes or other local spaces using their own mobile phones or other recording devices, with no researcher present. Currently, eight conversations totaling ~2.5 hours have been recorded; the goal is 10+ conversations. Annotated transcripts, which include morphological analysis and English glossing and translation, have been produced for the first four conversations, and work is under way on the remaining material. The recordings and transcripts will be published on www.tilim.org, a website devoted to Uyghur language resources and maintained by the Uyghur Projects Foundation (UPF). We will have one version of the corpus webpage in Uyghur and another in English (and possibly other languages, if the need arises).

This corpus project, undertaken in collaboration with the Uyghur Projects Foundation, aims to contribute a new resource that can be of value for both community use and scholarly research. Within the Uyghur homeland in northwest China, Uyghur language and culture are facing intense repression, and diaspora communities also face the challenge of passing on the language to younger generations surrounded by a majority language such as Turkish or English. Uyghur diaspora scholars and activists have already been building resources such as collections of Uyghur-language books in PDF and audiobook form, language-learning and literacy resources for children and adult learners, and online media channels for television, music, and film (see, e.g., resources listed at www.tilim.org/ulanmilar).

The presentation will include discussion of methodological issues involved in this corpus project as well as the kinds of applications the corpus will be useful for. In terms of methodology, the Covid-19 pandemic necessitated remote data collection methods, and repressive tactics of the Chinese government made the security and privacy of the participants an important consideration. As a source of research data, the corpus is already being used to examine linguistic features such as stress and intonation, and the relationship between syntax and intonation. Further studies in morphology and syntax would certainly be feasible, as would studies of interactional aspects of conversation such as turn-taking and politeness. Finally, for heritage speakers or second-language learners of Uyghur, the conversations could serve as learning materials, either incorporated into classroom learning or used as an informal resource for independent study.

Panel LANG02
Applications of corpus methods in research (showcasing languages in Central and Northern Eurasia).
Session 1 Friday 20 October, 2023, 15:30-17:15