LANG02: Applications of corpus methods in research (showcasing languages in Central and Northern Eurasia).

Convenor: Nikolay Mikhailov (Nazarbayev University)

Chair: Andrey Filchenko (Nazarbayev University)

Discussant: Nikolay Mikhailov (Nazarbayev University)

Format: Panel

Theme: Language & Linguistics

Location: William Pitt Union (WPU): room 837

Sessions: Friday 20 October, 15:30-17:15
Time zone: America/New_York

Abstract

Spoken language data can inform many types of research, offering insights that few other methodologies provide. Anthropologists, sociologists, and specialists in many other disciplines may incorporate corpus linguistics methods into their research to examine the phenomena under analysis through the lens of spoken language in its interactional nature: essentially, its utility for capturing life. The panel also aims to demonstrate both the significant research potential of spoken language corpus methodology and its still extremely limited application to the languages of Central and Northern Eurasia, exemplified by a small range of projects on Kazakh, Dungan, and Udmurt, among a few others.

This panel aims to bring together researchers who study language and related phenomena in Central and Northern Eurasia. The panelists will showcase projects that use spoken language data, in conjunction with other data about the region, culture, and society, to form a well-informed interdisciplinary research agenda.

The panel will be of interest to researchers, practitioners, and students in linguistics, language technology, anthropology, psychology, sociology, education, history, and other related fields, particularly those interested in the Central and Northern Eurasian languages and cultures.

Accepted papers

Session 1 Friday 20 October, 2023, 15:30-17:15

Linguistic corpus design as negotiation: A case study of Kazakhstani Gansu Dungan

Sami Honkasalo (University of Helsinki), Wulaer Nuerlan (Nazarbayev University), Zhamilya Abik (Nazarbayev University)

Paper abstract

Keywords: Dungan, linguistic corpora, field linguistics

Based on the authors’ ongoing project on Kazakhstani Gansu Dungan, the paper argues that creating a linguistic corpus for the grammatical description and documentation of an understudied language is more accurately interpreted as a process of negotiation between the initial vision and adaptation to whatever circumstances arise. Here, we follow Mosel’s (to appear) division of linguistic corpora into corpora of major languages and language documentation corpora, focusing only on the latter and addressing two methodological issues from the viewpoint of negotiation. First, discourse on corpora remains strongly teleological in a forward-looking sense: a linguistic need, such as documentation of a linguistic variety, is identified, and the resulting corpus, with its scope of materials, offers a solution to the issue. This, however, may be a mere ‘backwards projection’. As we will illustrate with Dungan, corpus design often evolves within the external constraints the project faces, so that the end result crystallizes only post factum and is not infrequently shaped by pure serendipity. Second, the existing discourse and practice of corpus design run the risk of both downplaying the ‘fuzzy boundaries of languages’ (see Weber and Horner 2012) and extracting idealized forms of languages. Many earlier descriptions of Dungan treat it largely as an ‘ordinary dialect of Mandarin’ (see e.g. Lin 2012), yet our documentation has shown that Kazakhstani Dungan has become a contact language intertwined with Russian, in which separating borrowing, code-switching, and code-mixing remains a challenge and possibly an unfruitful task. To address the issue in the context of Dungan, we propose the notion of a ‘corpus of communicative activities’. Rather than trying to extract an essentialized ideal Dungan language, this approach embraces the everyday discourse practices of the Dungan people. It is argued that the approach is broadly applicable, particularly in the multilingual Central Asian context.

References

Lin Tao (林涛). 2012. 东干语调查研究 [A Study of the Dungan Language in Central Asia]. Beijing: China Social Sciences Press.

Mosel, Ulrike. To appear. Corpus building for under-researched languages: A practical guide. In Firmin Ahoua, Dafydd Gibbon & Stavros Skopeteas (eds.), Linguistic Fieldwork and Language Documentation: A Course Book on Foundational Skills.

Weber, Jean-Jacques & Horner, Kristine. 2012. Introducing Multilingualism: A Social Approach. London and New York: Routledge.

Introducing the Corpus of Conversational Uyghur

Michael Fiddler (University of California, Santa Barbara)

Paper abstract

In this presentation, I will report on the construction of a spoken corpus consisting of unscripted casual conversations in Uyghur (Turkic; ISO 639-3 uig; Glottolog uigh1240) recorded in naturalistic settings by speakers in diaspora communities. The conversations are recorded by contributors in their homes or other local spaces using their own mobile phones or other recording devices, with no researcher present. Currently, eight conversations totaling ~2.5 hours have been recorded; the goal is 10+ conversations. Annotated transcripts, which include morphological analysis and English glossing and translation, have been produced for the first four conversations, and work is under way on the remaining material. The recordings and transcripts will be published on www.tilim.org, a website devoted to Uyghur language resources and maintained by the Uyghur Projects Foundation (UPF). We will have one version of the corpus webpage in Uyghur and another in English (and possibly other languages, if the need arises).

This corpus project, undertaken in collaboration with the Uyghur Projects Foundation, aims to contribute a new resource that can be of value for both community use and scholarly research. Within the Uyghur homeland in northwest China, Uyghur language and culture are facing intense repression, and diaspora communities also face the challenge of passing on the language to younger generations surrounded by a majority language such as Turkish or English. Uyghur diaspora scholars and activists have already been building resources such as collections of Uyghur-language books in PDF and audiobook form, language-learning and literacy resources for children and adult learners, and online media channels for television, music, and film (see, e.g., resources listed at www.tilim.org/ulanmilar).

The presentation will include discussion of the methodological issues involved in this corpus project as well as the kinds of applications the corpus will be useful for. In terms of methodology, the Covid-19 pandemic necessitated remote data collection methods, and the repressive tactics of the Chinese government made the security and privacy of the participants an important consideration. As a source of research data, the corpus is already being used to examine linguistic features such as stress and intonation, and the relationship between syntax and intonation. Further studies in morphology and syntax would certainly be feasible, as would studies of interactional aspects of conversation such as turn-taking and politeness. Finally, for heritage speakers or second-language learners of Uyghur, the conversations could serve as learning materials, either incorporated into classroom learning or as an informal resource for independent study.

‘We can’t beat the Westerners’: the role of speech rate and phonological reduction in shaping perceived rudeness of Western Kazakh speakers

Moldir Bizhanova (Nazarbayev University), Zhansaya Turaliyeva (Nazarbayev University)

Paper abstract

One of the most widespread language ideologies associated with Kazakh dialects sees speakers from the Western region as fast speakers (Bizhanova, 2022). This perception is linked to the self-reported inability of speakers from other regions to understand Western Kazakh (WK):

‘It seems that we can’t beat the Westerners [in speaking fast]. Yes, really. I don’t understand the words of people from the West region’ [MCSKL mobi141121.wav]

Folk accounts of the phenomenon indicate that acoustic correlates of speech (speech rate and phonological reduction) are the underlying reasons why WK speakers are perceived as fast speakers.

In this paper, we establish whether WK speakers actually exhibit a faster-than-average speech rate and more phonological reduction than speakers of other dialects. We base our analysis on naturally occurring Kazakh conversations featuring six people, representing the three dialects spoken in Kazakhstan: Southern, North-Eastern, and Western.

We first analyze speech rate. We extracted the narrative portions of each conversation (2 minutes per speaker), counted the syllables produced by each speaker, and divided the count by the duration in seconds. The speech rate values for the Southern dialect (SK) speakers were found to be higher than those for the other regions (5.67 syll/s for SK, 5.19 syll/s for WK, 4.58 syll/s for NEK). This analysis suggests that the measured speech rates do not support the ideology that WK speakers talk faster than average.
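The speech rate measure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' actual procedure: the function name `speech_rate` and the 600-syllable example are hypothetical, and only the three per-dialect rates are taken from the abstract.

```python
def speech_rate(syllable_count: int, duration_seconds: float) -> float:
    """Return speech rate in syllables per second."""
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    return syllable_count / duration_seconds


# Hypothetical example (not study data): 600 syllables in a 120-second excerpt.
example_rate = speech_rate(600, 120)

# Rates reported in the abstract, in syllables per second.
reported_rates = {"SK": 5.67, "WK": 5.19, "NEK": 4.58}

# Order the dialects from fastest to slowest reported rate.
ranking = sorted(reported_rates, key=reported_rates.get, reverse=True)

print(example_rate, ranking)
```

Ranking the reported values this way places SK first and WK second, which is the ordering the abstract describes.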

We then consider whether WK speakers exhibit higher-than-average rates of phonological reduction in conversation. We restrict our focus to verbal constructions because they are an element that is often self-reported by speakers as being a main site for reduction. We are currently in the process of finalizing the analysis of the data for this portion of the study. Preliminary results suggest that WK speakers exhibit phonological reduction (final vowel drop and final consonant sonorization).

We complemented these results with the analysis of sociolinguistic interviews in which we asked speakers how they perceived their speech to differ from that of speakers of other regions. In general, respondents indicated that WK speakers talk faster than average. We propose that, in a process of iconicization, participants have indexically ascribed personality features (“aggressiveness”, “harshness” and “rudeness”) to WK speakers, explicitly linking these qualities to the supposed faster-than-average speech of WK speakers. The results from the quantitative analysis suggest that speech rate is not responsible for the perceived speed of WK speakers, but phonological reduction may play a role.

Distribution of relative clause constructions in Kazakh conversation

Akyl Akanov (Nazarbayev University), Giorgia Troiani (UC Santa Barbara)

Paper abstract

In a conversation, speakers need to make the objects/people they are talking about (referents) concrete for interlocutors. This can be achieved through relative clauses (RCs). The relative clause in (1) isolates and identifies one specific referent from a group, while the RC in (2) adds a commentary about a referent that is already identified.

1. The class [that I attended yesterday] was interesting.

2. My aunt, [who lives in Almaty], came to visit me.

Traditionally, studies on RCs focus on formal syntactic properties, either in a language-specific (McCawley, 1981) or cross-linguistic perspective (Andrews, 2007). However, recent research suggests that speakers make systematic decisions about which form of RCs to use based on their judgment of the interlocutor’s state of knowledge (Fox and Thompson, 2007).

In this paper, we apply an interactional approach to the study of relative clauses in Kazakh (Turkic). We account for the distribution and patterns of use of RCs in everyday Kazakh conversations through the analysis of 6 naturally occurring informal conversations (3 hours). Contrary to traditional grammatical accounts, our analysis indicates that Kazakh speakers use two types of RC constructions: a Turkic-type construction (3), which exhibits a gap, a common relativization strategy found across the languages of the Turkic family, and an Indo-European-type construction (4), which employs a relative pronoun (kotorıy), a relativization strategy that is typical of the Indo-European family.

3. Ol [satıp alğan] qarmalar öŋkey däw eken.

“All the karma breads [that he bought] were apparently very big.”

4. Wnïversïtette küşti vraç bar, [kotorıy atın bilmeymin men].

“There is a great doctor at university, [whose name I do not know].”

Our data suggest that Kazakh speakers have developed an Indo-European-type relativization strategy under the influence of Russian. The addition of this strategy to the system caused a shift in the meaning of each construction. While Turkic-type constructions (3) take on the function of restricting the referent (cf. 1), Indo-European-type constructions (4) specialize in supplying additional information or commentary about a referent (cf. 2).

These findings link grammar to interaction and uncover the sociocultural underpinnings of language use. Firstly, they demonstrate the importance of observing linguistic structures in their interactional context and not as a set of syntactic features in isolation. Secondly, they suggest that contact-induced language change is not a purely linguistic phenomenon, but rather a development with far-reaching implications for the social organization and management of interactions in a culture.