T08LANG: Exploring Low-Resource Languages through Corpus Work: Challenges, Innovations, and Insights

Convenor: Nikolay Mikhailov (Nazarbayev University)

Chair: Nikolay Mikhailov (Nazarbayev University)

Discussant: Nikolay Mikhailov (Nazarbayev University)

Format: Panel

Theme: Language & Linguistics

Location: Hall of Turan civilization (Floor 1)

Sessions: Thursday 6 June, 13:00-14:45
Time zone: Asia/Almaty

Abstract

This panel explores the landscape of low-resource languages (LRLs) in Central Asia, taking Kazakh as an example, by examining the methodologies, challenges, and discoveries of corpus linguistics as applied to such languages. Recent innovations in computational technologies have drawn researchers from various disciplines to the study of LRLs. However, the scarcity of data and tools poses major challenges for comprehensive linguistic analysis and documentation.

The panel brings together researchers who have navigated these challenges and made significant contributions towards corpus work on LRLs, specifically the Kazakh language. The studies presented in the panel will highlight the approaches and methodologies employed in collecting, annotating, and analyzing linguistic data from low-resource settings. Furthermore, they will address the ethical considerations, emphasizing the importance of respectful collaboration and responsible data stewardship concerning natural spoken language data.

Key themes include methodological innovations, language documentation, sociolinguistic implications, and technological tools, all illustrated through a corpus of spoken Kazakh. The panelists will discuss strategies developed to overcome data scarcity and documentation challenges, such as leveraging crowd-sourcing techniques, adapting existing tools for LRLs, and employing community-based participatory research methods to build and annotate corpora. The corpus of spoken Kazakh, which will be available to researchers wishing to explore this new perspective, showcases the successful application of those methods. The papers will present the diverse linguistic phenomena and cultural insights gleaned through corpus-based studies of LRLs, particularly in the domain of spoken language. Projects that rely on the spoken corpus will demonstrate phenomena that often remain unseen when only textual corpora of the literary language are used. Finally, the panel will address the broader sociolinguistic implications of spoken corpus work on LRLs, including language revitalization efforts, linguistic rights advocacy, and community empowerment initiatives. By actively engaging with local communities and stakeholders, researchers can ensure that their work contributes meaningfully to preserving and promoting linguistic diversity, going beyond what the existing textual corpora of literary Kazakh can offer.

Overall, this panel seeks to showcase the transformative potential of corpus linguistics in advancing our understanding of low-resource languages in Central Asia, while also advocating for ethical and inclusive research practices that center the voices and agency of language speakers. Through collaboration, innovation, and mutual respect, we can collectively work towards a more equitable and linguistically diverse world.

Accepted papers

Session 1 Thursday 6 June, 2024, 13:00-14:45

Introduction to MCSKL, the first multimedia corpus of spoken Kazakh language: challenges and possibilities.

Nikolay Mikhailov (Nazarbayev University), Giorgia Troiani (UC Santa Barbara), Andrey Filchenko (Nazarbayev University)

Abstract

Despite the rich cultural heritage of the Kazakh language, there is a significant gap in its linguistic resources, particularly in the domain of spoken corpora. This presentation introduces MultiCorSKL, a spoken corpus of the Kazakh language. Although a number of corpus projects declare spoken components, MultiCorSKL is the first corpus to focus on genuine, naturally occurring spoken discourse, addressing a critical need for such data and analysis in computational linguistics and in the preservation of language diversity.

The project created a comprehensive spoken corpus of modern Kazakh, relying on a team of trained linguists as well as innovative crowd-sourcing techniques to work with organic interactional language data. Beyond collecting representative spoken language data, we sought to implement reliable methods for annotating conversational data, so that it can be used to train automatic speech recognition (ASR) models, something that has not been done with great success up to this point.

Through the crowd-sourcing approach, we engaged speakers of Kazakh across diverse regions and social strata to contribute to the corpus. The collected data were annotated at various levels, from simple orthographic transcription to deep annotation comprising Intonation Unit segmentation, Discourse Functional Transcription, IPA notation, and morphemic glossing. All annotation is integrated within ELAN, enabling detailed and accurate linguistic analysis.

As of the most recent audit, the corpus contains 180 hours of recorded data, of which 80 hours are annotated, and the data volume is expected to grow significantly through crowd-sourced recording and automation of annotation. Conversational ASR, an important element of the workflow, used the original project data, segmented into intonation units and yielding approximately 30,000 files, to train and test the models.
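As an illustration only, dividing the intonation-unit files into training and test sets could be sketched as below. The file names, the 10% test fraction, and the fixed seed are assumptions made for this sketch; the abstract does not state how the project actually partitioned its data.

```python
import random


def split_files(file_names, test_fraction=0.1, seed=0):
    """Shuffle file names reproducibly and return (train, test) lists.

    test_fraction and seed are illustrative defaults, not project settings.
    """
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    # Sort first so the result does not depend on the input order.
    shuffled = sorted(file_names)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical names standing in for the roughly 30,000 intonation-unit clips.
files = [f"iu_{i:05d}.wav" for i in range(30000)]
train, test = split_files(files)
print(len(train), len(test))
```

A file-level split like this is the simplest option. Splitting by speaker or by recording session instead would keep the same voice out of both sets, which is often preferred when evaluating ASR models.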

The development of MultiCorSKL represents a landmark achievement in the study of low-resource languages. It not only enables advanced research in computational and corpus linguistics but also holds profound implications for cultural preservation and technological advancement in Kazakhstan. Various disciplines, such as sociolinguistics, cognitive linguistics, conversation analysis, and comparative morphosyntax, among others, will benefit from access to spoken corpus data, which provide reliable empirical insights into spontaneous, day-to-day language use.

Future work will focus on expanding the corpus and refining ASR models to enhance linguistic accessibility and digital inclusivity. We invite researchers and technologists to explore the corpus and contribute to its expansion. Collaboration is essential to harness the full potential of this unique resource for linguistic research in various areas and technological innovation.

“Qa baram!”: The role of phonological reduction in shaping negative language attitudes towards Western Kazakh speakers

Zhansaya Berik (Nazarbayev University)

Abstract

The speech of people from the Western regions of Kazakhstan is often perceived negatively. Speakers from other regions report that Western Kazakh (WK) is difficult to understand because it is spoken faster than other varieties of Kazakh (Bizhanova, 2022). Possibly for this reason, WK is perceived as rude.

A pilot investigation revealed that, although WK speakers' speech rate is not the fastest, they employ phonological reduction more often. Specifically, Western speakers phonologically reduce verbs 50% more often than North-Eastern Kazakh speakers. This finding suggests that phonological reduction may explain why WK speakers are perceived as fast speakers.

It remains to be understood how listeners extend the perception of this phonological feature to a moral assessment of WK speakers. Iconization is a process through which language users infuse structural linguistic elements with symbolic significance, often reflecting cultural values or moral judgments (Irvine & Gal, 2000). In this study, I will show the steps through which phonological reduction is iconized into evaluations of rudeness in the case of WK.

To address this question, I conducted a matched-guise experiment. Participants from different regions of Kazakhstan listened to an audio sample featuring a WK speaker without being told the dialect and had to identify it based solely on the audio. The findings indicate that half of the participants correctly identified the dialect as WK, while the rest identified it as Southern. This suggests that the dialect is not readily distinguishable to native speakers of Kazakh unless they are informed that the speaker is from the Western region and, consequently, speaks WK. The outcome of this task prompts an inquiry into the nature of the comments it evoked.

I then analyze the participants' responses to determine whether they perceive WK negatively and, if so, why. Participants indicate that the negative perceptions may stem from cultural beliefs associating the Kishi Juz, the inhabitants of the Western region, with traits such as cruelty and aggression, owing to their historical reputation as warriors. Moreover, every participant agrees that the frequent reduction of verbs hampers their comprehension, which irritates them. These results indicate that the negative attitudes towards WK are primarily ideological constructs, and that WK speech may not be perceived negatively when listeners lack information about the speaker's origin.

In conclusion, the results suggest that WK speakers' frequent phonological reduction may contribute to shaping negative attitudes towards WK.

Morphosyntactic Accommodation of Chinese Content Words in Kazakh: The Case of Kazakh-Chinese Language Contact

Wulaer Nuerlan (Nazarbayev University)

Abstract

Due to political, social, and economic changes, there is close contact today between Chinese and the minority languages spoken in China. This study explores how Chinese content words are accommodated morphosyntactically in Kazakh in the context of Kazakh-Chinese language contact in Xinjiang. Although borrowing strategies have been studied for many languages of the world, there is a gap in the literature regarding spoken Kazakh in Xinjiang and Chinese lexical borrowing in Kazakh. The data consist of recordings I collected in Xinjiang as a research assistant in the Multimedia Corpus of Modern Spoken Kazakh Language project. An analysis of 6.8 hours of conversational data yielded 133 inserted verbs, 21 adjectives, and 7 adverbs. Moreover, 250 nouns were found in 3.5 hours of data, which could be extrapolated to over 450 per 7 hours of conversational data.

First, the strategies for accommodating Chinese verbs were identified according to the typology of verbal borrowings of Jan Wohlgemuth (2009). The results show four different strategies: light verb strategy, direct insertion, indirect insertion, and paradigm insertion. The light verb strategy is the most attested in Kazakh, which agrees with Wohlgemuth’s (2009) finding that “language with the dependent-head (OV) orientation strongly prefers the Light Verb Strategy” (p. 203). The light verbs used in Kazakh are qyl- ‘to do’ and bol- ‘to be’: bol- ‘to be’ is used when the subject of the verb does not have an agentive role and is affected by the action, and qyl- ‘to do’ is used when the subject is agentive and the action is voluntary. In direct insertion, Chinese verbs are inserted into the Kazakh sentence without any morphological accommodation, whereas in indirect insertion, Kazakh verbalizing affixes are added to monosyllabic Chinese verbs, as in pī-le ‘to approve’ and zū-la ‘to rent.’ Paradigm insertion allows speakers to use the Chinese perfective aspect marker le together with the Chinese verb.

Second, since adjectives in Mandarin are regarded as stative verbs (Li & Thompson, 1989), adjectives are accommodated in two different ways: the light verb strategy and direct insertion into the position of Kazakh adjectives. Finally, nouns and adverbs are integrated into Kazakh just like native words, without any accommodation strategies. Overall, this study provides an overview of the accommodation of Chinese content words in the Kazakh language.
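The noun extrapolation mentioned in this abstract can be checked with a short calculation. This is a minimal sketch that assumes nouns occur at a constant rate across the recordings, a simplification the abstract does not itself state; the function name is introduced here for illustration.

```python
def extrapolate_count(observed_count, observed_hours, target_hours):
    """Scale an observed count linearly to a different duration.

    Assumes a constant rate of occurrence per hour (illustrative assumption).
    """
    if observed_hours <= 0:
        raise ValueError("observed_hours must be positive")
    return observed_count * target_hours / observed_hours


# 250 Chinese nouns observed in 3.5 hours, scaled to 7 hours.
nouns_per_7h = extrapolate_count(250, 3.5, 7)
print(nouns_per_7h)
```

Under that constant-rate assumption the estimate comes to about 500 nouns per 7 hours, which is consistent with the abstract's more cautious figure of "over 450".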