Introduction to MCSKL, the first multimedia corpus of spoken Kazakh languages: challenges and possibilities.

Accepted Paper

Nikolay Mikhailov (Nazarbayev University), Giorgia Troiani (UC Santa Barbara), Andrey Filchenko (Nazarbayev University)

Abstract

Despite the rich cultural heritage of the Kazakh language, linguistic resources for it remain scarce, particularly spoken corpora. This presentation introduces MultiCorSKL, a multimedia corpus of spoken Kazakh. Although a number of corpus projects declare spoken components, MultiCorSKL is the first corpus to focus on genuine, naturally occurring spoken discourse, addressing a critical need for such data and analysis in computational linguistics and language diversity preservation.

The project created a comprehensive spoken corpus of the modern Kazakh language, combining a team of trained linguists with innovative crowd-sourcing techniques to work with organic interactional language data. Beyond collecting representative spoken language data, we sought to implement reliable methods for annotating conversational data, facilitating its use in training automatic speech recognition (ASR) models, a task that has so far met with limited success.

Through the crowd-sourcing approach, we engaged speakers of Kazakh across diverse regions and social strata to contribute to the corpus. The collected data were annotated at several levels, from simple orthography to deep annotation: Intonation Unit segmentation, orthographic transcripts, Discourse Functional Transcription, IPA notation, and morphemic glossing. All levels are integrated within ELAN, enabling detailed and accurate linguistic analysis.
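As an illustration of how time-aligned annotation of this kind can be read back out of ELAN, the following is a minimal sketch that extracts one tier from an ELAN annotation file (EAF, an XML format). The element and attribute names follow the EAF format; the tier name "IU", the sample text, and the function name are assumptions made for this example and do not describe the project's actual tier setup.

```python
import xml.etree.ElementTree as ET

# Tiny hand-written sample in EAF layout. The tier name "IU" and the
# annotation values are placeholders, not data from the corpus.
SAMPLE_EAF = """<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="1450"/>
    <TIME_SLOT TIME_SLOT_ID="ts3" TIME_VALUE="2900"/>
  </TIME_ORDER>
  <TIER TIER_ID="IU">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1" TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>first unit</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a2" TIME_SLOT_REF1="ts2" TIME_SLOT_REF2="ts3">
        <ANNOTATION_VALUE>second unit</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""


def read_aligned_tier(eaf_xml, tier_id):
    """Return (start_ms, end_ms, text) tuples for one time-aligned tier."""
    root = ET.fromstring(eaf_xml)
    # Map each time slot id to its offset in milliseconds.
    slots = {
        ts.get("TIME_SLOT_ID"): int(ts.get("TIME_VALUE"))
        for ts in root.iter("TIME_SLOT")
        if ts.get("TIME_VALUE") is not None
    }
    units = []
    for tier in root.iter("TIER"):
        if tier.get("TIER_ID") != tier_id:
            continue
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            start = slots.get(ann.get("TIME_SLOT_REF1"))
            end = slots.get(ann.get("TIME_SLOT_REF2"))
            if start is None or end is None:
                continue  # skip annotations without explicit time alignment
            text = ann.findtext("ANNOTATION_VALUE", default="")
            units.append((start, end, text))
    return sorted(units)


units = read_aligned_tier(SAMPLE_EAF, "IU")
```

Dependent tiers such as glosses are typically stored in EAF as reference annotations pointing at a parent annotation; this sketch covers only directly time-aligned tiers.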

As of the most recent audit, the corpus contains 180 hours of recorded data, of which 80 hours are annotated, and the data volume is expected to grow significantly through crowd-sourced recording and automation of annotation. Conversational ASR, an important element of the workflow, was trained and tested on the original project data, segmented into intonation units, which yielded approximately 30,000 files.
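The abstract does not say how the intonation-unit files were divided between training and testing. One common precaution, sketched below purely as an assumption, is to assign whole recordings to a split so that clips from the same conversation never appear on both sides. The file naming scheme (`<recording>_iu<number>.wav`), the function names, and the 10% test fraction are hypothetical.

```python
import hashlib


def assign_split(recording_id, test_fraction=0.1):
    """Deterministically map a recording id to "train" or "test"."""
    digest = hashlib.sha256(recording_id.encode("utf-8")).digest()
    # Turn the first 8 bytes of the hash into a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "test" if bucket < test_fraction else "train"


def split_clips(clip_names, test_fraction=0.1):
    """Group clip file names by split, keeping each recording together."""
    splits = {"train": [], "test": []}
    for name in clip_names:
        # Hypothetical naming scheme: "<recording>_iu<number>.wav".
        recording_id = name.split("_iu")[0]
        splits[assign_split(recording_id, test_fraction)].append(name)
    return splits


# Placeholder file names, not actual corpus files.
clips = [f"rec{r:03d}_iu{i:04d}.wav" for r in range(20) for i in range(5)]
splits = split_clips(clips)
```

Because the assignment depends only on a hash of the recording id, the split stays stable as new crowd-sourced recordings are added, at the cost of the test share being only approximately the requested fraction.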

The development of MultiCorSKL represents a landmark achievement in the study of low-resource languages. It not only enables advanced research in computational and corpus linguistics but also holds profound implications for cultural preservation and technological advancement in Kazakhstan. Disciplinary domains such as socio- and cognitive linguistics, conversation analysis, and comparative morphosyntax, among others, will benefit from access to spoken corpus data, which provides reliable empirical insights into spontaneous, day-to-day language use.

Future work will focus on expanding the corpus and refining ASR models to enhance linguistic accessibility and digital inclusivity. We invite researchers and technologists to explore the corpus and contribute to its expansion. Collaboration is essential to harness the full potential of this unique resource for both linguistic research and technological innovation.

Panel T08LANG
Exploring Low-Resource Languages through Corpus Work: Challenges, Innovations, and Insights
Session 1 Thursday 6 June, 2024, 13:00-14:45