T0422: Adapting Artificial Intelligence for Creating Spoken Language Corpora: The Case of Kazakh

Author: Nikolay Mikhailov (Nazarbayev University)

Format: Individual paper

Theme: Language & Linguistics

Abstract

The rapid development of artificial intelligence (AI) technologies often overlooks low-resource languages, potentially leading to "digital language death" and preventing speakers from accessing resources and knowledge. This project aims to counter this trend by developing resources based on natural spoken Kazakh using AI processing, ultimately establishing a replicable workflow that can be applied to other low-resource languages.

A significant hurdle in automated corpus creation is transcribing naturally occurring interactional speech events (NOISE). These speech events are typically messy: signal-to-noise ratios are low, articulation is reduced, speakers talk simultaneously, and code-switching is common. Most existing speech-to-text (STT) models are often trained on read-aloud written prompts rather than conversational data, and so perform poorly on NOISE. To address this, the first step in our workflow is to fine-tune the Whisper STT model specifically for conversational Kazakh. The training data is sourced from the Multimedia Corpus of Spoken Kazakh Language, which contains roughly 70 hours of annotated data. To better reflect authentic speech, the source data is intentionally left noisy, with minimal cleanup. Issues such as varying audio lengths and multi-speaker overlap are managed through data padding, sequential file combination, and neural network-based speaker separation.
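
The padding and sequential combination mentioned above can be illustrated with a minimal sketch. It assumes, which the abstract does not state, that consecutive short recordings are concatenated in order until the next one would overflow Whisper's 30-second input window at 16 kHz, and that the remainder of each window is zero-padded. The function name `pack_segments` and the greedy strategy are hypothetical, and speaker separation is not covered.

```python
from typing import List, Sequence

SAMPLE_RATE = 16_000      # Whisper expects 16 kHz mono audio
WINDOW_SECONDS = 30       # Whisper processes fixed 30-second windows
WINDOW_SAMPLES = SAMPLE_RATE * WINDOW_SECONDS


def pack_segments(
    segments: Sequence[Sequence[float]],
    window: int = WINDOW_SAMPLES,
) -> List[List[float]]:
    """Greedily combine consecutive segments, then zero-pad each window.

    Assumption for illustration only: segments are already in recording
    order and none is longer than one window.
    """
    windows: List[List[float]] = []
    current: List[float] = []
    for seg in segments:
        if len(seg) > window:
            raise ValueError("segment longer than one window; split it first")
        # Close the current window if the next segment would not fit.
        if len(current) + len(seg) > window:
            windows.append(current + [0.0] * (window - len(current)))
            current = []
        current = current + list(seg)
    if current:
        # Pad the final, partially filled window with silence.
        windows.append(current + [0.0] * (window - len(current)))
    return windows


# Toy usage with a tiny window so the behaviour is easy to inspect.
toy = pack_segments([[1.0] * 4, [2.0] * 3, [3.0] * 5], window=8)
```

In the toy call, the first two segments share one window and the third starts a new one. In a real pipeline the matching transcripts would have to be concatenated in the same order.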

The second phase of the workflow moves from standard text transcription to segmentation into Intonation Units (IUs). IUs represent speech more naturally, and align better with human cognitive processes, than transcriptions forced into strict written norms. To achieve this conversion, we are exploring two approaches: processing the audio with a separate model, and an alternative method based on regression analysis. The alternative approach hypothesizes that the deltas of specific speech features at intonation boundaries can serve as accurate predictors for IU segmentation.
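
The regression-based idea can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not name the speech features or the regression type, so the two features here (change in normalized pitch and change in speech rate between adjacent words) and the use of logistic regression are hypothetical, and the numbers are made-up toy values, not corpus data.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(
    X: Sequence[Sequence[float]],
    y: Sequence[int],
    lr: float = 0.5,
    epochs: int = 2000,
) -> Tuple[List[float], float]:
    """Fit weights and bias by plain batch gradient descent."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def is_boundary(w: Sequence[float], b: float, x: Sequence[float]) -> bool:
    """Predict an IU boundary when the estimated probability is >= 0.5."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x))) >= 0.5


# Hypothetical toy deltas: [pitch delta, speech-rate delta] per word gap.
X = [[0.05, 0.02], [-0.03, 0.01], [0.60, -0.40], [0.02, -0.05],
     [0.70, -0.50], [-0.04, 0.03], [0.55, -0.45], [0.01, 0.00]]
y = [0, 0, 1, 0, 1, 0, 1, 0]  # 1 = annotated IU boundary (toy labels)

w, b = fit_logistic(X, y)
predictions = [is_boundary(w, b, x) for x in X]
```

On real data the labels would come from manually segmented recordings, and the model would be evaluated on held-out material rather than on its own training set.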

In the final step, the processed data is integrated into ELAN, and the output is converted into XML files with hierarchical tier structures that a search system can index. Because IUs present challenges for standard search architectures, a flexible Solr-based corpus search system is currently in development to accommodate them. By leveraging available conversational data to generate more annotated data efficiently, this project not only elevates the status of Kazakh but also provides a vital methodological framework for broader linguistic research.
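
A hierarchical tier structure of this kind can be sketched with Python's standard XML library. The element and attribute names below follow the general shape of ELAN's EAF format, but the output is a deliberately simplified subset (no header or linguistic-type definitions), so it should be read as an assumption-laden illustration rather than a file ELAN would open as-is. The tier names `IU` and `Translation`, the function `build_eaf_like`, and the placeholder texts are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Sequence, Tuple

# (start in ms, end in ms, IU text, dependent annotation text)
Unit = Tuple[int, int, str, str]


def build_eaf_like(units: Sequence[Unit]) -> ET.Element:
    """Build a simplified EAF-like tree: a time-aligned parent tier of
    intonation units plus one dependent tier that refers back to it."""
    root = ET.Element("ANNOTATION_DOCUMENT")
    time_order = ET.SubElement(root, "TIME_ORDER")
    iu_tier = ET.SubElement(root, "TIER", TIER_ID="IU")
    child_tier = ET.SubElement(
        root, "TIER", TIER_ID="Translation", PARENT_REF="IU"
    )
    for i, (start, end, text, dependent) in enumerate(units, start=1):
        ts1, ts2 = f"ts{2 * i - 1}", f"ts{2 * i}"
        ET.SubElement(time_order, "TIME_SLOT",
                      TIME_SLOT_ID=ts1, TIME_VALUE=str(start))
        ET.SubElement(time_order, "TIME_SLOT",
                      TIME_SLOT_ID=ts2, TIME_VALUE=str(end))
        # Parent annotation: anchored to the two time slots.
        aligned = ET.SubElement(
            ET.SubElement(iu_tier, "ANNOTATION"), "ALIGNABLE_ANNOTATION",
            ANNOTATION_ID=f"a{i}", TIME_SLOT_REF1=ts1, TIME_SLOT_REF2=ts2,
        )
        ET.SubElement(aligned, "ANNOTATION_VALUE").text = text
        # Child annotation: no times of its own, only a parent reference.
        ref = ET.SubElement(
            ET.SubElement(child_tier, "ANNOTATION"), "REF_ANNOTATION",
            ANNOTATION_ID=f"r{i}", ANNOTATION_REF=f"a{i}",
        )
        ET.SubElement(ref, "ANNOTATION_VALUE").text = dependent
    return root


root = build_eaf_like([
    (0, 1200, "unit one", "gloss one"),
    (1200, 2500, "unit two", "gloss two"),
])
xml_text = ET.tostring(root, encoding="unicode")
```

Because each dependent annotation points at its parent by ID, an indexer can recover the IU-level time span for any hit on a lower tier.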

This paper is intended to be presented at the panel «Voices of Steppe and Taiga - Bridging the Digital Divide: Language Documentation and Resource Development for the Languages of Central and Northern Asia».