Methodology of computer-assisted spoken language data processing for Kazakh language: case of Multimedia Corpus of Spoken Kazakh Language

Accepted Paper

Nikolay Mikhailov (Nazarbayev University)
Andrey Filchenko (Nazarbayev University)

Paper abstract

In this paper, we present our methodology for working with spoken data from a Central Asian language, showcasing the Multimedia Corpus of Spoken Kazakh Language. The presentation focuses on automating data processing for a representative corpus of spoken language, the issues this raises, and the current state of development. We also outline the general relevance of this approach for languages beyond Kazakhstan.

Building a quality corpus of a spoken language is a challenging task that requires diverse strategies to attain maximal representativeness, along with a correspondingly efficient processing capacity. A corpus of spoken, interactional, naturally occurring language carries great research potential not only for linguistics but also for a range of social science and humanities disciplines, including anthropology, sociology, and history. The interdisciplinary approach to data definition and collection makes the project a high-value contribution to scholarship on Central Eurasia. However, the nature of the data brings several challenges that are important to acknowledge, as they point towards areas of development in theory, methodology, and practical applications.

The first such problem is speech overlap: speakers frequently talk over each other in natural interactional speech, and while humans have adapted to this, machines have not. We will discuss our approaches to this problem and outline the state of the art, its results with regard to naturally occurring conversations, and possible paths to solutions. This problem is particularly salient for the project, as the volume of spoken data required makes it highly impractical to transcribe everything manually.
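As a purely illustrative sketch (not the project's actual pipeline), the overlap problem can be made concrete by locating the stretches of a recording where two or more speakers are active at once. The example assumes that time-stamped speaker turns are already available, for instance from manual annotation or a diarization step; the names `Turn` and `find_overlaps` are hypothetical and introduced here only for illustration.

```python
from typing import Dict, List, NamedTuple, Tuple


class Turn(NamedTuple):
    """One speaker turn; times are in seconds (hypothetical structure)."""
    speaker: str
    start: float
    end: float


def find_overlaps(turns: List[Turn]) -> List[Tuple[float, float, Tuple[str, ...]]]:
    """Return (start, end, speakers) regions where two or more speakers talk at once."""
    events = []
    for t in turns:
        if t.end <= t.start:
            continue  # ignore empty or malformed turns
        events.append((t.start, 1, t.speaker))  # 1 = turn starts
        events.append((t.end, 0, t.speaker))    # 0 = turn ends
    # At equal timestamps, process ends before starts so that turns which
    # merely touch are not reported as overlapping.
    events.sort(key=lambda e: (e[0], e[1]))

    active: Dict[str, int] = {}  # speaker -> number of currently open turns
    overlaps: List[Tuple[float, float, Tuple[str, ...]]] = []
    prev_time = None
    for time, kind, speaker in events:
        if prev_time is not None and len(active) >= 2 and time > prev_time:
            speakers = tuple(sorted(active))
            if overlaps and overlaps[-1][1] == prev_time and overlaps[-1][2] == speakers:
                # Extend the previous region when the same speakers keep overlapping.
                overlaps[-1] = (overlaps[-1][0], time, speakers)
            else:
                overlaps.append((prev_time, time, speakers))
        if kind == 1:
            active[speaker] = active.get(speaker, 0) + 1
        else:
            active[speaker] -= 1
            if active[speaker] == 0:
                del active[speaker]
        prev_time = time
    return overlaps


if __name__ == "__main__":
    demo = [Turn("A", 0.0, 5.0), Turn("B", 4.0, 7.0), Turn("C", 7.0, 9.0)]
    for start, end, speakers in find_overlaps(demo):
        print(f"{start:.1f}-{end:.1f}s: {', '.join(speakers)}")
```

Regions flagged in this way could, for example, be routed to manual transcription while the remaining single-speaker stretches are processed automatically; whether such a split suits a given corpus is a design decision outside the scope of this sketch.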

Given the multilingual nature of Central Asia, code-switching is an expected and prominent phenomenon, with speakers using several languages within the same utterance or segment of discourse. This also poses a challenge for modern speech-to-text software, which is often insufficiently equipped to deal with this otherwise common natural phenomenon. Like the overlap problem, it limits how effectively the project can process data, and we will therefore present the ways in which we plan to address it.

The approaches used within this project are not tied to a particular language, which makes them versatile for anyone interested in studying Central Eurasia from the perspective of natural spoken language and the dominant discourses. We aim to provide a venue for discussion and tools for such research, contributing to the development of scholarship in and beyond Central Eurasia.

Panel LANG01
Theoretical and methodological issues of language data representation in Central Asia
Session 1 Friday 20 October, 2023, 13:30-15:15