LANG01: Theoretical and methodological issues of language data representation in Central Asia

Convenor: Nikolay Mikhailov (Nazarbayev University)

Chair: Sami Honkasalo (University of Helsinki)

Discussant: Timofey Arkhangelskiy (Universität Hamburg)

Format: Panel

Theme: Language & Linguistics

Location: William Pitt Union (WPU): room 540

Sessions: Friday 20 October, 13:30-15:15
Time zone: America/New_York

Abstract

Spoken language corpora have become an essential resource for linguistic research and language technology development, with strong potential for use in other disciplines such as anthropology, sociology, and history. In a way, documenting language is documenting life: we express our knowledge of and information about life through language, and its many facets make it rich documentation material. Creating a spoken language corpus involves collecting, transcribing, and annotating large amounts of spoken language data, which presents several methodological, technical, ethical, and linguistic challenges. While technological development resolves some of these challenges, others arise in connection with data quality and quantity, ease of automated processing, representativeness, and information accessibility.

This panel aims to bring together researchers and practitioners experienced in designing, building, managing, and utilizing spoken corpora of Central Eurasian languages in a variety of projects. The panelists will discuss the state of the art in theory and methodology for spoken language corpus design, the challenges of implementation, and solutions to them.

The topics that will be covered in this panel include, but are not limited to:

- Methods for collecting spoken language data, data types, media, equipment, workflows, data sampling methods and storage.

- Transcription and annotation of spoken language data, manual vs. automated, segmentation, orthographic vs. phonetic transcription, annotation schemes and conventions.

- Challenges in creating spoken language corpora, such as data type selection, speaker and genre diversity, regional and social variation, transcription and annotation errors, and ethical considerations.

- Applications of spoken Central Eurasian language corpora in research, language technology development, education, anthropology, psychology, sociology, history.

- Interdisciplinary potentials of spoken language corpora.

The panelists will share experiences, insights, and recommendations based on recent and ongoing projects on spoken Central Eurasian language corpora, and engage in a discussion with the audience on the opportunities and challenges of the field. The panel will be of interest to researchers, practitioners, and students in linguistics, language technology, psychology, anthropology, sociology, education, history and other related fields.

Accepted papers

Session 1 Friday 20 October, 2023, 13:30-15:15

Methodology of computer-assisted spoken language data processing for Kazakh language: case of Multimedia Corpus of Spoken Kazakh Language

Nikolay Mikhailov (Nazarbayev University), Andrey Filchenko (Nazarbayev University)

Paper abstract

In this paper, we will present our methodology for working with the spoken data of a Central Asian language, showcasing the Multimedia Corpus of Spoken Kazakh Language. The presentation focuses on data processing automation for a representative corpus of spoken language, its issues, and its development. We will also outline the general relevance of this approach for languages beyond Kazakhstan.

Building a quality corpus of a spoken language is a challenging task that requires diverse strategies to attain maximal representativeness, along with correspondingly efficient processing capacity. Developing a corpus of spoken, interactional, naturally occurring language carries great research potential not only for linguistics but also for a diverse range of social science and humanities disciplines, including anthropology, sociology, and history, among others. The interdisciplinary approach to data definition and collection makes the project a high-value contribution to scholarship on Central Eurasia. However, the nature of the data brings several challenges that are important to acknowledge, as they point towards areas for development in theory, methodology, and practical applications.

The first such problem is speech overlap: speakers talking over each other is frequent in natural interactional speech, and while humans have adapted to it, machines have not. We will discuss our approaches to this problem and outline the state of the art, its results with regard to naturally occurring conversations, and possible paths to solutions. This problem is particularly salient for the project, as the volume of spoken data required makes it highly impractical to transcribe everything manually.

Given the multilingual nature of Central Asia, code-switching is expected to be a prominent phenomenon, with speakers using several languages within the same utterance or segment of discourse. This also poses a challenge for modern speech-to-text software, which is often insufficiently equipped to deal with this otherwise common natural phenomenon. Like the overlap problem, it limits the project's ability to process data efficiently, and we will therefore present the ways in which we plan to address it.

The approaches used within this project are not tied to a particular language, which makes them versatile for anyone interested in studying Central Eurasia from the perspective of natural spoken language and the dominant discourses. We aim to provide a venue for discussion and tools for such research, contributing to the development of scholarship in and beyond the Central Eurasian region.

What does a corpus represent? An ethnographic perspective on corpus design for Kazakh

Giorgia Troiani (UC Santa Barbara)

Paper abstract

For researchers interested in any given language, a corpus of linguistic data that represents the language well is increasingly considered an essential tool. But if no corpus of the desired type exists (for example, no corpus of everyday conversation), the question arises: “Can we build one?” Practically, the next two questions are likely to be, “What are the best practices?” and “How much does it cost?” But first, a more basic question must be answered: “What will the corpus be used for?” Constructing a spoken corpus is expensive and time-consuming, and it is easier to justify if the corpus is built to last, serving the needs of a variety of users far into the future.

The history of modern Kazakhstan is complicated, reflecting its position along the traditional trade routes and migratory pathways of Central Asia, as well as the legacy of Soviet educational policies, and the forced relocation of ethnic groups. The result is a complex linguistic landscape with fully institutionalized multilingualism on a grand scale. The coexistence of multiple official languages and heritage languages gives rise to contrasting definitions of fluency, where the question, “Who is a speaker?” may translate to “Who is to be represented?” In such a context, adopting an ethnographic perspective can help make sense of the linguistic and sociocultural complexity.

In this paper, we present the concept of corpus ethnography, arguing that the most effective representation of a language is one based on representing naturally occurring language use, as it emerges from the intrinsic motivations of language users engaged in the pursuit of their social life. The ethnographic perspective motivates certain design decisions for corpus construction, shaping the preferred methods for recording, transcribing, and representing language in use. We present examples drawn from our experience as contributors to the design and construction of corpora of both high- and low-resource languages, including especially (with a large team from Nazarbayev University) the Multimodal Corpus of Spoken Kazakh Language, as well as the Santa Barbara Corpus of Spoken American English and the Corpus of Sakapultek Maya Narrative and Conversation. We present recordings of Kazakh to show how prioritizing the participants’ own motivations for interaction over purely structural linguistic criteria leads to an organic representation of language in everyday life. Speech genres range from everyday conversation to genres such as ritual that might be excluded from a traditional linguistic corpus.

Supporting minoritised languages through language technology: what's needed (for Turkic) and why bother?

Jonathan Washington (Swarthmore College)