- Convenor: Nikolay Mikhailov (Nazarbayev University)
- Chair: Sami Honkasalo (University of Helsinki)
- Discussant: Timofey Arkhangelskiy (Universität Hamburg)
- Format: Panel
- Theme: Language & Linguistics
- Location: William Pitt Union (WPU): room 540
- Sessions: Friday 20 October, -
Time zone: America/New_York
Abstract:
Spoken language corpora have become an essential resource for linguistic research and language technology development, with strong potential for use in other disciplines such as anthropology, sociology, and history. In a way, documenting language is documenting life: we express our knowledge of and information about life through language, and the many facets of language make it rich documentary material. Creating a spoken language corpus involves collecting, transcribing, and annotating large amounts of spoken language data, which presents methodological, technical, ethical, and linguistic challenges. While some of these challenges are being resolved as technology develops, others are emerging in connection with data quality and quantity, ease of automated processing, representativeness, and information accessibility.
This panel aims to bring together researchers and practitioners experienced in designing, building, managing, and utilizing spoken corpora of Central Eurasian languages in a variety of projects. The panelists will discuss the state of the art in theory and methodology for spoken language corpus design, the challenges of implementation, and their solutions.
The topics that will be covered in this panel include, but are not limited to:
- Methods for collecting spoken language data, data types, media, equipment, workflows, data sampling methods and storage.
- Transcription and annotation of spoken language data, manual vs. automated, segmentation, orthographic vs. phonetic transcription, annotation schemes and conventions.
- Challenges in creating spoken language corpora, such as data type selection, speaker and genre diversity, regional and social variation, transcription and annotation errors, and ethical considerations.
- Applications of spoken Central Eurasian language corpora in research, language technology development, education, anthropology, psychology, sociology, history.
- Interdisciplinary potentials of spoken language corpora.
The panelists will share experiences, insights, and recommendations based on recent and ongoing projects on spoken Central Eurasian language corpora, and engage in a discussion with the audience on the opportunities and challenges of the field. The panel will be of interest to researchers, practitioners, and students in linguistics, language technology, psychology, anthropology, sociology, education, history and other related fields.
Accepted papers:
Session 1 Friday 20 October, 2023, -
Paper abstract:
In this paper, we present our methodology for working with spoken data from a Central Asian language, showcasing the Multimedia Corpus of Spoken Kazakh Language. The presentation focuses on automating data processing for a representative corpus of spoken language, the issues involved, and ongoing development. We also outline the general relevance of this approach for languages beyond Kazakhstan.
Building a quality corpus of a spoken language is a challenging task that requires diverse strategies to attain maximal representativeness, along with correspondingly efficient processing capacity. Developing a corpus of spoken, interactional, naturally occurring language carries great potential for research not only in linguistics but also in a diverse range of social science and humanities disciplines, including anthropology, sociology, and history. The interdisciplinarity of the approach to data definition and collection makes the project a high-value contribution to the scholarship on Central Eurasia. However, the nature of the data brings about several challenges that are important to acknowledge, as they point towards areas of development in theory, methodology, and practical applications.
The first such problem is speech overlap: speakers talking over each other is frequent in natural interactional speech, and while humans have adapted to it, machines have not. We will discuss our approaches to this problem and outline the state of the art, its results with regard to naturally occurring conversations, and possible paths to solutions. This problem is particularly salient for the project, as the volume of spoken data required makes it highly impractical to transcribe everything manually.
Given the multilingual nature of Central Asia, code-switching is expected to be a prominent phenomenon, with speakers using several languages within the same utterance or segment of discourse. This also poses a challenge for modern speech-to-text software, which is often insufficiently equipped to deal with this otherwise common natural phenomenon. Like the overlap problem, it limits the project's ability to process data effectively, and we will therefore present the ways in which we plan to address it.
The approaches used within this project are not tied to a particular language, which makes them versatile for anyone interested in studying Central Eurasia from the perspective of natural spoken language and the dominant discourses. We aim to provide a venue for discussion and tools for such research, contributing to the development of scholarship in and beyond the Central Eurasian region.
Paper abstract:
For researchers interested in any given language, a corpus of linguistic data that represents the language well is increasingly considered an essential tool. But if no corpus of the desired type exists (for example, no corpus of everyday conversation), the question arises: “Can we build one?” Practically, the next two questions are likely to be, “What are the best practices?” and “How much does it cost?” But first, a more basic question must be answered: “What will the corpus be used for?” Constructing a spoken corpus is expensive and time-consuming, and it is easier to justify if the corpus is built to last, serving the needs of a variety of users far into the future.
The history of modern Kazakhstan is complicated, reflecting its position along the traditional trade routes and migratory pathways of Central Asia, as well as the legacy of Soviet educational policies, and the forced relocation of ethnic groups. The result is a complex linguistic landscape with fully institutionalized multilingualism on a grand scale. The coexistence of multiple official languages and heritage languages gives rise to contrasting definitions of fluency, where the question, “Who is a speaker?” may translate to “Who is to be represented?” In such a context, adopting an ethnographic perspective can help make sense of the linguistic and sociocultural complexity.
In this paper, we present the concept of corpus ethnography, arguing that the most effective representation of a language is one based on representing naturally occurring language use, as it emerges from the intrinsic motivations of language users engaged in the pursuit of their social life. The ethnographic perspective motivates certain design decisions for corpus construction, shaping the preferred methods for recording, transcribing, and representing language in use. We present examples drawn from our experience as contributors to the design and construction of corpora of both high- and low-resource languages, including especially (with a large team from Nazarbayev University) the Multimodal Corpus of Spoken Kazakh Language, as well as the Santa Barbara Corpus of Spoken American English and the Corpus of Sakapultek Maya Narrative and Conversation. We present recordings of Kazakh to show how prioritizing the participants’ own motivations for interaction over purely structural linguistic criteria leads to an organic representation of language in everyday life. Speech genres range from everyday conversation to genres such as ritual that might be excluded from a traditional linguistic corpus.
Paper abstract:
One type of challenge faced by speakers of minoritised and endangered languages in language maintenance and revitalisation efforts is the lack of support for the language on digital platforms. This paper overviews a range of major impediments of this type, and discusses what approaches are available to overcome them, with examples drawn primarily from languages of Central Eurasia. Challenges range from having the language recognised with an ISO 639 code to developing personal voice assistants in the language, and everything in between (localisation, input methods, spell checking, machine translation). Such tools are important because they add to the ease with which a language can be used in digital contexts, which has the potential to increase both the range of uses and the perceptions of the language, both of which are crucial for maintenance and revitalisation. Corpus-based approaches, which leverage existing corpora, and symbolic approaches, whose use is a form of linguistic documentation—as well as hybrid approaches—offer various advantages and disadvantages for implementing these kinds of language technology, but the major impediments lie elsewhere. The position of each modern Turkic language in relation to the availability of and barriers to language technology is surveyed, to identify priorities for future language technology work.