Accepted Paper

From Japanese NLP to Endangered Language Documentation: Automatic Gloss Generation as Digital Infrastructure
Nanami Hamada (Kyushu University)

Paper short abstract

This paper presents an automatic interlinear gloss generation framework built on existing Japanese NLP tools and developed as digital infrastructure for documenting endangered languages of the Japonic language family.

Paper long abstract

This paper proposes an automatic interlinear gloss generation framework designed as a digital humanities infrastructure for endangered language documentation. The long-term goal of this project is to support community-centered documentation of endangered Japonic languages, particularly Ryukyuan varieties, by reducing the technical barriers associated with manual glossing.

As a foundational step, we implement and evaluate automatic gloss generation for Standard Japanese using existing Japanese natural language processing tools. Rather than developing a new morphological analyzer, the proposed system repurposes established NLP outputs (part-of-speech tags, dependency relations, and morphological features) and converts them into linguistically interpretable interlinear glosses following the Leipzig Glossing Rules. The framework distinguishes roots, clitics, and affixes, and generates gloss labels such as =TOP, =POL.NPST, and -PST without relying on predefined dictionaries.
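To make the conversion step concrete, the following is a minimal sketch of how tagger output might be mapped to Leipzig-style glosses. It is an illustration under stated assumptions, not the system described in this paper: the `Token` class, the `RULES` table, the UD-style tag and feature names, and the hand-supplied root glosses are all hypothetical, and the example sentence is pre-segmented by hand rather than produced by a real analyzer.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    """One morph as a tagger might emit it (hypothetical structure)."""
    form: str                      # romanized surface form
    upos: str                      # UD-style part-of-speech tag (assumed tag set)
    feats: dict = field(default_factory=dict)  # morphological features (assumed keys)
    root_gloss: str = ""           # lexical gloss for roots, supplied by hand here


# Hypothetical rule table: (POS tag, (feature, value), boundary marker, gloss label).
# "=" marks a clitic boundary and "-" an affix boundary, as in the Leipzig Glossing Rules.
RULES = [
    ("ADP", ("Case", "Top"), "=", "TOP"),
    ("ADP", ("Case", "Acc"), "=", "ACC"),
    ("AUX", ("Tense", "Past"), "-", "PST"),
]


def gloss_token(tok):
    """Return (boundary, label); tokens matching no rule are treated as roots."""
    for upos, (key, value), boundary, label in RULES:
        if tok.upos == upos and tok.feats.get(key) == value:
            return boundary, label
    # Roots start a new word, so they are separated by a space.
    return " ", tok.root_gloss or "?"


def interlinear(tokens):
    """Build the segmented form line and the matching gloss line."""
    form_line, gloss_line = "", ""
    for tok in tokens:
        boundary, label = gloss_token(tok)
        form_line += boundary + tok.form
        gloss_line += boundary + label
    return form_line.strip(), gloss_line.strip()


# Hand-built example input ("I read a book"), not real analyzer output.
SENTENCE = [
    Token("watashi", "PRON", root_gloss="1SG"),
    Token("wa", "ADP", {"Case": "Top"}),
    Token("hon", "NOUN", root_gloss="book"),
    Token("o", "ADP", {"Case": "Acc"}),
    Token("yon", "VERB", root_gloss="read"),
    Token("da", "AUX", {"Tense": "Past"}),
]

if __name__ == "__main__":
    forms, glosses = interlinear(SENTENCE)
    print(forms)
    print(glosses)
```

In this sketch the grammatical labels and boundary types follow deterministically from the tag and feature pair, while root glosses are filled in by hand. How the actual framework obtains root glosses without predefined dictionaries is not specified in the abstract, so that part is left as a placeholder.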

This approach positions Japanese not as the primary object of study but as a high-resource testbed for constructing reusable documentation infrastructure. By evaluating the system against manually annotated data, we identify both the potential and the limitations of existing NLP tools when applied to humanistic annotation tasks. The results demonstrate that a substantial portion of gloss-level information can be derived deterministically from syntactic and morphological cues, while highlighting challenges in auxiliary sequencing and particle disambiguation.

From a digital humanities perspective, this study reframes interlinear glossing as a form of structured cultural knowledge representation rather than a purely linguistic task. The proposed framework emphasizes transparency, reproducibility, and extensibility, making it suitable for adaptation to endangered language contexts where linguistic expertise and resources are limited. Ultimately, this work contributes a methodological bridge between computational linguistics and documentary practice, advancing digital infrastructures that enable broader participation in the preservation of linguistic and cultural heritage.

Panel INDDIGI001
Interdisciplinary Section: Digital Humanities individual proposals panel
  Session 1