Building Specialized Japanese Language Corpora for Tourism: An AI-Augmented Approach to Foreign Language Learning

Accepted Paper

Irena Srdanović (Juraj Dobrila University of Pula); Dražen Brščić (Kyoto University)

Paper short abstract

This presentation introduces a specialized Japanese corpus for Croatian tourism, based on authentic recordings of Japanese-speaking guides and translators across Croatia. It outlines corpus design, transcription, keyword extraction, and planned use in an AI-augmented VIRAI educational application.

Paper long abstract

This presentation introduces the development of JaTGuideCro-ja, a specialized Japanese language corpus focused on tourism in Croatia, and its application in terminology extraction, lexical analysis, and AI-augmented corpus linguistics. The corpus is based on authentic audio and video recordings of guided tours conducted in Japanese at multiple Croatian destinations.

The data were collected between 2019 and 2023 during simulated on-site tours and virtual tours, organized as practical training to prepare Japanese language students for their future professions. Licensed tourist guides and translators conducted the tours in Japanese at key cultural and historical locations across Croatia, including Istria, Dalmatia, Zagreb, and surrounding regions. In total, approximately 30 hours of material were recorded from six on-site and six virtual tours, covering both tangible and intangible cultural heritage.

The methodology involved several transcription and analysis steps. Because the recordings contained speech in multiple languages, speaker diarization was first performed using the open-source toolkit pyannote.audio (Bredin et al., 2020), followed by automatic speech recognition (ASR) using OpenAI's Whisper (Radford et al., 2023). Selected recordings from Istrian locations (Pula, Opatija, Barban, and surrounding towns) were then morphologically analyzed using the Japanese language tools in Sketch Engine (Kilgarriff et al., 2004; Srdanović et al., 2008). The resulting Japanese language corpus supported corpus comparison and keyword extraction using frequency-based and statistical analyses.
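One step the abstract does not spell out is how diarization output and ASR output are combined. A minimal sketch of one common approach follows: each transcribed segment is given the speaker whose diarization turn overlaps it most in time. The `Turn` and `Segment` types, the speaker labels, the timestamps, and the example utterances are all invented for illustration. This is not the authors' pipeline and it does not call pyannote.audio or Whisper.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    """A diarization turn: who spoke, and when (seconds). Hypothetical type."""
    start: float
    end: float
    speaker: str


@dataclass
class Segment:
    """An ASR segment: what was said, and when (seconds). Hypothetical type."""
    start: float
    end: float
    text: str


def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(turns, segments):
    """Label each ASR segment with the speaker of the most-overlapping turn.

    Segments that overlap no turn at all are labelled "UNKNOWN".
    """
    labelled = []
    for seg in segments:
        best_speaker, best_overlap = "UNKNOWN", 0.0
        for turn in turns:
            shared = overlap(seg.start, seg.end, turn.start, turn.end)
            if shared > best_overlap:
                best_speaker, best_overlap = turn.speaker, shared
        labelled.append((best_speaker, seg.text))
    return labelled


# Invented example data: two speakers, four transcribed segments.
turns = [
    Turn(0.0, 12.5, "SPEAKER_00"),
    Turn(12.5, 15.0, "SPEAKER_01"),
]
segments = [
    Segment(0.4, 6.0, "こちらはプーラの円形闘技場です。"),
    Segment(6.2, 12.0, "ローマ時代に建てられました。"),
    Segment(12.8, 14.6, "Hvala lijepa!"),
    Segment(20.0, 21.0, "(background noise)"),
]

labelled = assign_speakers(turns, segments)
for speaker, text in labelled:
    print(speaker, text)
```

Once segments carry speaker labels, those belonging to the Japanese-speaking guide can be kept for the corpus and the rest set aside. How that filtering was actually done is not stated in the abstract.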

A comparative analysis with the large-scale Japanese web corpus JaTenTen11 identified three vocabulary categories: specialized terms absent from general corpora but essential in the Croatian tourism context; terms shared by both corpora but used with domain-specific meanings; and general-purpose vocabulary. Lemma extraction also revealed limitations in existing Japanese morphological analyzers, particularly regarding place names, culture-specific terminology, and non-Japan-centered lexicon.
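To make the corpus comparison concrete, the sketch below ranks lemmas by a keyness score of the "simple maths" kind associated with Sketch Engine: frequencies are normalized per million tokens and a smoothing constant is added to both before taking the ratio. The abstract does not say which statistic was used, so treating it as this one is an assumption, and all counts and corpus sizes are toy values, not figures from JaTGuideCro-ja or JaTenTen11.

```python
def per_million(count, corpus_size):
    """Relative frequency of an item, per million tokens."""
    return count * 1_000_000 / corpus_size


def keyness(focus_count, focus_size, ref_count, ref_size, smoothing=1.0):
    """Smoothed ratio of per-million frequencies (focus vs. reference).

    The smoothing constant avoids division by zero for items that are
    absent from the reference corpus; larger values favor more frequent
    items over rare ones.
    """
    focus_fpm = per_million(focus_count, focus_size)
    ref_fpm = per_million(ref_count, ref_size)
    return (focus_fpm + smoothing) / (ref_fpm + smoothing)


# Toy corpus sizes in tokens (assumptions, not the real corpus sizes).
FOCUS_SIZE = 100_000
REF_SIZE = 10_000_000

# Toy counts per lemma: (count in focus corpus, count in reference corpus).
counts = {
    "イストラ": (30, 0),        # absent from the reference corpus
    "円形闘技場": (50, 20),     # present in both, far more frequent in focus
    "こと": (800, 80_000),      # general-purpose vocabulary
}

scores = {
    lemma: keyness(f, FOCUS_SIZE, r, REF_SIZE)
    for lemma, (f, r) in counts.items()
}
ranked = sorted(scores, key=scores.get, reverse=True)

# Lemmas with a zero reference count are candidates for the first category.
absent_from_reference = [lemma for lemma, (_, r) in counts.items() if r == 0]

for lemma in ranked:
    print(f"{lemma}\t{scores[lemma]:.1f}")
```

A frequency ratio can surface candidates for the first and third categories, but it cannot detect the second one: deciding that a shared term carries a domain-specific meaning still requires inspecting its contexts, for example through concordances.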

The corpus forms the basis for an AI-augmented Japanese language learning platform under development within the VIRAI project. A prototype language tutor and cultural tour application, tested using content related to the city of Pula and the Arena, its Roman-period amphitheater, received positive feedback for usability and educational potential (Srdanović et al., 2025). Overall, the research demonstrates the value of specialized corpora, combined with AI-enhanced tools, for language learning, professional training, and applied linguistic research.

Panel Dh01
Interdisciplinary Section: Digital Humanities individual proposals panel
Session 1 Friday 28 August, 2026, 16:00-17:30