Construction and Publication of the Kundoku Text Corpus of the Kōzanjibon Koōrai and the Owari no kuni Gebumi

Accepted Paper

Zifan Wu (SOKENDAI), Hisashi Yamamoto, Toshinobu Ogiso (National Institute for Japanese Language and Linguistics)

Paper short abstract

This paper introduces a corpus based on two Waka-kanbun sources, the Kōzanjibon Koōrai and the Owari no kuni Gebumi. Their kunten annotations make it possible to establish reliable readings and to build the corpus as Japanese text. The corpus is encoded in XML with morphological annotation and will be released as “CHJ Wakan Konkōbun”.

Paper long abstract

This presentation reports on the construction and publication of a corpus of kundoku texts based on two historical sources: the Kōzanjibon Koōrai and the Owari no kuni Gebumi. Both are written in Waka-kanbun (Japanized classical Chinese) and are valuable for research on the history of Japanese written styles. However, relatively few Waka-kanbun materials in existing corpora are organized in a form that is readily usable for sustained philological and linguistic analysis.

To address this gap, our project is building a corpus of the Insei-period (c. 11th-12th centuries) annotated version of the Kōzanjibon Koōrai and the CE 1325 annotated version of the Owari no kuni Gebumi. Although the base texts themselves are in Waka-kanbun, they preserve rich reading information through kunten marks from the Insei and Kamakura periods (c. 12th-14th centuries), respectively. This makes it possible to establish readings for most passages and to build a corpus of Japanese text from them.

The Kōzanjibon Koōrai, preserved in Kōzanji Temple’s repository, is a collection of model letters reflecting the lives of aristocrats and officials in the late Heian period (c. 10th-12th centuries). The manuscript includes katakana kunten added in a consistent hand by an unknown compiler. The Owari no kuni Gebumi is a legal petition issued in CE 988. Because the original document no longer survives, the corpus adopts the Shinfukuji Hōshōin manuscript as its base text and uses that manuscript’s kundoku reading as the corpus text.

The corpus encodes these readings as structured XML, explicitly tagging kunten-related information and readings or emendations supplied by scholars. Where readings are ambiguous or orthography varies, we follow previous scholarship and apply the normalization necessary for corpus use. We also add morphological annotation following the standards of the National Institute for Japanese Language and Linguistics’ Corpus of Historical Japanese (CHJ); automatic analyses are manually reviewed and corrected. We plan to release the corpus as the sub-corpus “CHJ Wakan Konkōbun”, enabling searches that combine morphology with metadata. We expect this corpus to support comparison with related CHJ materials and to contribute to future research in the history of the Japanese language.
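
As a rough illustration of how an XML encoding can keep readings attested by kunten separate from readings supplied by scholars, the following Python sketch parses a small fragment. It is not the project's schema: the element name `w`, the attributes `base`, `reading` and `source`, and the sample text itself are all assumptions invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment. The tag and attribute names are illustrative only,
# and the sample text is invented, not quoted from either manuscript.
SAMPLE = """
<sentence id="s1">
  <w base="謹" reading="ツツシミテ" source="kunten"/>
  <w base="言" reading="マウス" source="supplied"/>
</sentence>
"""


def readings_by_source(xml_text):
    """Group readings by where they come from (kunten vs. supplied)."""
    root = ET.fromstring(xml_text)
    grouped = {}
    for w in root.iter("w"):
        grouped.setdefault(w.get("source"), []).append(w.get("reading"))
    return grouped


def kundoku_text(xml_text):
    """Concatenate the readings in document order to get the Japanese text."""
    root = ET.fromstring(xml_text)
    return "".join(w.get("reading") for w in root.iter("w"))


if __name__ == "__main__":
    print(readings_by_source(SAMPLE))
    print(kundoku_text(SAMPLE))
```

Recording the source of each reading as an attribute, as assumed here, would let a user restrict a search to readings directly attested by kunten or include those supplied by scholars.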

Panel Ling09
Historical corpora
Session 1 Sunday 30 August, 2026, 14:00-15:30