The design and characteristics of the Man'yoshu corpus

Accepted Paper

Tomoaki Kono (The University of Tokyo)

Paper short abstract

Man'yoshu is a collection of Japanese poetry compiled in the eighth century. The anthology includes contemporary dialects, and the original texts are written in kanji characters. We designed and are constructing a corpus of Man'yoshu that enables researchers to study these features.

Paper long abstract

The National Institute for Japanese Language and Linguistics is constructing an annotated diachronic corpus of the Japanese language. As part of this work, we designed and are constructing the corpus of Man'yoshu.

Man'yoshu is a collection of poems compiled in the late eighth century. Its composers range from emperors to peasants and soldiers. The anthology contains 4,500 poems in 20 volumes, and the total number of words is about 100,000. Some volumes represent contemporary dialects, and the regions from which the authors originated are noted, which enhances the value of Man'yoshu as a linguistic resource. Because Japan had no script of its own at the time, the poems were written in kanji borrowed to represent the Japanese language.

Our Man'yoshu corpus has the following four characteristics. First, information on the composer and volume is attached to each poem, which is useful for studying expressions peculiar to a composer or a volume. Second, the original (kanji) characters are aligned with the transcribed words. This enables researchers to study how each composer used kanji. Third, the text of our corpus is divided into two types of lexical items: short unit words and long unit words. Short unit words are determined by the combinatory patterns of morphemes and are intended for searching for examples. Long unit words are based on phrases and are intended for examining linguistic properties. Both types are provided so that researchers can choose between them according to their purposes. Fourth, morphological information is provided for all of the texts: headwords, part-of-speech classifications, conjugation types, etc. Each lexical item in the corpus is linked to a corresponding entry in an electronic dictionary called UniDic. UniDic entries have a hierarchical structure consisting of three levels: the lemma, form and orthographic levels. The lemma is like the headword of a general dictionary and is the highest level of the hierarchy. The form level distinguishes different forms and conjugation types, while the orthographic level distinguishes variant spellings. Thus the corpus allows researchers to study the variation of dialectal forms or phonologically changed forms in Man'yoshu.
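The annotation layers described above can be illustrated with a minimal Python sketch. It is not the corpus's actual schema or the UniDic format: the class names, field names and the placeholder values (such as "LEMMA_X" and "composer_A") are all hypothetical, chosen only to show how forms might be grouped under a shared lemma across volumes.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """One lexical item with hypothetical field names for the three levels."""
    lemma: str     # lemma level: headword, top of the hierarchy
    form: str      # form level: distinguishes forms and conjugation types
    orth: str      # orthographic level: distinguishes variant spellings
    original: str  # original kanji string aligned with this item
    pos: str       # part-of-speech classification


@dataclass(frozen=True)
class Poem:
    """A poem carrying the composer and volume metadata."""
    volume: int
    composer: str
    tokens: tuple


def forms_by_lemma(poems):
    """Collect, for each lemma, the (form, volume) pairs in which it occurs."""
    variants = defaultdict(set)
    for poem in poems:
        for tok in poem.tokens:
            variants[tok.lemma].add((tok.form, poem.volume))
    return {lemma: sorted(pairs) for lemma, pairs in variants.items()}


# Placeholder data only; these are not real corpus entries.
poems = [
    Poem(1, "composer_A",
         (Token("LEMMA_X", "form_x1", "orth_1", "kanji_1", "noun"),)),
    Poem(2, "composer_B",
         (Token("LEMMA_X", "form_x2", "orth_2", "kanji_2", "noun"),)),
]

result = forms_by_lemma(poems)
```

Grouping by lemma in this way brings together variant forms of the same word, which is the kind of query that linking each item to a lemma-level entry makes possible.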

Panel S2_02
Construction and utilisation of the corpus of historical Japanese: Man'yōshū and Christian materials
Session 1