Accepted Paper

Image Retrieval in Japanese Classical Documents using Deep Learning Method  
Satoru Fujita (Hosei University), Keizo Oyama (National Institute of Japanese Literature), Shin'ichi Satoh


Paper short abstract

Japanese classical documents often contain numerous illustrations embedded in the margins. This paper presents a deep learning–based method that enables efficient retrieval of such illustrations using natural language queries. The method supports advanced historical research in the humanities.

Paper long abstract

Japanese classical documents often contain numerous illustrations embedded in the margins or across entire pages. While a few of these illustrations are well known, the majority remain largely unexplored. This paper presents a deep learning–based method for efficiently retrieving such illustrations from large-scale digital libraries using natural language queries. Our method employs CLIP (Contrastive Language–Image Pre-training), which learns joint text–image feature representations and enables users to retrieve relevant images based on natural language descriptions.
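The core retrieval step described above can be sketched as follows. This is a minimal illustration with toy pre-computed embeddings standing in for actual CLIP outputs; the array sizes and names are assumptions for demonstration, not details from the paper. In a real system, each page image and the text query would be encoded with the same CLIP model.

```python
import numpy as np

def normalize(v):
    # L2-normalize rows so dot products equal cosine similarity
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Toy stand-ins for CLIP embeddings (hypothetical sizes):
# a real system would encode library images and the query text with CLIP.
rng = np.random.default_rng(0)
image_embeddings = normalize(rng.normal(size=(1000, 512)))  # 1000 library images
query_embedding = normalize(rng.normal(size=(512,)))        # encoded text query

# Rank all images by cosine similarity to the query; keep the top 5
scores = image_embeddings @ query_embedding
top5 = np.argsort(scores)[::-1][:5]
```

Because both text and images live in the same embedding space, a natural-language description like "a partitioning curtain" can be scored directly against every image in the collection.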

Several contributors provide CLIP models trained on Japanese texts and images; however, applying these models to Japanese classical documents requires additional adaptation. First, we fine-tuned the model to better recognize classical Japanese terms, including "kichou," a type of partitioning curtain, and "shitomi," a type of gate board, both of which are frequently depicted in historical materials. Second, to address CLIP's difficulty in detecting small objects within high-resolution page images, we implemented a preprocessing step that identifies small items, such as dishes or instruments, and registers them as individual sub-images. Because each original page yields multiple sub-images, this process significantly increases the total number of images, so we further designed a fast similarity computation method to maintain interactive retrieval performance. In addition, we introduce a method for refining retrieved images through additional conditional queries, including color-related constraints. In this case, each query is represented as a combination of a reference image and natural-language modifiers.
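The refinement step and the fast similarity computation can be sketched together as below. The weighted blending of the reference-image embedding with the modifier-text embedding is one common approach, shown here as an assumption rather than the paper's exact formulation; the names, sizes, and the `alpha` weight are illustrative.

```python
import numpy as np

def normalize(v):
    # L2-normalize so dot products equal cosine similarity
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

rng = np.random.default_rng(1)
# Toy stand-ins for CLIP embeddings of sub-images cropped from pages
sub_image_embs = normalize(rng.normal(size=(5000, 512)))

ref_image_emb = normalize(rng.normal(size=(512,)))  # reference image from a first search
modifier_emb = normalize(rng.normal(size=(512,)))   # e.g. text like "the red one"

# Blend the two query components; alpha is a tunable assumption here
alpha = 0.5
query = normalize(alpha * ref_image_emb + (1 - alpha) * modifier_emb)

# With all embeddings pre-normalized, a single matrix product scores
# the whole sub-image collection at once, keeping retrieval interactive.
scores = sub_image_embs @ query
topk = np.argsort(scores)[::-1][:10]
```

Pre-normalizing and batching the collection into one matrix reduces each refinement query to a single matrix-vector product, which is what keeps response times interactive even after sub-image extraction multiplies the collection size.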

The proposed approach enables users to rapidly discover illustrations of various sizes, styles, and colors across extensive digital libraries. It supports intuitive exploration of classical materials and contributes to advanced historical and humanities research.

Panel T0132
Data-Driven Humanities in Japan: From AI Infrastructure to Computational Literary Analysis