T0144


From BCCWJ to BCCWJ2: Building the Next Generation Balanced Corpus of Contemporary Japanese 
Convenors:
Toshinobu Ogiso (National Institute for Japanese Language and Linguistics)
Makoto Yamazaki (National Institute for Japanese Language and Linguistics)
Discussant:
Irena Srdanović (Juraj Dobrila University of Pula)
Format:
Panel
Section:
Language and Linguistics

Short Abstract

This panel introduces BCCWJ2, an extension of the Balanced Corpus of Contemporary Written Japanese, through its design principles, current implementation and demonstration, lexical analyses based on the newly constructed data, and exploratory analyses of SNS data planned for future inclusion.

Long Abstract

This panel presents BCCWJ2, an extension of the Balanced Corpus of Contemporary Written Japanese (BCCWJ), which has served as a core reference corpus for research on modern Japanese since its initial release. BCCWJ2 aims to expand the corpus from 100 million to approximately 200 million words by adding large-scale book data published between 2006 and 2025, thereby extending its temporal coverage while preserving continuity with the design principles of the original BCCWJ.

The panel brings together researchers directly involved in the development and analysis of BCCWJ2 to discuss its design, current implementation, and future directions. It opens with a brief overview situating BCCWJ2 within the broader context of Japanese corpus development, followed by three main presentations.

The first presentation focuses on the design of BCCWJ2, addressing key issues such as sampling strategies, genre balance, metadata structure, and annotation policies. It also includes a demonstration of the portion of the corpus already constructed, which is scheduled for public release via Chunagon, allowing users to explore the data in continuity with the original BCCWJ.

The second presentation examines the lexical characteristics of BCCWJ2, comparing vocabulary usage in the newly added data with that of earlier periods represented in BCCWJ. Special attention is paid to lexical items that have emerged or become prominent since 2006, based on a comparison between the original BCCWJ and the 2006–2010 portion of BCCWJ2, illustrating the potential of the extended corpus for studying recent lexical change.
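
As an illustration of the kind of comparison described above, the following is a minimal sketch of a frequency-based keyness test between an older and a newer subcorpus. The abstract does not state which measure the presenters use; the log-likelihood (G2) statistic here is an assumption chosen because it is common in corpus linguistics. The function names, the placeholder words, and all counts are hypothetical and are not taken from BCCWJ or BCCWJ2.

```python
import math


def log_likelihood(freq_a, total_a, freq_b, total_b):
    """Log-likelihood (G2) for one word observed in two corpora."""
    combined = freq_a + freq_b
    grand_total = total_a + total_b
    # Expected frequencies if the word were equally likely in both corpora.
    expected_a = total_a * combined / grand_total
    expected_b = total_b * combined / grand_total
    g2 = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2.0 * g2


def per_million(freq, total):
    """Relative frequency normalised to one million tokens."""
    return freq * 1_000_000 / total


def rising_words(old_counts, old_total, new_counts, new_total, threshold=3.84):
    """Words that are relatively more frequent in the newer corpus.

    Returns (word, G2) pairs sorted by G2, highest first. The default
    threshold 3.84 is the chi-square critical value for p < 0.05, 1 d.f.
    """
    results = []
    for word in set(old_counts) | set(new_counts):
        f_old = old_counts.get(word, 0)
        f_new = new_counts.get(word, 0)
        g2 = log_likelihood(f_old, old_total, f_new, new_total)
        is_rising = per_million(f_new, new_total) > per_million(f_old, old_total)
        if g2 >= threshold and is_rising:
            results.append((word, g2))
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy counts, for illustration only.
old_counts = {"word_a": 50, "word_b": 100}
new_counts = {"word_a": 52, "word_b": 40, "word_c": 30}
rising = rising_words(old_counts, 1_000_000, new_counts, 1_000_000)
for word, g2 in rising:
    print(word, round(g2, 2))
```

In this toy data only `word_c`, which is absent from the older counts, passes both the significance threshold and the direction check. In practice the counts would come from morphologically analysed corpus data, and the unit of counting (for example lemma versus surface form) would need to match across both subcorpora.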

The third presentation looks ahead to future expansion by analyzing SNS data planned for inclusion in BCCWJ2. It discusses the linguistic features of this genre and the methodological and conceptual challenges involved in incorporating socially generated texts into a balanced reference corpus.

Through these contributions, the panel aims to stimulate discussion on how large-scale reference corpora can be sustainably extended, how balance and representativeness should be reconsidered in light of new genres and media, and how BCCWJ2 can support next-generation research on contemporary Japanese.

Accepted papers