Ling13: From BCCWJ to BCCWJ2: Building the Next Generation Balanced Corpus of Contemporary Japanese

Convenors:: Toshinobu Ogiso (National Institute for Japanese Language and Linguistics)
Makoto Yamazaki (National Institute for Japanese Language and Linguistics)

Discussant:: Irena Srdanović (Juraj Dobrila University of Pula)

Format:: Panel

Section:: Language and Linguistics

Location:: 4.17

Sessions:: Friday 28 August, 14:00-15:30
Time zone: Europe/Warsaw

Short Abstract

This panel introduces BCCWJ2, an extension of the Balanced Corpus of Contemporary Written Japanese, through its design principles, current implementation and demonstration, lexical analyses based on the newly constructed data, and exploratory analyses of SNS data planned for future inclusion.

Long Abstract

This panel presents BCCWJ2, an extension of the Balanced Corpus of Contemporary Written Japanese (BCCWJ), which has served as a core reference corpus for research on modern Japanese since its initial release. BCCWJ2 aims to expand the corpus from 100 million to approximately 200 million words by adding large-scale book data published between 2006 and 2025, thereby extending its temporal coverage while preserving continuity with the design principles of the original BCCWJ.

The panel brings together researchers directly involved in the development and analysis of BCCWJ2 to discuss its design, current implementation, and future directions. After a brief overview situating BCCWJ2 within the broader context of Japanese corpus development, the panel consists of three main presentations.

The first presentation focuses on the design of BCCWJ2, addressing key issues such as sampling strategies, genre balance, metadata structure, and annotation policies. It also includes a demonstration of the portion of the corpus already constructed, which is scheduled for public release via Chunagon, allowing users to explore the data in continuity with the original BCCWJ.

The second presentation examines the lexical characteristics of BCCWJ2, comparing vocabulary usage in the newly added data with that of earlier periods represented in BCCWJ. Special attention is paid to lexical items that have emerged or become prominent since 2006, based on a comparison between the original BCCWJ and the 2006–2010 portion of BCCWJ2, illustrating the potential of the extended corpus for studying recent lexical change.

The third presentation looks ahead to future expansion by analyzing SNS data planned for inclusion in BCCWJ2. It discusses the linguistic features of this genre and the methodological and conceptual challenges involved in incorporating socially generated texts into a balanced reference corpus.

Through these contributions, the panel aims to stimulate discussion on how large-scale reference corpora can be sustainably extended, how balance and representativeness should be reconsidered in light of new genres and media, and how BCCWJ2 can support next-generation research on contemporary Japanese.

Accepted papers

Session 1 Friday 28 August, 2026, 14:00-15:30

Incorporating Social Media Data into a Balanced Reference Corpus: Design Challenges in BCCWJ2

Kanato Ochiai (National Institute for Japanese Language and Linguistics)

Paper short abstract

This paper examines challenges in incorporating social media data into BCCWJ2, an extension of the Balanced Corpus of Contemporary Written Japanese. Using data from microblog-style platforms, it discusses issues of representativeness and metadata design in building a balanced reference corpus.

Paper long abstract

This paper examines the challenges involved in incorporating social media data into BCCWJ2, an ongoing extension of the Balanced Corpus of Contemporary Written Japanese. While texts produced on social media have become a central component of contemporary written communication, their inclusion in a public reference corpus raises methodological, conceptual, and ethical issues that differ substantially from those associated with traditional published materials.

The paper focuses on microblog-style platforms such as Bluesky and Misskey, which are structurally comparable to Twitter (X) and are currently being considered as candidates for inclusion in BCCWJ2. After outlining the current state of large-scale data collection from these platforms, the paper discusses key issues related to platform selection, data volume, and representativeness. Given that social media data can be collected on a scale far exceeding that of the corpus as a whole, careful selection and sampling are required in order to maintain balance within BCCWJ2.

Particular attention is paid to criteria for selecting social media texts, including the distribution of post types (e.g. original posts, replies, automated posts), temporal dispersion to mitigate event-driven bias, and the potential overrepresentation of language use by a small number of highly active users. The paper also addresses the treatment of metadata, considering which information should be retained within an XML-based corpus structure to support linguistic research while avoiding unnecessary personal identification.
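The selection criteria above can be illustrated with a minimal sketch. It is not the project's actual procedure: the record keys (`user`, `date`, `type`), the decision to drop automated posts entirely, the per-user cap, and the equal monthly quota are all assumptions made for illustration.

```python
import random
from collections import defaultdict
from datetime import date


def sample_posts(posts, per_user_cap, per_month_quota, seed=0):
    """Sample posts with a per-user cap and an upper quota per calendar month.

    `posts` is a list of dicts with the hypothetical keys
    'user', 'date' (datetime.date) and 'type'.
    """
    rng = random.Random(seed)
    shuffled = posts[:]
    rng.shuffle(shuffled)  # random order, so "first k" below is a random pick

    # Step 1: drop automated posts (an assumption) and cap each user's
    # contribution so a few highly active accounts cannot dominate.
    taken_per_user = defaultdict(int)
    capped = []
    for post in shuffled:
        if post["type"] == "automated":
            continue
        if taken_per_user[post["user"]] < per_user_cap:
            taken_per_user[post["user"]] += 1
            capped.append(post)

    # Step 2: group by calendar month and take at most the quota from each,
    # which spreads the sample over time and dampens event-driven spikes.
    by_month = defaultdict(list)
    for post in capped:
        by_month[(post["date"].year, post["date"].month)].append(post)

    sample = []
    for month in sorted(by_month):
        sample.extend(by_month[month][:per_month_quota])
    return sample


if __name__ == "__main__":
    # Invented toy data: one very active user, one occasional user.
    toy = [{"user": "heavy", "date": date(2024, 1, 1 + i % 28), "type": "original"}
           for i in range(50)]
    toy += [{"user": "light", "date": date(2024, 2, 3), "type": "reply"}
            for _ in range(2)]
    picked = sample_posts(toy, per_user_cap=5, per_month_quota=3)
    print(len(picked), "posts sampled")
```

In a real design the cap and quota would presumably be tuned against the target size of the social media component, and the distribution of post types could be controlled by a further stratification step rather than by simple exclusion.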

By highlighting these issues, the paper argues that incorporating social media data into a balanced reference corpus is not simply a matter of increasing data volume. Rather, it requires reconsideration of fundamental design principles such as balance, representativeness, and usability. The discussion aims to contribute to broader debates on how large-scale reference corpora can adapt to new forms of digitally mediated language use.

Designing BCCWJ2: Sampling, Metadata, Annotation, and User Interface with a Demonstration

Ningchen Wu (National Institute for Japanese Language and Linguistics)

Paper short abstract

This talk outlines the design of BCCWJ2, an expansion of BCCWJ to 200 million words. We describe NDC-stratified annual book sampling (5M words/year; ~1,000 books), streamlined metadata/annotation and copyright-aware policies, and demonstrate the released data in Chunagon.

Paper long abstract

This presentation reports on the design of BCCWJ2, the ongoing expansion of the Balanced Corpus of Contemporary Written Japanese (BCCWJ) from approximately 100 million to 200 million words. The core addition is a large-scale collection of book data published between 2006 and 2025, sampled at a fixed rate of 5 million words per year. To preserve the principle of balance while ensuring transparency and replicability, sampling is stratified by the Nippon Decimal Classification (NDC), resulting in approximately 1,000 sampled books per year.
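As a rough illustration of NDC-stratified sampling, the sketch below distributes a yearly quota of books over the ten NDC main classes in proportion to the number of titles in each class, then draws books at random within each class. Proportional allocation with largest-remainder rounding is an assumption made here for illustration; the abstract does not state how the strata are actually weighted, and the function names are hypothetical. For scale, 5 million words over roughly 1,000 books works out to about 5,000 words per sampled book on average.

```python
import random


def allocate(stratum_sizes, total):
    """Split `total` sample slots over strata in proportion to their size.

    Uses largest-remainder rounding so the allocations sum to `total`.
    """
    population = sum(stratum_sizes.values())
    exact = {k: total * n / population for k, n in stratum_sizes.items()}
    alloc = {k: int(v) for k, v in exact.items()}  # floor of each share
    leftover = total - sum(alloc.values())
    # Hand the remaining slots to the strata with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda k: exact[k] - alloc[k], reverse=True)
    for k in by_remainder[:leftover]:
        alloc[k] += 1
    return alloc


def stratified_sample(books_by_ndc, total, seed=0):
    """Draw `total` books, stratified by NDC class (keys of `books_by_ndc`)."""
    rng = random.Random(seed)
    sizes = {k: len(v) for k, v in books_by_ndc.items()}
    alloc = allocate(sizes, total)
    return {k: rng.sample(books_by_ndc[k], alloc[k]) for k in books_by_ndc}


if __name__ == "__main__":
    # Invented title counts for three NDC main classes
    # (3 = social sciences, 4 = natural sciences, 9 = literature).
    print(allocate({"3": 600, "9": 300, "4": 100}, 7))
```

Fixing the random seed, as done here, is one simple way to keep a draw replicable once the yearly sampling frame is fixed.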

We focus on design choices that balance continuity with BCCWJ1 and practical feasibility under current conditions. First, we explain the revised sampling workflow: BCCWJ2 defines the population and sampling unit at the book level, simplifying procedures compared to earlier page- or character-based randomization while keeping NDC-based stratification. Because complete bibliographic information for the entire 2006–2025 period was not available at the outset of the project, the sampling frame is operationalized year by year, allowing steady annual growth and timely releases.

Second, we outline metadata and annotation policies. To improve construction efficiency while maintaining usability, we prioritize the metadata items exposed in the Chunagon interface and streamline less-used structural tag sets. The talk also introduces new metadata for safer secondary use (e.g., adult-content filtering based on an R-rating scheme) and explains how these decisions support research and educational contexts.

Third, we address copyright-aware design. Building on Japan’s post-2018 legal environment for text and data mining (TDM), BCCWJ2 adopts operational constraints such as excluding short poetic works and carefully controlling the length of contexts displayed in web services.

The presentation concludes with a live demonstration of the currently constructed subset (focusing on the 2006–2010 portion) in Chunagon, showing continuity with BCCWJ1 workflows and illustrating how users in and outside Japan can access and make use of the expanded corpus.

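Abstract in Japanese (translated into English):

This presentation reports on the design policy and implementation status of BCCWJ2, which expands the Balanced Corpus of Contemporary Written Japanese (BCCWJ) from approximately 100 million to approximately 200 million words. BCCWJ2 takes books published between 2006 and 2025 as its core additional data and, with a fixed target of 5 million words per year, draws approximately 1,000 books each year through stratified sampling based on the NDC (Nippon Decimal Classification). To ensure continuity with the design philosophy of BCCWJ1 while keeping construction running smoothly, the population and sampling unit are defined at the level of the book, and the sampling frame is set year by year. We also describe (1) the management of sampling rates in light of genre balance and changes in the publishing situation (e.g., trends in the number of titles published); (2) a metadata design that prioritizes the meta-information presented in Chunagon, together with the streamlining of tags and annotations based on actual usage; (3) the addition of new metadata such as R18, with educational and public use in mind; and (4) operational constraints premised on the legal framework in place since 2018 (the exclusion of short-form poetry and the control of context length in search services). Finally, for the data already constructed and released in stages (the 2006–2010 portion), we demonstrate search and browsing in Chunagon and show how BCCWJ2 can be used within a user experience continuous with BCCWJ1.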

Lexical Change in Contemporary Japanese: Insights from Comparing BCCWJ2 with BCCWJ

Asuko Kondo (National Institute for Japanese Language and Linguistics)

Paper short abstract

This presentation outlines the morphological annotation of BCCWJ2 and compares it with earlier BCCWJ data to identify newly emerged words and significant frequency changes, demonstrating how BCCWJ2 supports the analysis of diachronic lexical change in contemporary Japanese.

Paper long abstract

This presentation provides an overview of the morphological annotation in BCCWJ2 and examines the lexical characteristics of BCCWJ2 through a comparison with data from earlier periods represented by BCCWJ. Using this morphological information, we extract lexical items that have newly emerged or have shown statistically significant frequency changes, thereby clarifying distinctive features of the expanded corpus.

BCCWJ2 adopts two types of word units, short-unit words and long-unit words, and annotates each with morphological information. Short-unit words are linguistic units defined with a focus on morphological structure; they are characterized by clear criteria and minimal variation in segmentation. Morphological information for short-unit words is annotated by applying manual corrections to the results of automatic morphological analysis using the UniDic morphological dictionary. Long-unit words, by contrast, are defined with a focus on syntactic structure, and their morphological information is constructed by combining the morphological information of short-unit words and applying manual corrections to the output of a newly developed long-unit analyzer. All morphological annotations in BCCWJ2 follow the same specifications as those used in BCCWJ, enabling direct and reliable lexical comparison between the two corpora.

In the latter part of the presentation, we focus on short-unit morphological information to compare data from the 2006–2010 portion of BCCWJ2 with data from earlier periods in BCCWJ. Specifically, we compare lexical frequencies across the two periods using statistical measures in order to identify lexical items that have newly emerged as well as those that have become markedly more or less frequent. Based on these results, we discuss how social and technological changes have influenced Japanese vocabulary in the late 2000s (2006–2010), and how processes of lexical diffusion, stabilization, and decline are reflected in the corpus. Through this analysis, we demonstrate that BCCWJ2 provides an effective foundation for research into diachronic changes in contemporary Japanese vocabulary.
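The abstract does not name the statistical measures used. As one common choice in corpus linguistics, the sketch below computes the log-likelihood (G2) keyness statistic in its two-cell form for a single word observed in two corpora. The function name and the example frequencies are hypothetical; a value above 3.84 corresponds to p < 0.05 at one degree of freedom.

```python
import math


def log_likelihood(freq_a, size_a, freq_b, size_b):
    """Two-cell log-likelihood (G2) for one word in corpora A and B.

    freq_a / freq_b: occurrences of the word in each corpus.
    size_a / size_b: total number of tokens in each corpus.
    """
    total = freq_a + freq_b
    # Expected counts if the word were equally frequent in both corpora.
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    g2 = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2


if __name__ == "__main__":
    # Invented counts: a word absent from the earlier data but attested
    # 30 times in later data of the same size.
    score = log_likelihood(0, 1_000_000, 30, 1_000_000)
    print(score > 3.84)
```

Because G2 only measures how strongly the two frequencies differ, the direction of change (increase or decrease) has to be read off separately by comparing the relative frequencies freq_a / size_a and freq_b / size_b.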