Designing BCCWJ2: Sampling, Metadata, Annotation, and User Interface with a Demonstration

Accepted Paper

NINGCHEN WU (National Institute for Japanese Language and Linguistics)

Paper short abstract

This talk outlines the design of BCCWJ2, which expands BCCWJ from about 100 million to 200 million words. We describe NDC-stratified annual book sampling (5M words/year; ~1,000 books), streamlined metadata and annotation, and copyright-aware policies, and demonstrate the released data in Chunagon.

Paper long abstract

This presentation reports on the design of BCCWJ2, the ongoing expansion of the Balanced Corpus of Contemporary Written Japanese (BCCWJ) from approximately 100 million to 200 million words. The core addition is a large-scale collection of book data published between 2006 and 2025, sampled at a fixed rate of 5 million words per year. To preserve the principle of balance while ensuring transparency and replicability, sampling is stratified by the Nippon Decimal Classification (NDC), resulting in approximately 1,000 sampled books per year.
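As a quick consistency check on the figures above, the arithmetic can be written out. This assumes that every publication year from 2006 to 2025 inclusive contributes the full 5 million words; the per-book figure is only an implied average, since the abstract does not state how much text is taken from each book.

```latex
\[
(2025 - 2006 + 1)\ \text{years} \times 5\ \text{million words/year}
  = 20 \times 5\ \text{million words}
  = 100\ \text{million words}
\]
\[
\frac{5{,}000{,}000\ \text{words/year}}{\approx 1{,}000\ \text{books/year}}
  \approx 5{,}000\ \text{words per sampled book (average)}
\]
```

Under that assumption, the new book data accounts for the full growth from approximately 100 million to 200 million words.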

We focus on design choices that balance continuity with BCCWJ1 and practical feasibility under current conditions. First, we explain the revised sampling workflow: BCCWJ2 defines the population and sampling unit at the book level, simplifying procedures compared to earlier page- or character-based randomization while keeping NDC-based stratification. Because complete bibliographic information for the entire 2006–2025 period was not available at the outset of the project, the sampling frame is operationalized year by year, allowing steady annual growth and timely releases.
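The book-level, year-by-year workflow described above can be illustrated with a minimal sketch. It assumes proportional allocation across top-level NDC classes with largest-remainder rounding, which the abstract does not specify; the function names, the toy book list, and the seed are hypothetical and are not taken from the BCCWJ2 project.

```python
import random
from collections import defaultdict


def allocate_quota(stratum_sizes, total):
    """Split `total` across strata in proportion to their sizes.

    Largest-remainder rounding makes the quotas sum exactly to `total`.
    Proportional allocation is an assumption of this sketch.
    """
    population = sum(stratum_sizes.values())
    if not 0 <= total <= population:
        raise ValueError("total must be between 0 and the population size")
    exact = {k: total * n / population for k, n in stratum_sizes.items()}
    quota = {k: int(v) for k, v in exact.items()}
    shortfall = total - sum(quota.values())
    # Hand the leftover slots to the strata with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda k: exact[k] - quota[k], reverse=True)
    for k in by_remainder[:shortfall]:
        quota[k] += 1
    return quota


def sample_books(books, total, seed):
    """Draw `total` books from one publication year, stratified by NDC class.

    `books` is a list of (book_id, ndc_class) pairs: the sampling frame
    for a single year, with the book as the sampling unit.
    """
    strata = defaultdict(list)
    for book_id, ndc in books:
        strata[ndc].append(book_id)
    quota = allocate_quota({k: len(v) for k, v in strata.items()}, total)
    rng = random.Random(seed)  # fixed seed keeps the draw replicable
    return {k: sorted(rng.sample(strata[k], quota[k])) for k in sorted(strata)}


# Toy frame for one year: NDC 9 = literature, 3 = social sciences,
# 4 = natural sciences. The counts are invented for illustration.
books = (
    [(f"lit-{i}", "9") for i in range(60)]
    + [(f"soc-{i}", "3") for i in range(30)]
    + [(f"sci-{i}", "4") for i in range(10)]
)
sample = sample_books(books, total=10, seed=2006)
print({ndc: len(ids) for ndc, ids in sample.items()})
```

Running the same function once per publication year, each with its own frame and a recorded seed, mirrors the annual operation of the sampling frame while keeping each draw reproducible.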

Second, we outline metadata and annotation policies. To improve construction efficiency while maintaining usability, we prioritize the metadata items exposed in the Chunagon interface and streamline less-used structural tag sets. The talk also introduces new metadata for safer secondary use (e.g., adult-content filtering based on an R-rating scheme such as R18) and explains how these decisions support use in research and educational contexts.

Third, we address copyright-aware design. Building on Japan’s post-2018 legal environment for text and data mining (TDM), BCCWJ2 adopts operational constraints such as excluding short poetic works and carefully controlling the length of contexts displayed in web services.
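To make the idea of controlling displayed context length concrete, here is a minimal keyword-in-context sketch with a hard cap on each side of a hit. It is an illustration only: the function name, the token-based unit, and the default cap of 10 tokens are assumptions, and it does not describe how Chunagon itself limits context.

```python
def kwic(tokens, keyword, max_context=10):
    """Return (left, hit, right) triples for every occurrence of `keyword`.

    Each context side is capped at `max_context` tokens, so a search
    result can never expose more than a bounded window of the source text.
    """
    if max_context < 0:
        raise ValueError("max_context must be non-negative")
    hits = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = tokens[max(0, i - max_context):i]
            right = tokens[i + 1:i + 1 + max_context]
            hits.append((left, token, right))
    return hits


# Hypothetical pre-tokenized sentence; real corpus data would come from
# a morphological analyzer rather than a hand-written list.
tokens = ["本", "を", "読む", "こと", "は", "楽しい"]
for left, hit, right in kwic(tokens, "読む", max_context=2):
    print("".join(left), f"[{hit}]", "".join(right))
```

Capping the window in the retrieval function, rather than in the display layer, means that no downstream view can accidentally show a longer passage.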

The presentation concludes with a live demonstration in Chunagon of the subset constructed so far (focusing on the 2006–2010 portion), showing continuity with BCCWJ1 workflows and illustrating how users inside and outside Japan can access and use the expanded corpus.

Abstract in Japanese (if needed):

本発表では、「現代日本語書き言葉均衡コーパス（BCCWJ）」を約1億語から約2億語へ拡張する BCCWJ2 の設計方針と実装状況を報告する。BCCWJ2 では 2006–2025 年刊行の出版書籍を中核追加データとし、年 500 万語の固定量を目標に、NDC（日本十進分類法）に基づく層別サンプリングによって毎年約 1,000 冊を抽出する。BCCWJ1 の設計思想との継続性を確保しつつ、構築を円滑に進めるため、母集団・抽出単位を「書籍」ベースに整理し、年次ごとにサンプリング枠を設定する運用とする。あわせて、(1) ジャンルバランスと出版状況変化（刊行点数の推移等）を踏まえた抽出率の管理、(2) 中納言で提示されるメタ情報を優先したメタデータ設計と、利用実態を踏まえたタグ／注釈の合理化、(3) 教育・公共利用を想定した R18 等の新規メタデータ付与、(4) 2018 年以降の法制度を前提とした運用上の制約（短詩形作品の非収録、検索サービスでの文脈長制御）を述べる。最後に、既に構築済みで段階的に公開したデータ（2006–2010 年分）について、中納言（Chunagon）上での検索・閲覧をデモし、BCCWJ1 と連続した利用体験の中で BCCWJ2 をどのように活用できるかを示す。

Panel Ling13
From BCCWJ to BCCWJ2: Building the Next Generation Balanced Corpus of Contemporary Japanese
Session 1 Friday 28 August, 2026, 14:00-15:30