Accepted Paper
Paper short abstract
This paper examines challenges in incorporating social media data into BCCWJ2, an extension of the Balanced Corpus of Contemporary Japanese. Using data from microblog-style platforms, it discusses issues of representativeness and metadata design in building a balanced reference corpus.
Paper long abstract
This paper examines the challenges involved in incorporating social media data into BCCWJ2, an ongoing extension of the Balanced Corpus of Contemporary Japanese. While texts produced on social media have become a central component of contemporary written communication, their inclusion in a public reference corpus raises methodological, conceptual, and ethical issues that differ substantially from those associated with traditional published materials.
The paper focuses on microblog-style platforms such as Bluesky and Misskey, which are structurally comparable to Twitter (X) and are currently being considered as candidates for inclusion in BCCWJ2. After outlining the current state of large-scale data collection from these platforms, the paper discusses key issues related to platform selection, data volume, and representativeness. Because social media data can be collected on a scale far exceeding the size of the corpus as a whole, careful selection and sampling are required to maintain balance within BCCWJ2.
Particular attention is paid to criteria for selecting social media texts, including the distribution of post types (e.g. original posts, replies, automated posts), temporal dispersion to mitigate event-driven bias, and the potential overrepresentation of language use by a small number of highly active users. The paper also addresses the treatment of metadata, considering which information should be retained within an XML-based corpus structure to support linguistic research while avoiding unnecessary personal identification.
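As an illustration only, two of the selection criteria mentioned above (limiting the share of highly active users and spreading posts over time) could be sketched as follows. This is not the BCCWJ2 procedure: the function names, the dictionary keys (`user`, `date`, `type`), and the parameters `per_user_cap` and `per_month_quota` are all hypothetical, and the use of calendar months as time strata is an assumption made for the example.

```python
import random
from collections import Counter, defaultdict


def sample_posts(posts, per_user_cap, per_month_quota, seed=0):
    """Draw a sample that limits posts per user and per calendar month.

    `posts` is a list of dicts with the hypothetical keys
    'user' (str), 'date' (datetime.date), and 'type' (str).
    """
    rng = random.Random(seed)
    shuffled = list(posts)
    rng.shuffle(shuffled)  # random order so that the caps pick posts at random

    # Step 1: keep at most `per_user_cap` posts from any single user,
    # so that a few highly active accounts cannot dominate the sample.
    per_user = defaultdict(int)
    capped = []
    for post in shuffled:
        if per_user[post["user"]] < per_user_cap:
            per_user[post["user"]] += 1
            capped.append(post)

    # Step 2: keep at most `per_month_quota` posts per calendar month,
    # so that a burst of posts around one event is not overrepresented.
    per_month = defaultdict(int)
    sample = []
    for post in capped:
        month = (post["date"].year, post["date"].month)
        if per_month[month] < per_month_quota:
            per_month[month] += 1
            sample.append(post)
    return sample


def type_distribution(sample):
    """Count post types (e.g. original, reply, automated) in a sample."""
    return Counter(post["type"] for post in sample)
```

Checking `type_distribution` on the result makes the mix of original posts, replies, and automated posts visible, so that it can be compared against whatever target distribution the corpus design adopts. A fixed quota per month is only one possible way to achieve temporal dispersion; proportional or weighted schemes are equally conceivable.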
By highlighting these issues, the paper argues that incorporating social media data into a balanced reference corpus is not simply a matter of increasing data volume. Rather, it requires reconsideration of fundamental design principles such as balance, representativeness, and usability. The discussion aims to contribute to broader debates on how large-scale reference corpora can adapt to new forms of digitally mediated language use.
From BCCWJ to BCCWJ2: Building the Next Generation Balanced Corpus of Contemporary Japanese