Three approaches to text mining the Mitsui Mi’ike mine archive

Accepted Paper

Raja Adal (University of Pittsburgh)

Paper short abstract

This presentation describes how questions of access, location, technical expertise, and the politics of memory affect three different approaches to text mining the Mitsui Mi'ike Mine archive: indexing by research assistants, machine learning, and crowdsourcing.

Paper long abstract

This presentation describes the possibilities and challenges of different approaches to text mining. It compares three approaches to mining the Mitsui Mi'ike Mine archive, one of the largest historical archives of modern Japan, which includes 20,713 individual documents dating from 1889 to 1940. The three approaches are: 1) The creation of an index of the archive using a team of research assistants. The challenges of this approach included securing funding (along with the international transfer fees and tax implications in Japan that came with it), finding a head research assistant who was highly proficient in reading cursive Japanese (kuzushiji), recruiting research assistants on Twitter, automating some of the data entry, establishing rules for entering the remaining data, and finally publishing the results according to FAIR principles. 2) A human-in-the-loop machine learning project that uses neural networks to detect and classify some one hundred thousand stamps in the archive. This involved collaboration with computer scientists at the Pittsburgh Supercomputing Center and has already resulted in the publication of one paper. The computer scientists wrote the code, but they relied on content experts to train the model by manually entering some of the data. 3) Crowdsourcing the transcription of the entire text of the Mitsui Mi'ike Mine archive using the Minna de honkoku crowdsourcing website, which would have opened the tantalizing possibility of making all thirty thousand pages of documents machine readable. This third approach was never carried out because the Mitsui Archive was concerned about the liabilities of putting the documents online. Each of these three approaches requires different amounts of labor, types of labor, funding schemes, and amounts of time, and each yields different outcomes.
Their comparison shows how issues of access, location, technical expertise, and the politics of memory can interface with various text mining technologies, research assistance, funding schemes, and publishing venues.

Panel Transdisc_Digi_05
Digital humanities individual papers II
Session 1 Saturday 19 August, 2023, 9:00-10:30
