Short abstract:
This paper examines research on language and culture in computational linguistics in order to understand and theorize the field's critique of itself. It further characterizes the language ideological assumptions that motivate the construction and application of standardized tests as benchmarks for LLMs.
Long abstract:
How do the computational linguists and computer scientists who develop LLMs understand language and culture? In this paper, we examine research on language and culture in the field of LLM development to understand how the field critiques itself.
We postulate that the first round of critique, aimed at supervised machine learning classifiers, was the discovery of “bias,” and that the response to this discovery was to “balance the training sets” (Garrido-Muñoz et al. 2021; Shah et al. 2020). In the current era of machine learning, the critique is aimed at unsupervised models reinforced with human feedback, which exhibit emergent qualities: while the mode of interaction is predetermined, the range of outputs is not. Earlier NLP benchmarks that use formal attributes to measure how closely a model is able to imitate natural language, such as GLUE (which measures a model’s ability to answer questions, detect sentiment, draw inferences, and perform other generalizable tasks) and MAUVE (which measures how close machine-generated text is to human-written text), are no longer sufficient. Broader access to LLMs has made testing on human benchmarks, such as the SAT and the bar exam, and on cultural alignment instruments, such as the Hofstede Culture Survey (Cao et al. 2023), a popular way to probe the quality of models, both for marketing purposes and in research papers.
This paper examines standardized testing as a benchmark for machines in order to (1) explore the underlying language ideological assumptions of LLM developers, which (2) inform how they understand and critique the production of meaning in synthetic text.