When models judge models: automated evaluation and the making of scalable machine intelligence

Accepted Paper

Monika Jankowska (Rice University)

Paper short abstract

The paper examines how AI researchers and developers move away from direct human evaluation of AI systems toward automated, model-driven assessments, through which standards of machine competence are defined and applied, while the role of human judgment in these processes is increasingly obscured.

Paper long abstract

Efforts to make artificial intelligence (AI) systems behave in ways people find competent and acceptable are often analyzed in terms of adequate governance practices or the human labor involved. Less attention has been paid to the automation of the evaluative processes through which these efforts are realized. As expectations that AI systems perform demanding tasks at scale intensify, AI researchers and developers seek ways to evaluate performance across complex and heterogeneous tasks. Such evaluation has traditionally relied on human data and judgment, which have been foundational to the development of AI. Yet, within scaling-oriented logics, such reliance increasingly comes to be treated as a bottleneck. To bypass this constraint, evaluation and adjustment have been shifting toward approaches such as reinforcement learning from AI feedback (RLAIF) and “LLM-as-a-judge,” which enable AI systems to refine their performance, master increasingly complex tasks, and align with human needs through internalized, machine-led feedback loops. Drawing on ethnographic research conducted at a university-based computer science laboratory in Beijing and an analysis of relevant technical literature and expert discourses, this paper examines how these mechanisms produce a form of machine intelligence oriented toward continuous improvement and broad deployment and how this orientation redefines what counts as competent reasoning within AI development. At the same time, it examines how these mechanisms shift judgment from an external human activity into a hidden layer of the system itself, where standards of competence are increasingly automated and treated as intrinsic system properties, and the role of human judgment is further obscured.

Panel P044
The Transhuman condition? Rethinking intelligence, sentience, and personhood in the age of AI
Session 2, Thursday 23 July 2026, 11:30-13:15
