- Convenors: Niccolò Tempini (University of Exeter), Laura Savolainen, Florian Jaton (EPFL), Benedetta Catanzariti (University of Edinburgh), James WE Lowe (University of Exeter)
- Discussants: Minna Ruckenstein (University of Helsinki), Susanne Bauer (University of Oslo), Geoffrey Bowker (University of California, Irvine)
- Format: Combined Format Open Panel
Short Abstract
The study of ground truth construction practices reveals the contingent, negotiated processes underlying AI's epistemological claims. This track pursues theory-building to understand how provisional data functions as foundational 'truth' and what other approaches might yield more resilient futures.
Description
Ground truth construction (the production of datasets for training and evaluating machine learning systems) constitutes a critical site for examining the epistemological foundations underpinning contemporary AI development. While ground truths ostensibly provide objective benchmarks for model validation, empirical investigations reveal negotiated, contingent, and context-dependent processes that challenge straightforward assumptions about data, evidence, and measurement in automated systems.
This track interrogates ground truth construction as an analytical aperture into broader questions concerning knowledge production and AI development practices. An STS lens can empirically examine: how radical uncertainty is stabilised in order to enable the production of epistemic claims; what practices and socio-technical arrangements enable provisional data to function as foundational truth; how commercial imperatives, organizational logics, and platform architectures constrain what constitutes valid or sufficient evidence; and what alternative methodological frameworks might better accommodate the inherently speculative character of AI knowledge systems.
Ground-truthing practices exemplify tensions central to the conference theme of more-than-now and resilient futures. Current AI development practices privilege pragmatic expediency over robust epistemological foundations, and what works today rarely carries over resiliently to the changed contexts of tomorrow. AI systems remain notoriously brittle precisely because the ground truths they rely on embody compromises between contingency and robustness. Examining these practices through an STS lens enables critical inquiry into how things could be otherwise: what alternative futures are being foreclosed?
We particularly welcome contributions pursuing theory-building and methodological innovation rather than purely diagnostic critique. Papers might develop novel conceptual vocabularies for understanding epistemic practices under uncertainty; propose alternative validation frameworks oriented toward resilience; offer comparative analyses across domains; or trace genealogies illuminating how current practices achieved normalization.
The organizing team brings diverse empirical grounding across multiple AI research and development domains. The combined format integrates traditional paper presentations with a commentator roundtable session facilitating collective theorization of speculative infrastructures constituting AI knowledge production.
Accepted contributions
Session 1
Short abstract
This presentation proposes a processual theoretical framework to trace how conceptual assumptions are embedded in machine learning algorithms via ground-truth datasets, adapting Strauss’s “arc of work”. It draws on two empirical case studies, in the fields of justice and environmental health.
Long abstract
Machine learning algorithms are often portrayed as large-scale statistical tools that marginalize theory and conceptualization. Yet research on ground-truthing and data annotation shows that conceptual work remains central to model training and crucially shapes algorithmic outcomes. This work, however, largely remains invisible, as it is both displaced upstream into data preparation processes and fragmented across multiple actors involved in algorithmic production, including domain experts, engineers, data scientists, and annotators.
In this context, how can the conceptual, political, and social assumptions shaping algorithms be traced as they are incorporated through ground-truth datasets? This presentation proposes a methodological and conceptual framework for studying the progressive constitution of ground-truth datasets as a situated, processual activity. It adapts Anselm Strauss’s notion of “arc of work” to follow the successive stages of dataset construction and annotation, highlighting the articulation work required to align a plural, unstable reality with rigid classification systems. This approach also accounts for the diversity of professional worlds involved, making visible the tensions, turning points, and iterative adjustments through which AI systems are gradually configured and reconfigured.
The framework is grounded in two empirical fields: an ethnography (2020–2024) at two sites of AI production within the French justice system (Supreme Court and Ministry of Justice), and interviews and document analysis (2025) dedicated to the development of an AI tool in the field of environmental health within the French Ministry for Ecological Transition. In both cases, the study follows the full production chain, from category definition to annotation practices.
Short abstract
We examine counterfactual medical image generation in health Causal AI. Generating hypothetical data shifts real-world observations towards simulated ideal conditions. We highlight ‘pragmatic expediency’ and the role of synthetic data aiming to overcome the imperfect state of real-world data.
Long abstract
Robustness of AI models is often attributed to the representation quality of training data – a questionable assumption (e.g. Jaton 2025). Growing interest in the use of synthetic data has built on the acknowledgement that existing data are imperfect, making the former desirable for improving sample quality (Jordon et al. 2022; Offenhuber 2025). We examine counterfactual image generation (e.g. Roschewitz et al. 2025) in health Causal AI.
To capture real-world diversity, training data should be assembled by sourcing datasets across different countries, health systems and populations. In practice, this means accessing an uneven and relatively contingent collection of datasets. Synthetic images can then be useful to address gaps in data availability, coverage and quality.
Clinically generated data can be ‘cleaned’ of noise through the generation of ‘what-if’ images, for instance by simulating how scans would look under different lighting conditions. Similarly, existing medical scans can be used to generate a scan for a hypothetical patient of, for instance, a different age. Manipulating existing data to generate hypothetical clinical observations could shift real-world observations toward a distribution generated under ideal conditions, which can serve as ground truth data for machine learning.
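To make the mechanism concrete, the sketch below shows the generic abduction-action-prediction pattern behind counterfactual image generation. It is a minimal illustration under our own naming, not the pipeline of Roschewitz et al.; every component and parameter in it is a hypothetical stand-in.

```python
# Illustrative sketch of counterfactual image generation via the
# abduction-action-prediction pattern. All components are hypothetical
# stand-ins, not the pipeline of Roschewitz et al. (2025).

def generate_counterfactual(scan, attrs, encoder, decoder, intervention):
    """Produce a 'what-if' scan by intervening on one attribute.

    scan:         observed medical image (e.g. an array)
    attrs:        observed attributes, e.g. {"age": 62, "site": "A"}
    encoder:      model inferring patient-specific latent 'noise'   [abduction]
    decoder:      generative model mapping (latent, attrs) to an image
    intervention: attribute changes, e.g. {"age": 30}               [action]
    """
    latent = encoder(scan, attrs)          # abduction: keep what is patient-specific
    cf_attrs = {**attrs, **intervention}   # action: set the attribute to a hypothetical value
    return decoder(latent, cf_attrs)       # prediction: render the hypothetical scan

# e.g. cf = generate_counterfactual(scan, {"age": 62}, enc, dec, {"age": 30})
```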
Synthetic data can serve as ground truth thanks to pragmatic and empirical strategies. Here the picture is more complicated than a naïve account of ground truth construction would ‘paint’ it. We highlight the importance of ‘pragmatic expediency’ and outline a case for the role of synthetic data in this more complex picture.
Short abstract
This paper proposes a theory of ground truth as fiction, building on Sylvia Wynter's analysis of how Western modernity mistakes its particular map for the territory itself.
Long abstract
This paper proposes a theory of ground truth as fiction. We argue that machine learning's "ground truth" datasets continue the operation Wynter identifies in Western modernity: naturalizing situated, partial vision as universal reality, overrepresenting specific ways of seeing as if they were the only way to see.
In the 1750s, the German art historian and archaeologist Winckelmann established pure white marble as the Greek sculptural ideal. Modern analysis shows that Greek statues were painted, yet the fiction persists. Winckelmann's operation, mistaking his map (weathered marble) for the territory (Greek aesthetics), established a pattern replicated to this day in visual analysis, including machine vision.
Through art historical genealogy spanning neoclassical aesthetics to contemporary machine learning, we trace how we repeatedly mistake our maps for the territory. Following Gil-Fournier and Parikka's argument that ground truth has shifted from physical ground to the "ground of the image," we examine how each visual regime produces this cartographic fiction. Photography positioned chemical trace as objective reality while encoding racialized standards. Aerial reconnaissance literalized "ground truth" through imperial surveillance. Machine learning datasets like ImageNet naturalize Western visual taxonomies as universal training foundations. We are still mistaking maps for the territory.
Art historical methods reveal ground truth's constitutive operations: singular claims erasing plural ways of seeing, partial apparatus positioned as neutral observer, power relations encoded as objective categories. When painted statues violate expectations, when facial recognition fails, glitches expose the gap between map and territory, revealing we have been using maps all along. What pluriversal alternatives does ground truth's singular overrepresentation continue to foreclose?
Short abstract
Through ethnography and analysis of variational autoencoders, this paper explores how epistemic claims are produced in unsupervised learning systems that do not require ground truth. It shows how uncertainty and validation are enacted across algorithmic pipelines, rather than eliminated.
Long abstract
“Ground truth is a thing of the past”: a phrase I frequently encountered in a Japanese robotics laboratory. This statement justified a move away from supervised learning, towards approaches that do not rely on labelled data. According to my interlocutors, abandoning bias-prone processes of human annotation allows systems to become generative beyond, and more-than, human benchmarks of objectivity and validation.
This paper starts from these intuitions to ask how epistemic claims are stabilised when ground truth is no longer explicitly constructed, and whether unsupervised learning truly escapes the epistemic contingencies of ‘ground-truthing’ or instead redistributes them into less visible and harder-to-interrogate mathematical and computational forms. It examines how uncertainty management and truth production are displaced across technical pipelines, rather than eliminated, by algorithms framed as “learning by themselves”.
My argument develops along two lines. First, through an analysis of the pipeline of an unsupervised architecture (the variational autoencoder), I show how particular modes of organising truth and knowledge become embedded in mathematical logics that reconfigure the relation between generativity and robustness. I argue that unsupervised learning does not eliminate contingent processes of truth-making; rather, it redistributes epistemic labour into formal procedures often opaque to social-scientific critique.
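For readers unfamiliar with the architecture, the VAE's standard training objective, the evidence lower bound, illustrates the kind of formal procedure at issue: reconstruction fidelity and regularisation toward a prior are traded off inside a single loss, with no labelled ground truth appearing anywhere in the expression.

```latex
% Standard VAE objective (the ELBO, Kingma & Welling 2013): maximise
\mathcal{L}(\theta,\phi;x) \;=\;
  \underbrace{\mathbb{E}_{q_\phi(z\mid x)}\!\left[\log p_\theta(x\mid z)\right]}_{\text{reconstruction}}
  \;-\;
  \underbrace{D_{\mathrm{KL}}\!\left(q_\phi(z\mid x)\,\|\,p(z)\right)}_{\text{regularisation toward the prior}}
```

Here "validation" is internal to the objective itself, which is precisely where, on my argument, the contingent truth-making work migrates.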
Second, the paper proposes a methodological reorientation. By tracing how epistemic claims are enacted through mathematical-statistical structures themselves, it demonstrates the value of engaging with how algorithms function, not merely what they do, so as to interrogate modes of epistemic classification and constraint that may seem alien to us, until they are not.
Short abstract
Far from representational stand-ins, ground truths are functional, pragmatic artifacts of model machinery. Examining neural mass modelling for neonatal developmental risk management, we show ground truthing as multi-layered, relational, and shaped by structural contingencies in data and model alike.
Long abstract
In this paper, we discuss practices of ground truth construction in healthcare-applied mathematics research, and specifically their role in the quantification of uncertainty. Our argument builds on the Digital Twins for Modelling Neurodevelopment project, in which mathematicians and clinical pediatricians are working to develop a neural mass model that identifies newborns at high risk of developmental delay, so that they can be routed through appropriate care pathways. A working model would interpret key signatures in the EEG of a sleeping newborn to this end.
The central challenge emerges from the model's non-linear structure: multiple parameter sets can produce identical model outputs, making the interpretation of clinically observed EEG inherently uncertain. Optimization algorithms generate partial, algorithm-dependent parameter distributions. These are validated through a second sampling method, Latin Hypercube Sampling (LHS), which constructs a further ground truth, over the abstract parameter space, against which to evaluate the former. Exact validation is not achievable, so it relies on expert judgment: the visual inspection of complex datasets against a probabilistic ground truth.
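As a minimal illustration of this validation structure, the sketch below uses SciPy's Latin Hypercube sampler to cover a parameter space and build a synthetic reference against which a recovered parameter set can be compared. The forward model is a hypothetical stand-in, not the project's neural mass model.

```python
import numpy as np
from scipy.stats import qmc

# Hypothetical stand-in for a forward map: parameters -> summary features
# of a simulated signal. Not the project's neural mass model.
def forward_model(params):
    a, b, c = params
    return np.array([a * np.sin(b), a * np.cos(c)])

# Latin Hypercube Sampling: stratified coverage of the 3-D parameter space,
# used here to generate a synthetic reference of (parameters, outputs).
sampler = qmc.LatinHypercube(d=3, seed=0)
unit = sampler.random(n=200)                       # points in [0, 1)^3
params = qmc.scale(unit, [0.1, 0.0, 0.0], [2.0, np.pi, np.pi])
outputs = np.array([forward_model(p) for p in params])

# Validation sketch: a candidate parameter set recovered by an optimiser is
# judged by how close its output lands to the LHS-generated reference cloud.
candidate = np.array([1.0, 1.5, 0.5])
distances = np.linalg.norm(outputs - forward_model(candidate), axis=1)
print("closest reference points:", np.argsort(distances)[:5])
```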
This paper shows how, in non-linear mathematical modelling, ground truth practices are multi-layered. A focus on the meta-level validation structure of LHS shows why even comprehensive, perfect data cannot yield perfect models. Ground-truthing moves away from exact pattern recognition toward flexible, robust tuning that acknowledges the limitations of initial models. Ground truths are pragmatic and functional, yet abstract and probabilistic, artifacts. Far from being representational stand-ins for reality, they are relational, operationalized constructs embedded within specific clinical contexts. This understanding reshapes how we conceptualize data and modelling.
Short abstract
Through interviews with expert data labelers and a socio-technical analysis of RLHF, this paper shows how expertise stabilizes uncertainty and legitimizes ground truths in non-convergent domains like the humanities. It also reveals the opacity of labor platforms when assigning expert credentials.
Long abstract
This paper examines the recent practice of hiring “experts” – defined by model developers and labor platforms as Master’s and PhD degree holders in relevant domains – to enhance model performance in the Reinforcement Learning from Human Feedback (RLHF) phase of LLM development. It combines a socio-technical analysis of the RLHF process with in-depth interviews with “expert” workers across different fields on platforms like SurgeAI and Outlier, to understand how model developers conceive of expertise and the assumptions underlying the technical infrastructure that attempts to “encode” expertise.
Performance improvements in RLHF rely on human reviewers converging on one solution to a problem. While this works in STEM domains, where there is usually one correct answer, in fields like the social sciences and humanities that require debate, expertise is narrowed to mean fact retention over nuanced engagement. This epistemic weakness reaches its limits when “expert” logic is applied to creative fields, reducing creativity to credentials and foreclosing the potential for radical uncertainty.
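The reward-modelling objective standardly used in RLHF makes this convergence requirement visible: the pairwise loss assumes exactly one response per comparison is preferred, so disagreement between equally credentialed experts can only enter as contradictory signal. A minimal sketch with toy numbers:

```python
import numpy as np

# Standard pairwise (Bradley-Terry) reward-modelling loss used in RLHF.
# The reward scores below are toy numbers, not outputs of a real model.
def preference_loss(r_chosen, r_rejected):
    """-log sigmoid(r_chosen - r_rejected): assumes exactly one response
    in each pair is 'correct'; annotator disagreement has no representation."""
    return -np.log(1.0 / (1.0 + np.exp(-(r_chosen - r_rejected))))

# Two experts who disagree about the same pair produce contradictory
# training signals; the objective is satisfied only by suppressing one.
print(preference_loss(r_chosen=2.0, r_rejected=0.5))  # expert A's ordering
print(preference_loss(r_chosen=0.5, r_rejected=2.0))  # expert B's reversal
```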
Despite the aforementioned limitations, the insistence on using “experts” across fields shows that expertise acts as a brittle legitimizing category for ground truths rather than a step in model improvement. The contingent foundations of this ground truth are compounded by the erratic behavior of labor platforms, where the title of “expert” is granted opaquely, and is revoked arbitrarily, further undermining the epistemic foundation of this process. This paper highlights the limits of expertise in RLHF and raises questions about how best to encode knowledge in domains defined by uncertainty.
Short abstract
I use a historical case study to probe how researchers established a ground truth for musical similarity — a contested, subjective concept. Their strategies gave researchers evaluating other cultural domains a proof-of-concept. I argue that history plays a vital role in algorithmic explainability.
Long abstract
Ground truths provide the basis for evaluating computational systems. In the musical case, however, evaluation has proven an abiding challenge. “Subjective evaluations are somewhat unreliable…objective evaluation is also problematic, because of the choice of a ground truth to compare the measure to,” observed Sony Music researchers Jean-Julien Aucouturier and François Pachet in a 2004 survey of musical querying methods. Moreover, they readily admitted that the very concepts they were trying to evaluate — usually musical similarity or genre — were “ill-defined,” not readily measurable, and unconducive to consensus (Aucouturier and Pachet, 2003; Pampalk, 2003). Nevertheless, researchers found ways to claim model validation.
This paper examines several experiments by Aucouturier and Pachet (2000, 2002, 2004) on computing musical similarity for music recommendation. This historical case is instructive: early computational systems were smaller, often supervised, and standards were still developing, leading Aucouturier and Pachet to offer explicit discussions of their intuitions. They evaluated their system by comparing “similar” song pairs generated by their signal-processing algorithm against those songs’ textual metadata. The problem: most returned pairs simply reinforced existing cultural knowledge. The optimal results were the “interesting” ones, when the algorithm perceived similarity between unexpected songs. Yet the distinction between an “interesting” result and an incorrect one was never articulated: researchers relied on their own judgement. Ultimately, I show that a historical inquiry into musical evaluation demonstrates how imbricated aesthetic and technical questions became on the early internet, a legacy that can still be excavated today. In doing so, I offer a new methodological approach to algorithmic explainability: history-as-method.
Short abstract
In 2023, the Principal Odor Map promised to allow computers to smell and, simultaneously, to provide insight into how human olfaction works. I examine the contingencies this promise is built upon, how they limit and silo olfaction, and the tradition that made this narrowing of smell possible.
Long abstract
Olfaction as a sensory perception has eluded both digitization and understanding. Google Brain's Principal Odor Map (POM) attempted to solve both problems at once with graph neural networks, mapping thousands of chemical features onto perceptual labels. The promise is that the POM would offer an intuitive understanding of how we perceive molecules as smell and provide new gateways toward smell innovation.
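For readers unfamiliar with the model class, the following is only a generic message-passing sketch of the kind of mapping at stake, from a molecular graph to odor-descriptor probabilities; it is not the published POM architecture, and all features, weights, and labels are invented for illustration.

```python
import numpy as np

# Generic message-passing sketch: a molecular graph (atoms + bonds) mapped
# to odor-descriptor probabilities. Illustrative only; not the actual POM.
rng = np.random.default_rng(0)

A = np.array([[0, 1, 0],        # toy 3-atom molecule: adjacency matrix
              [1, 0, 1],
              [0, 1, 0]])
X = rng.normal(size=(3, 8))     # per-atom chemical features (hypothetical)

W1 = rng.normal(size=(8, 8))    # message-passing weights
W2 = rng.normal(size=(8, 5))    # readout to 5 hypothetical odor labels

H = np.tanh((A + np.eye(3)) @ X @ W1)   # atoms aggregate neighbour features
graph_embedding = H.mean(axis=0)        # pool atoms into one molecule vector
logits = graph_embedding @ W2
odor_probs = 1 / (1 + np.exp(-logits))  # multi-label: 'floral', 'musky', ...
print(odor_probs)
```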
Of course, the POM is built upon a genealogy of olfactory experts, usually perfumers, using a limited and private dataset that follows traditions in the fragrance industry. First, I will analyze the ground truths of the POM team's datasets, then compare how their models benchmarked and measured success against trained panelists. Second, I will assess how these models' ground truths are rooted in limited understandings of olfaction yet endeavor to standardize that faulty model. Third, I will trace how the POM follows an olfactory heritage bound up in materially rich and exclusionary corporate epistemologies, and how that heritage motivates the innovation the POM promises.
The story of machine olfaction shows how arbitrary, quick-to-access datasets of limited human experience, acculturated with semantic clues, become machine perception: that which then affects human sensory perception at large. The standardization of this limited perception, however, stands opposed to senses and bodies that we do not fully understand yet will compromise to fit our systems. Smells tend to have alternative readings across experts and laypeople, across domains and experience, and olfaction may therefore provide insight into resilient sensory modes in the collapsing futures of machine perception.
Short abstract
Drawing on Eco's interpretive semiotics, this paper conceptualizes ground truth construction as supervised semiotic closure. Through the notion of Model Reader, overcoding/undercoding, and aberrant decoding, it examines how annotations compress meaning, with epistemological impacts on AI outputs.
Long abstract
Ground truth construction — the production of labelled datasets for training and evaluating ML models — is a functional prerequisite for AI development. It is also an epistemological operation: it transforms open, interpretable cultural artifacts into fixed categorical assignments that serve as foundational knowledge claims. Drawing on Umberto Eco's distinction between dictionary and encyclopedic models of meaning, I conceptualize ground truth as a form of supervised semiotic closure: the suspension of interpretive openness to produce verifiable data. Where meaning is understood as an unlimited network of culturally situated associations (encyclopedia), annotation practices impose dictionary logic (bounded, stable definitions) onto inherently polysemous material. In LLMs, the cost of this conversion is made invisible by technocultural apparatuses. This paper argues that Eco's interpretive semiotics provides a conceptual vocabulary for analyzing this transformation and its epistemic costs. Three concepts prove particularly relevant: (1) the Model Reader; (2) overcoding/undercoding; (3) aberrant decoding. The paper develops this framework through a case study of annotation practices for art and archaeology photographic records, arguing that the brittleness of contemporary AI systems is a predictable consequence of semiotic closure: artificial systems trained and validated on limited human interpretation and knowledge tend to fail when they encounter contexts where the interpretive chain needs to continue further. By applying Eco's semiotics to the analysis of ground truth, this paper contributes a novel vocabulary for understanding epistemic practices under uncertainty in AI research.
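One concrete site of this semiotic closure is routine label aggregation, where a distribution of readings is collapsed into a single categorical assignment. A minimal sketch, with annotations invented for illustration:

```python
from collections import Counter

# Minimal sketch of 'semiotic closure' in routine label aggregation:
# a plurality of readings is collapsed into one categorical assignment.
# The artifact and its candidate labels are invented for illustration.
annotations = ["votive offering", "votive offering", "grave good",
               "household object", "votive offering", "grave good"]

counts = Counter(annotations)
ground_truth = counts.most_common(1)[0][0]  # majority vote: one label survives

print(counts)        # the encyclopedia: the full distribution of readings
print(ground_truth)  # the dictionary: the single label the model will see
```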