- Convenors:
  - Charlotte Högberg (Lund University)
  - Stefano Canali (Politecnico di Milano)
  - Francis Lee (Södertörn University)
- Format: Combined Format Open Panel
Short Abstract
Synthetic data are touted as fixes to technical, ethical, and political challenges. But what do they mean for the politics of representation in knowledge production? This panel explores the ontological politics of AI and synthetic data, asking how they reshape what—and who—counts as real.
Description
The increased use of artificially generated data and media raises a multitude of concerns, including how the world is represented, and how generated data is enacted as equivalent through chains of “circulating reference” (cf. Latour, 1999). Synthetic data is increasingly used to represent real-world referents and is presented as a solution to key problems of data-driven technological development and knowledge production (Lee et al., 2025). For example, generative models are used to mimic health data that would otherwise be difficult to share or to simulate traffic to train self-driving cars (Jacobsen, 2023). As our futures become increasingly intertwined with synthetic data, this panel aims to investigate what forms of knowledge, futures, and worlds they enact.
We aim to create a meeting space for posing questions about the increasing use of synthetic data: What are the epistemic consequences of using synthetic data? How does the relation between “reality,” “data,” and “representation” change? How is synthetic data used to address “missingness” and gaps of representativity? What are the impacts on transparency and understanding of representations, as synthetic datasets travel and are reused across projects and worlds? How can we evaluate the accuracy of synthetic datasets?
By posing these questions, this panel will expand emerging discussions in STS about the ontological politics of artificial intelligence and synthetic data (Jacobsen, 2023; Johnson & Hajisharif, 2025; Lee et al., 2025). One of the goals of the panel is to connect ongoing discussions about data (Gitelman, 2013; Leonelli, 2019), algorithms (Seaver, 2017), and artificial intelligence (Suchman, 2023) in order to inquire into how synthetic data is emerging as a new set of practices and politics in knowledge production. We welcome empirical, theoretical, and artistic contributions that engage with synthetic data in knowledge making.
Accepted contributions
Session 1
Short abstract
In this talk I discuss synthetic data as an emerging scientific application of generative artificial intelligence, frame it as an extension of surrogative forms of reasoning, and discuss its possible radicalisation into a new phenomenon we need to address: artificialisation.
Long abstract
In the sciences, synthetic data are increasingly presented as a way to fill in epistemological as well as ethical gaps. For instance, in the health context a growing number of contributions present synthetic data as fixes to critical issues of health datasets and science, such as bias, lack of representativity, and high costs. There are various epistemological implications of synthetic data, but in this talk I will focus on a more general framing: I argue that we should frame these and other applications of generative artificial intelligence and machine learning as an extension of surrogative forms of reasoning in science.
Surrogative reasoning has significant grounding in recent scientific practice. For instance, the scientific study of health has a history of using technologies and tools that mediate between scientific interests and world phenomena and work as surrogative systems that are more directly accessible or easier to manipulate experimentally than the target systems under investigation. Synthetic data give us a clear opportunity to study new tools for surrogative reasoning, but they also push towards a new radicalisation that I call 'artificialisation', an extension of surrogative reasoning where generative artificial intelligence and machine learning are used to study phenomena with indirect tools and technologies. Artificialisation raises issues we need to tackle – to understand whether it presents a problematic decoupling of data generation from empirical observation and whether it produces recursive forms of scientific practice, with the risk of generating 'artificious' science.
Short abstract
This paper applies the data journeys framework to synthetic data, tracing how data are constructed, translated and stabilised in practice. Drawing on qualitative work with two industry partners, we examine what synthetic data are and what their reuse demands of social research.
Long abstract
Synthetic data is increasingly positioned as a solution to longstanding problems of access, missingness and representativeness in social research. Yet as social researchers are drawn into contexts where synthetically generated data are available for reuse, critical questions arise: what are these data and how did they come to be? The sociodigital practices through which synthetic data are produced, the decisions embedded in their generation, and the ways these data circulate across contexts remain largely opaque at the point of reuse. This opacity is not incidental but constitutive. Synthetic data do not record social worlds; they construct representations shaped by practices, infrastructures and assumptions that conventional methodological frameworks are poorly equipped to trace.
This paper proposes that the data journeys framework (Bates et al., 2016) offers conceptual resources for addressing this gap. Developed to trace the relational, contingent, practice-laden processes through which data are made and transformed, data journeys directs analytic attention towards the moments of construction, translation and stabilisation that synthetic data production involves. We report on ongoing qualitative work with two industry organisations engaged in synthetic data production, using these collaborations to develop and stress-test the framework across different generative contexts. We reflect on what data journeys can reveal, where it requires adaptation, and what forms of methodological collaboration are necessary. We argue that STS-informed methods for tracing synthetic data provenance are not optional refinements but preconditions for epistemologically robust social research in a moment of rapid expansion in generative AI.
Short abstract
Synthetic data is used for training and benchmarking AI models. However, research on LLM biases and temporal representation suggests that systematic distortions can affect generated data. We integrate insights from previous research to advance systematic evaluation practices for synthetic datasets.
Long abstract
AI training pipelines often face shortages of high-quality, diverse, and ethically shareable data. To address this limitation, synthetic data is increasingly used for training and benchmarking, to mitigate data scarcity and avoid privacy concerns. Alongside large language models (LLMs), smaller models are also used to generate and augment textual, numeric, and mixed datasets.
At the same time, a growing body of research shows that LLMs exhibit systematic biases regardless of model size. Some biases resemble human cognitive or social patterns, while others—such as centrality bias—appear to be artifacts of model training. When LLM-generated synthetic data is reused in training and evaluation pipelines, these distortions risk propagating through the entire data cycle of generation, training, and benchmarking.
Post-training methods such as instruction tuning with human feedback can improve a model’s ability to produce structured outputs, including tabular data. However, these techniques may also intensify certain preference biases.
Temporal biases are particularly important because real-world datasets are inherently temporal and contain complex relationships between events and time. LLMs often struggle to capture these relationships reliably, instead reproducing learned associations that may reflect spurious correlations.
In our recent work, we used synthetic data generation to examine the interaction between inherent LLM biases and representations of historical reality in the extremely low-resourced domain of historical medical periodicals. Building on this case study, we propose integrating research on synthetic data evaluation with studies of LLM biases and temporal distortions to support more robust evaluation practices.
Short abstract
Synthetic data is increasingly developed and used in medical research, but risks (re)producing narrow, sanitised worlds. This talk examines the knowledge politics shaping synthetic data in medical research through the concept of the ontological filter, which draws upon Barad's agential realism.
Long abstract
Synthetic data is framed as a key approach for improving the representation of marginalized groups in medical datasets and for improving access to data while preserving privacy. Despite these aims, synthetic data models risk reproducing narrow, sanitised versions of the world which prioritise the logics underpinning AI whilst being easy to scale up (Jacobsen 2024; Pasquinelli 2023). Responding to these concerns, this talk considers the ethico-onto-epistemology of synthetic data in medical research, bringing together findings from conceptual analysis and semi-structured interviews.
In this analysis, we apply the conceptual lens of Karen Barad’s agential realism - where material-discursive entanglements draw boundaries between what matters and doesn’t, shaping specific ‘worlds’ (Barad 2007). Real-world data has direct indexical grounding, meaning it points back to something that happened in the world as a representation of that event, which necessarily involves choices about what counts as ‘residue’, e.g. noise, edge cases, contradictions, and correlations which are judged as irrelevant. In contrast, synthetic datasets make data features legible to AI models, which go on to employ them in another instantiation of the data gaze (Beer 2018). The shaping of these datasets means there is minimal ‘residue’ outside the representational frame of the latent space of the model, as the data has already been passed through what we term an ‘ontological filter’, where noise becomes intentional rather than a problem to be fixed. In this talk, we interrogate the knowledge politics of how these datasets are constructed and the implications for the representation of marginalized groups.
Short abstract
Synthetic heart data are claimed to boost the performance of predictive models. This presentation discusses the epistemic implications of how synthetic data is experimented with and used for heart failure prediction and what it conveys about synthetic representations of the normal and pathological.
Long abstract
From a single heartbeat, computational models are used to extrapolate consecutive beats for ECG recordings. This is one example of how data about the heart is generated by computation. Improving and personalizing the prediction of risk and prognosis of cardiovascular diseases is also a common motivation for using synthetic electronic health records, which are seen as increasing variability and enlarging training datasets to develop better AI models. In this presentation, I discuss the epistemic and representational implications of how synthetic data is experimented with and used to boost the performance of heart failure prediction.
Studies suggest that AI models trained wholly or partially on synthetic ECG data can outperform models trained exclusively on real-world recordings, by enhancing generalizability, reducing overfitting, and compensating for class imbalance. A persistent challenge is evaluating the accuracy and clinical validity of the generated heartbeats. This raises the question of what counts as a good representation of human hearts, their functions, variations and pathologies.
Drawing on an ongoing ethnographic study, theories about the creation of normalcy and pathology (Canguilhem, 1989) and the politics of prediction (Amoore, 2013), I highlight how the heart is represented by synthetic data in the medical sciences and AI. In doing so, I make visible the epistemic considerations and implications of synthetic data hearts, heartbeats and patient data for heart failure prediction, and discuss what this conveys about how synthetic data are enacted as representations of the normal, pathological and risky heart.
Short abstract
This paper proposes reading synthesis and generativity as instances of a politics of plausibility centered on excess, deception, experimentation and affect. As a governing rationality, plausibility enacts a reduction of the possible while performatively shifting the very category of the plausible.
Long abstract
This presentation centers the logics of (digital) synthesis across domains to interrogate the politics of generativity shaping contemporary artificial intelligence (AI). It traces a genealogy of digital synthesis, situating synthetic data and their generative excess (Amoore et al. 2024; Jacobsen 2023) within a longer history of biological emergence, evolution, heredity, modeling, and synthetic biology. Building on this lineage, I propose that synthesis, as a governing rationality, instantiates a novel form of future-oriented governance—one that no longer merely anticipates and manages risks, whether probable or possible (Amoore 2013), but instead seeks to intervene on reality and shift normativities through the sustained (re-)introduction and (re-)formulation of the plausible. Drawing on examples spanning synthetic datasets, AI-generated genomes, AI-assisted drug "discovery" and species "de-extinction", this presentation explores the onto-epistemic politics of the plausible: plausible objects and beings populate landscapes of the real, recursively shaping the very category of plausibility. I offer a notion of politics of plausibility to probe new ways to address epistemic and political questions tied to digital (and biological) synthesis by foregrounding and characterizing aspects of the plausible, such as excess, potential deception, experimentation and affective engagement. Examining synthesis, generativity and contemporary forms of AI through the lens of plausibility surfaces a novel locus of power: namely, the reduction of the possible to the plausible, and the continuous reshaping of plausibility itself through the experimental synthesis of countless plausible ontologies.
Short abstract
This paper examines AI-generated bodies through documentation theory, arguing that synthetic data practices produce “documentary bodies.” It explores how generative systems reshape the politics of representation and knowledge about bodies.
Long abstract
Contemporary digital infrastructures increasingly translate human bodies into data, profiles, and computational representations circulating across platforms and datasets. While these processes are often approached as technical operations, they can also be understood as documentary practices through which bodies are inscribed, stabilised, and made available for knowledge production.
Drawing on document theory, this paper argues that bodies function as documents within socio-technical systems. Through processes of inscription, classification, and circulation, embodied persons are transformed into documentary objects that can be stored, mobilised, and interpreted across institutional and technical contexts.
Approaching data infrastructures as sites where representations are actively produced foregrounds the practical work through which certain traces of embodiment come to stand in for persons and populations. Developments in generative AI and synthetic data intensify these documentary dynamics. Generative models increasingly produce artificial likenesses, simulated biometric patterns, and synthetic datasets intended to represent human populations. Rather than reproducing existing individuals, these systems assemble documentary bodies: constructed representations standing in for embodied persons within data infrastructures.
Framing synthetic data practices as documentary processes foregrounds the politics of representation embedded in AI systems. Which traces of embodiment become legitimate data? How are bodies abstracted, recomposed, and circulated as authoritative representations? And how do synthetic bodies participate in the production of knowledge about populations and persons? By situating AI-generated representations within longer histories of documentation, inscription, and classification, this paper contributes to STS debates on the ontological politics of synthetic data and the changing conditions under which bodies become knowable through computational systems.
Short abstract
If all data are local achievements, how should we understand locality when data are generated artificially? This paper revisits locality through the lens of synthetic data, examining how representation, reference, and reality are (re)configured under generative/probabilistic/modelled conditions.
Long abstract
Synthetic data are frequently framed as privacy-preserving alternatives to empirical data, positioned as solutions to technical, ethical, and political constraints (Nikolenko, 2021; Steinhoff, 2022). By contrast, STS scholarship has long emphasized data as practical achievements—e.g. sublata (Latour, 1999). However, synthetic data seem to confound much analysis, even in STS, reintroducing dichotomous analyses about the boundaries between the “real” and the “synthetic”. Approaching synthetic data as achievements invites us to think about their situated and local character (Loukissas, 2019), redirecting our analytic attention from data as a given to synthetic data as achievements in particular practices, settings, and infrastructures. Thus, when “ground truth” is constituted through iterative stabilization internal to generative modelling processes, and resemblance becomes probabilistic and architecturally pre-structured, where is locality enacted, and how can we look at the data setting instead of the data set? Engaging with recent work on fabrication, simulation, and synthetic data infrastructures (e.g. Suchman, 2023; van Voorst & Ahlin, 2022; Seta et al., 2024), we argue that synthetic data do not negate locality but displace the sites at which it becomes analytically visible. Although synthetic datasets are framed as context-free, their capacity to function as evidence rests on modelling assumptions, validation practices, and institutional thresholds. Attending to these conditions foregrounds the sociomaterial practices through which probabilistic outputs become credible, comparable, and actionable. In doing so, the paper examines how representation, reference, and reality are stabilized under generative and synthetic conditions and considers how synthetic data shape ongoing politics of locality in knowledge production.
Short abstract
Synthetic data is increasingly used in AFP development as a response to privacy concerns. However, its use appears to obfuscate rather than solve privacy issues related to AFP development, which only reinforces the racialized and gendered systems of control/domination enabled by AFP.
Long abstract
Synthetic data is increasingly used in AFP development as a response to privacy concerns. In fact, AFP is largely based on machine learning techniques that require, among other things, very large datasets to train their algorithmic components. However, assembling face training data can be taxing – as a result of data protection rules regulating face image use, as well as the cost of collecting and annotating the required training data. The latter relies on some of the most vulnerable workers to annotate photographic images of people, often extracted from the internet or video surveillance recordings. As a result, synthetic data has increasingly been seen as an efficient option to generate face training data cheaply and with greater care for privacy.
This article draws on a qualitative study comprising 36 semi-structured interviews conducted with workers operating on freelance platforms in 11 different countries of the majority world, and 11 interviews with AFP researchers based mostly in the global north. This paper makes sense of the production and use of synthetic data as a response to the legal and economic costs of AFP training data production. It offers an empirical analysis of AFP supply chains and argues that the use of synthetic data only makes these supply chains more opaque, as synthetic data itself requires original training data to be generated. The use of synthetic face data thus appears to obfuscate rather than solve privacy issues related to AFP development, which only reinforces the racialized and gendered systems of control/domination enabled by AFP.
Short abstract
Drawing on ethnographic research in data-centric biomedicine, this paper examines how synthetic data are used in model validation. It argues that rather than representing reality, synthetic datasets function as epistemic devices that reorganize regimes of visibility and validation in AI modelling.
Long abstract
This paper examines how synthetic data are used in the situated practices of model construction and validation in contemporary data-centric biomedicine. Drawing on ethnographic fieldwork and interviews with bioinformaticians, computational biologists, and computer scientists, we explore how synthetic datasets are deployed to test whether methods can recover known structures embedded in the data. In these settings, synthetic data function as epistemic devices through which the behaviour of models is rendered visible and assessable. We confront this empirical material with an analysis of the technical literature on explainable and trustworthy AI. These approaches focus on making inferential pathways visible in order to render model outputs interpretable and intelligible. We argue that this shift reflects a broader epistemological transformation involving regimes of visibility, epistemic virtues, and socially organised practices of seeing and representing. Our claim is that synthetic data do not introduce a radical break in scientific representation. Rather than their “synthetic” nature, what deserves scrutiny are the changing conditions under which validated knowledge is warranted. While the situated practices we studied were oriented toward aligning modelling workflows with particular regimes of visibility that define what counts as valid knowledge, the adoption of deep-learning AI systems (using both synthetic and real datasets) and the rise of explainability techniques shift the domain of the visible from data-model structures to the reliability and interpretability of model outputs. More broadly, the paper shows how AI and synthetic data reorganize regimes of validation by shifting attention from data-model relations to model output credibility.
Short abstract
Synthetic data is widely seen as threatening qualitative research. Drawing on anthropology's history of composites, pseudonyms, and constructed temporalities, this paper reframes LLM-generated data as continuing the fabrication practices through which qualitative knowledge was always produced.
Long abstract
Debates about synthetic data typically center on quantitative concerns: biased training sets, WEIRD behavioral models, the impossibility of demographic diversity. This paper shifts the terrain to qualitative knowledge production, where synthetic data poses a more fundamental epistemological challenge but where, paradoxically, synthetic fabrication has always been a constitutive method.
Drawing on anthropology's history of methodological fabrication, I demonstrate that qualitative research has long operated through structurally synthetic practices: pseudonymization that transforms observation into literary construction, composite characters, reconstructed dialogue, compressed temporalities, and the fieldnote itself as the first site where lived complexity becomes selective inscription. From Kroeber's "generalized dummies" of the 1920s through the Writing Culture debates to contemporary speculative ethnographies, the discipline has continuously trafficked in synthetic data while maintaining the fiction that its authority rests on unmediated empirical encounter.
This genealogy reframes LLM-generated qualitative data. Rather than asking whether synthetic interlocutors can substitute for "real" human responses, a framing that assumes we know what "real" means, I argue they should be evaluated by what thinking they enable. Following Trouillot's distinction between productive fiction and deceptive fake, the relevant criterion is transparency about constructedness, not fidelity to an empirical referent. Synthetic encounters function as methodological mirrors, externalizing interpretive labor qualitative researchers have always performed internally.
This argument carries broader implications for the ontological politics of synthetic data: the anxiety it provokes reveals less about machines than about disciplines' unexamined boundary work, namely the selective policing of which synthetic products count as empirical knowledge.
Short abstract
What kind of city gets generated when a digital twin increasingly runs on synthetic data rather than observed reality? This paper argues that the turn to synthetic data for modelling urban life marks a broader epistemic shift in how populations and spaces are defined, measured, and governed.
Long abstract
Urban digital twins are usually presented as data-driven mirrors of the city, assembled from sensors, administrative records, and predictive models. At the same time, many twins now operate, at least in part, on synthetic data: generated populations, mobility traces, and behavioural patterns used to fill gaps, satisfy privacy constraints, train models, or extend forecasting where empirical data are thin. This paper asks what happens when the twin is no longer copying a city but generating one, and what kind of city gets generated in the process.
It develops the concept of synthetic twinning to name the coupling of generative data practices with the infrastructural work of building a twin. Through this coupling, synthetic data render urban territory calculable by translating heterogeneous neighbourhoods, infrastructures, and movements into comparable variables and model-ready surfaces. Drawing on STS approaches to modelling and infrastructural politics, the paper examines how generative models define what counts as normal, plausible, or risky. They impose distributions, smooth extremes, and introduce artefacts of their own, including reduced variability and the disappearance of outlier places. These moves travel into municipal decision-making, reshaping what counts as evidence and where responsibility lands.
The argument draws on preliminary empirical work on the Bologna municipal digital twin, including interviews and project documentation. The case examines how synthetic data are mobilised where datasets are missing across space or time, and how decisions about realism and bias quietly shape what the twin can predict, and what those predictions come to justify in urban governance.
Short abstract
Generative AI models fine-tuned on Swedish Romantic fiction, the William Blake Archive, and Swedish medical periodicals generate “synthetic pasts.” We propose synthetic hermeneutics to examine how such machine co-produced histories reshape cultural patterns and historicity.
Long abstract
This presentation examines how generative AI models trained on historical corpora produce synthetic pasts that both reproduce and reconfigure cultural patterns.
Empirically, we work with multimodal datasets spanning Swedish Romantic fiction (Claes Livijn), the William Blake Archive, and postwar Swedish medical periodicals (SweMPer). These datasets are used to fine-tune state-of-the-art language and image diffusion models, generating non-existent historical texts and images such as completed nineteenth-century novels, Blakean poems and plates, synthetic medical case reports, and era-specific medical advertisements.
Treating these outputs as a form of synthetic data, we analyze them through posthumanist media theory, media archaeology, and Hayden White’s notion of the practical past. This allows us to explore how AI-generated materials function as epistemic probes into latent temporalities, genre conventions, and biases sedimented in large-scale cultural datasets. We argue that, in this context, synthetic data do not merely “fix” problems of access or bias. Rather, they enact the very question of a politics of representation—a question that cannot be adequately addressed without first understanding how technical systems co-produce historical knowledge.
By foregrounding the ways in which these systems participate in constructing historical knowledge, we raise new questions about authorship, authenticity, and the status of machine-generated historiography in both scholarly and artistic practice. We propose the notion of synthetic hermeneutics to describe how synthetic pasts operate as epistemological tools, using machine-generated history to probe, contest, and reconfigure what is permitted to count as historical knowledge.
Short abstract
This presentation explores the epistemic consequences of synthetic data in science by drawing on ethnographic fieldwork at NASA. It demonstrates how the epistemic stakes of synthetic data in one domain can differ from what is at stake in other societal contexts.
Long abstract
In the realm of programming, synthetic data often figures as a way to resolve shortages of data for training AI. However, the making, use and evaluation of synthetic data in science is more than a new technique to generate data – it is a set of practices that brings changes to the epistemic cultures in science (Knorr Cetina, 1999).
This presentation explores the epistemic consequences of synthetic data in science by drawing on ethnographic fieldwork at NASA. It contributes the concept of epistemic responsibility to theorize the nexus of negotiations between epistemic cultures and responsibility in scientific knowledge production. Moreover, it expands the concept of truth-spots (Gieryn, 2006; 2018) to the digital realm, opening up questions about the epistemic value ascribed to datasets.
By drawing on a case of science at NASA, this presentation demonstrates how the epistemic stakes of synthetic data in one domain can differ from what is at stake in other societal contexts.
Short abstract
Synthetic medical data is touted as privacy‑preserving but is not value‑neutral. Prioritizing privacy over transparency and scalability over fidelity erodes explainability, trust, and patient‑centered care, producing “zombie data,” more reification, and fungibility that reshape the medical realm.
Long abstract
Synthetic medical data (SMD) has been introduced to address data-acquisition challenges. It has been described as a “magical” solution that enables model training without compromising patients’ privacy and is therefore considered ethically desirable (Bellovin et al. 2019; Savage, 2023). However, as we will show in this article, synthetic data is not value-neutral, and it poses both epistemic and ethical trade-offs that cumulatively affect the medical realm. We focus on three primary methodologies: two key deep learning approaches, namely generative models without privacy guarantees and differential privacy (DP) models. In addition, we will use vine copula models as our primary example to demonstrate the statistical approach. We show that prioritizing privacy over transparency undermines explainability, which in turn hinders trust and fairness, pushing these values down the ladder. We also show that prioritizing scalability over fidelity moves medicine away from personalized, participatory, and patient-centered approaches and toward a naturalistic, mechanistic understanding. The collapse of context and experiential knowledge into “zombie data” yields a narrow conception of utility in medicine that ignores the patient as a person. The absence of a direct real-world reference (combined with the use of this data to train algorithms) is where SMD makes a step forward in the reification process relative to “regular” medical data. The data collected to represent reality may gradually subsume the very thing it is intended to represent. As such, SMD also introduces a new degree of reification in medicine, promoting fungibility, which offers clear advantages for data-intensive enterprises, over patients and minorities.