- Convenors:
- Camille Girard-Chanudet (EHESS)
- Assia Wirth
- Format:
- Combined Format Open Panel
Short Abstract:
This panel focuses on manual data annotation for machine learning purposes. It welcomes empirical studies conducted with annotators, documenting the doubts, inquiries, and choices that progressively shape training data sets and, thus, the results produced by AI.
Long Abstract:
The growing implementation of Artificial Intelligence (AI) technologies in “sensitive” fields such as health, justice, or surveillance has raised diverse concerns about algorithmic opacity. Efforts to crack open the “black box” of machine learning have mainly focused on coding architectures and practices, on the one hand, and on the constitution of training data sets, on the other. Both components of machine learning dispositifs are, however, held together by an essential link that has largely remained on the fringes of AI studies: the manual annotation of data by human professionals.
Annotation work consists of the manual and meticulous labeling of documents (pictures, texts…) with a desired outcome that the algorithmic model will then reproduce. It can be undertaken by various categories of actors. While annotation conducted by underpaid and outsourced workers has been well documented, these activities can also be carried out by qualified workers, including within prestigious professions. All of these empirical cases raise questions about the micro-doubts, inquiries, choices, and overall expertise that progressively shape training data sets and, thus, the results produced by AI. Data labelling is about putting classification systems into practice, defining categories and their empirical borders, and constructing information infrastructures. Despite this strong political impact, data annotation remains largely invisible.
This panel welcomes papers addressing the question of annotation from an empirical perspective. The contributions may include:
- Ethnographic studies documenting the practices of annotation work in particular contexts.
- Historical studies situating annotation work for AI in a broader genealogy of classification instruments and inscription practices.
- Organizational studies describing the effects of different institutional settings (in-house, outsourced, subcontracted etc.) and social configurations (gender, nationality, socioeconomic background etc.) on the annotation process.
The last session will be organized as an open workshop, aimed at drawing out research questions and perspectives.
Accepted contributions:
Session 1
Clément Le Ludec (Télécom Paris)
Long abstract:
Based on a field study carried out between France and Madagascar, we propose to analyse the construction of skills in the context of AI-induced professions, and in particular data annotation. What skills are used in data annotation? How are these skills prescribed by companies?
We combine the goal of making data workers visible with a reflection on the relative visibility of their skills in the value chain. We believe that what is invisible in the value chain is not so much the contribution of workers, but their level of skills and the level of expertise required to produce quality data.
We show that the level of workers' skills is made visible to other actors in the chain for specific purposes: guaranteeing data quality for AI companies, and selling services for data annotation companies. Ultimately, what makes the workers' skills invisible is the coloniality of the value chain: they are considered unskilled because they are Malagasy, not because they do data annotation.
SJ Bennett (Durham University) Fabio Tollon (University of Edinburgh) Benedetta Catanzariti (University of Edinburgh)
Long abstract:
Much of today’s AI development requires a vast and distributed network of data workers who sort through, clean, and annotate the data used to train machine learning models. However, this network is often represented asymmetrically, with a central focus on the contributions of AI practitioners, which are positioned as pivotal, whilst other forms of labour, such as annotation, are seen as ad hoc and as having little cumulative impact. These representations draw upon practitioner accounts, but rarely interrogate their underlying assumptions. This paper investigates AI practitioners’ conceptions of how data annotation fits into data structures, their representations of the workers engaged in this type of data work, and how these representations shape data structures themselves. Drawing on workshops conducted with machine learning practitioners, we explore experiences of data ‘wrangling’, or practices of data acquisition, cleaning, and annotation, as the point where AI practitioners interface with domain experts and data annotators. In exploring these practices, we move beyond the simple recognition of data workers’ ‘invisibility’ to examine the political role of epistemic framings of the data work that underpins AI development and how these framings can shape data workers’ agency. Finally, we reflect on the implications of our findings for developing more participatory and equitable approaches to AI.
Oceane Fiant (Université de technologie de Compiègne)
Long abstract:
The engineer's approach to the issue of machine learning models’ opacity might involve opting for simpler models (like linear regression or decision trees), or adopting techniques from the explainable artificial intelligence field (such as Local Interpretable Model-Agnostic Explanations, Shapley Additive Explanations, among others). The latter option simplifies understanding the model’s decision-making, for instance by highlighting how specific features influence the model’s output.
In my talk, I will present a case where the challenge of opacity is addressed through careful training set construction, rather than model explainability. This project, a collaboration between a pathologist and an engineer, aims to create a dataset of breast cancer tumor images. This dataset will then be used to train convolutional neural networks to identify tumor components on whole slide images. The goal of my presentation is to review this project’s innovative solution and to derive broader insights regarding the issue of explainability of artificial intelligence systems.
Janina Zakrzewski (Weizenbaum-Institute)
Long abstract:
As AI systems are increasingly integrated into high-stakes contexts, machine learning (ML) in medical care and research has gained pronounced attention for its potential to improve healthcare. In settings such as medical image analysis in radiology, ML promises to assist decision-making and to serve as a tool in scientific research. Data, both in its large volumes, sourced across the complex socio-technical system of healthcare, and in its quality, is integral to the functioning of ML. This development has led to a reconfiguration of data work and occupational expertise in healthcare to accommodate the need for annotating medical care and research data (Bossen et al., 2019). Within this opaque endeavor, further empirical analysis of data annotation is crucial for understanding how the developers of data annotation and the professionals who perform annotation tasks collaborate and shape data sets, through their data collection and their modes of documenting patient health information (e.g. medical conditions, symptoms, treatments, medical images), in order to translate it for ML medical research purposes. This paper seeks to contribute to this need by investigating the design and decision-making processes that underpin the development of data annotation for medical research via ML. To do so, it draws on empirical findings from an ongoing ethnography of a large-scale health data project that also designs data annotation for broad adoption in ML medical research.
Nanna Thylstrup
Long abstract:
This article delves into the intersection of machine learning models and news production, focusing on data annotation practices within a Danish news organization. Through an ethnographic lens, the study examines how data workers (data annotators, model builders and knowledge brokers) within a news organization navigate the integration of AI-driven solutions into the editorial process. Drawing on six months of fieldwork conducted in the "backroom of data science," the research sheds light on the intricacies of articulating, aligning, and challenging editorial values through the development of transformer models for news production.
Methodologically, the paper zooms in on the in-house manual annotation process of datasets used for article generation and recommender systems, highlighting the uncertainties, frictions, and negotiations inherent in these procedures. Theoretical underpinnings stem from critical data studies and Science and Technology Studies (STS) frameworks, specifically focusing on "critical dataset studies", the "logic of domains," and the concept of "science frictions." By leveraging these frameworks, the study elucidates the complexities that arise when different domains, such as journalism and data science, converge.
The findings contribute to our understanding of how data annotators negotiate the tension between reflecting and shaping the world, shedding light on the ways in which editorial values are negotiated and redefined in the digital age. Furthermore, the paper contributes to work on the backstage micro-decisions that shape contemporary news production.
So Yeon Leem (Dong-A University)
Long abstract:
The rise of Large Language Models (LLMs) like ChatGPT is reshaping the social discourse on AI technology, bringing copyright concerns over texts and images used in AI training data to the forefront. Notably, the New York Times' December 2023 lawsuit against OpenAI and Microsoft over copyrighted content in ChatGPT's training data exemplifies this shift. This study scrutinizes whether financial compensation for copyright infringement is the sole ethical countermeasure to data extractivism, which inherently regards all content, copyrighted or not, as mere data for AI enhancement. I explore data extractivism through the lens of the low-wage, non-professional laborers tasked with converting meticulously written texts into 'AI fodder' following pre-set manuals. Since 2022, I have been conducting participant observation within a Korean multidisciplinary team developing an AI model for generating novels in English and Korean. This research highlights a group of experts, primarily literature PhD candidates, transforming novel texts into AI training data. Their professional expertise and passion for literature underscore the complexities of reducing literary works to data. Their anxiety and frustration, I argue, affirm that annotation involves 'text with care' (Leedham et al., 2021), presenting an opportunity for developing AI ethically, in opposition to data extractivism.
Seyi Olojo (University of California, Berkeley)
Long abstract:
This paper presents insights from an interview study conducted with native Nigerian-language speakers. We explore community-based perceptions of effective annotation practice for three Nigerian languages: Yoruba, Igbo and Hausa. Participants discuss ideal annotation practices that diverge from the dominant 'Anglophone lens', ways of knowing that characterize the colonial and hegemonic power of the English language. In doing so, they redefine annotation expertise on their own terms, integrating cultural norms that are inherent to their identity. Therefore, the presentation of such “non-traditional” expertise both exposes and challenges the epistemic violence of Western annotation practices. Additionally, such indigenous expertise provides an opportunity for the ethical and accurate development of machine translation for Nigerian languages.
Héloïse Eloi-Hammer (Sciences Po Paris)
Long abstract:
This paper is based on fieldwork conducted in a French startup providing a “predictive justice” tool. The tool allows legal experts to estimate the probable outcome of a case by modeling the decision processes of a panel of a hundred fictional judges. To do so, it relies on the annotation and analysis of past court decisions that feed the artificial intelligence model. Aiming to better understand how such tools are conceived, this contribution takes as its object the annotation process implemented by the legaltech.
The annotation itself is handled mostly by law students from outside the company, while the analysis grid is designed by legal experts who work for the startup. Although this division of labor remained stable over time, the annotation process evolved: after using Excel spreadsheets to analyze court decisions, law students were provided with software developed by the legaltech.
Based on an interview survey and a study of documents obtained in the field, this paper analyzes the role and metamorphoses of the annotation process, showing how it has been shaped by the needs of interdisciplinary dialogue. Indeed, the annotation process is at the heart of multiple “translation” processes (Callon, 1986): mathematicians, developers, and legal experts must collaborate to produce a tool that is both efficient and useful. In this context, the result of the annotation can be seen as a hybrid language, aiming to satisfy both legal reasoning and the needs of computer scientists.