- Convenors:
- Camille Girard-Chanudet (EHESS)
- Assia Wirth
- Format:
- Combined Format Open Panel
- Location:
- HG-11A33
- Sessions:
- Friday 19 July
Time zone: Europe/Amsterdam
Short Abstract:
This panel focuses on manual data annotation for machine learning purposes. It welcomes empirical studies conducted with annotators, documenting the doubts, inquiries, and choices that progressively shape the training data sets and, thus, the results produced by AI.
Long Abstract:
The growing implementation of Artificial Intelligence (AI) technologies in “sensitive” fields such as health, justice or surveillance has triggered diverse concerns about algorithmic opacity. Efforts to crack open the “black box” of machine learning have mainly focused on coding architectures and practices, on the one hand, and on the constitution of training data sets, on the other. Both of these components of machine learning dispositives are, however, held together by an essential link that has largely remained on the fringes of AI studies: the manual annotation of data by human professionals.
Annotation work consists of the manual and meticulous labeling of documents (pictures, texts…) with a desired outcome that the algorithmic model will then reproduce. It can be undertaken by various categories of actors. While annotation conducted by underpaid and outsourced workers has been well documented, these activities can also be taken on by qualified workers, even within prestigious professions. All of these empirical cases raise questions regarding the micro doubts, inquiries, choices, and overall expertise that progressively shape the training data sets and, thus, the results produced by AI. Data labeling is about putting classification systems into practice, defining categories and their empirical borders, and constructing information infrastructures. Despite these strong political impacts, data annotation remains largely invisible.
This panel welcomes papers addressing the question of annotation from an empirical perspective. The contributions may include:
- Ethnographic studies documenting the practices of annotation work in particular contexts.
- Historical studies situating annotation work for AI in a broader genealogy of classification instruments and inscription practices.
- Organizational studies describing the effects of different institutional settings (in-house, outsourced, subcontracted, etc.) and social configurations (gender, nationality, socioeconomic background, etc.) on the annotation process.
The last session will be structured as an open workshop, aimed at drawing out research questions and perspectives.
Accepted contributions:
Session 1 Friday 19 July, 2024
Short abstract:
This paper empirically investigates AI practitioners’ conceptions of how data annotation fits into data structures, their representations of the workers engaged with this type of data work, and how these representations shape data structures themselves.
Long abstract:
Much of today’s AI development requires a vast and distributed network of data workers who sort through, clean, and annotate the data used to train machine learning models. However, this network is often represented asymmetrically, with a central focus on the contributions of AI practitioners, which are positioned as pivotal, whilst other forms of labour, such as annotation, are seen as ad hoc and as having little cumulative impact. These representations draw upon practitioner accounts, but rarely interrogate their underlying assumptions. This paper investigates AI practitioners’ conceptions of how data annotation fits into data structures, their representations of the workers engaged with this type of data work, and how these representations shape data structures themselves. Drawing from workshops conducted with machine learning practitioners, we explore experiences of data ‘wrangling’, or practices of data acquisition, cleaning, and annotation, as the point where AI practitioners interface with domain experts and data annotators. In exploring these practices, we move beyond the simple recognition of data workers’ ‘invisibility’ to examine the political role of the epistemic framings of data work that underpin AI development and how these framings can shape data workers’ agency. Finally, we reflect on the implications of our findings for developing more participatory and equitable approaches to AI.
Short abstract:
Exploring ethical AI development, this study focuses on literature experts annotating texts for AI in Korea, highlighting their struggles against data extractivism and advocating for 'text with care' as an ethical AI development practice.
Long abstract:
The rise of Large Language Models (LLMs) like ChatGPT is reshaping the social discourse on AI technology, bringing copyright concerns about the texts and images used in AI training data to the forefront. Notably, the New York Times' December 2023 lawsuit against OpenAI and Microsoft over copyrighted content in ChatGPT's training data exemplifies this shift. This study scrutinizes whether financial compensation for copyright infringement is the sole ethical countermeasure to data extractivism, which inherently regards all content, copyrighted or not, as mere data for AI enhancement. I explore data extractivism through the lens of the low-wage, non-professional laborers tasked with converting meticulously written texts into 'AI fodder' following pre-set manuals. Since 2022, I have been conducting participant observation within a Korean multidisciplinary team developing an AI model for generating novels in English and Korean. This research highlights a group of experts (primarily literature PhD candidates) transforming novel texts into AI training data. Their professional expertise and passion for literature underscore the complexities of reducing literary works to data. Their anxiety and frustration, I argue, affirm that annotation involves 'text with care' (Leedham et al., 2021), presenting an opportunity for developing AI ethically, in opposition to data extractivism.
Short abstract:
This presentation centers on an AI project in pathology, addressing AI opacity in a unique way. Rather than striving to make the models explainable, the focus is on meticulously building their training sets. From this case, I aim to draw broader lessons about explainability of AI systems.
Long abstract:
The engineer's approach to the issue of machine learning models’ opacity might involve opting for simpler models (like linear regression or decision trees), or adopting techniques from the explainable artificial intelligence field (such as Local Interpretable Model-Agnostic Explanations, Shapley Additive Explanations, among others). The latter option simplifies understanding the model’s decision-making, for instance by highlighting how specific features influence the model’s output.
In my talk, I will present a case where the challenge of opacity is addressed through careful training set construction, rather than model explainability. This project, a collaboration between a pathologist and an engineer, aims to create a dataset of breast cancer tumor images. This dataset will then be used to train convolutional neural networks to identify tumor components on whole slide images. The goal of my presentation is to review this project’s innovative solution and to derive broader insights regarding the issue of explainability of artificial intelligence systems.
Short abstract:
This article explores machine learning's impact on news production, focusing on data annotation practices in a Danish news organization. Through ethnographic study, it examines how data workers shape editorial processes, revealing tensions and negotiations.
Long abstract:
This article delves into the intersection of machine learning models and news production, focusing on data annotation practices within a Danish news organization. Through an ethnographic lens, the study examines how data workers (data annotators, model builders and knowledge brokers) within a news organization navigate the integration of AI-driven solutions into the editorial process. Drawing on six months of fieldwork conducted in the "backroom of data science," the research sheds light on the intricacies of articulating, aligning, and challenging editorial values through the development of transformer models for news production.
Methodologically, the paper zooms in on the in-house manual annotation process of datasets used for article generation and recommender systems, highlighting the uncertainties, frictions, and negotiations inherent in these procedures. Theoretical underpinnings stem from critical data studies and Science and Technology Studies (STS) frameworks, specifically focusing on "critical dataset studies", the "logic of domains," and the concept of "science frictions." By leveraging these frameworks, the study elucidates the complexities that arise when different domains, such as journalism and data science, converge.
The findings contribute to our understanding of how data annotators negotiate the tension between reflecting and shaping the world, and of how editorial values are negotiated and redefined in the digital age. Furthermore, the paper contributes to work that sheds light on the backstage micro-decisions that shape contemporary news production.
Short abstract:
This paper presents findings from a study conducted with Nigerian-language speakers. We explore indigenous perceptions of annotation practice for Yoruba, Igbo and Hausa. Participants redefine annotation expertise on their own terms, integrating cultural norms that are inherent to their identity.
Long abstract:
This paper presents insights from an interview study conducted with native Nigerian-language speakers. We explore community-based perceptions of effective annotation practice for three Nigerian languages: Yoruba, Igbo and Hausa. Participants discuss ideal annotation practices that diverge from the dominant 'Anglophone lens', ways of knowing that characterize the colonial and hegemonic power of the English language. In doing so, they redefine annotation expertise on their own terms, integrating cultural norms that are inherent to their identity. Therefore, the presentation of such “non-traditional” expertise both exposes and challenges the epistemic violence of Western annotation practices. Additionally, such indigenous expertise provides an opportunity for the ethical and accurate development of machine translation for Nigerian languages.
Short abstract:
This paper is based on fieldwork carried out in a French startup offering a “predictive justice” service. It will question the role of the legal experts annotating court decisions in order to feed AI models, and analyze the evolution of this annotation process over time.
Long abstract:
This paper is based on fieldwork conducted in a French startup providing a “predictive justice” tool. This tool allows legal experts to estimate the probable outcome of a case by modeling the decision processes of a panel of a hundred fictional judges. In order to do so, it draws on the annotation and analysis of past court decisions that feed the artificial intelligence model. Aiming at a better understanding of how such tools are conceived, this contribution will take as its object the annotation process implemented by the legaltech company.
The annotation itself is handled mostly by law students from outside the company, but the analysis grid is designed by law experts who work for the startup. While this division of labor was stable over time, the annotation process evolved: after using Excel spreadsheets to analyze court decisions, law students were provided with software developed by the legaltech company.
Based on an interview survey and on a study of documents obtained in the field, this paper will analyze the role and metamorphoses of the annotation process, and show how it has been shaped by the needs of interdisciplinary dialogue. Indeed, the annotation process is at the heart of multiple “translation” (Callon, 1986) processes: mathematicians, developers and law experts have to collaborate to produce a tool that is both efficient and useful. In this context, the result of the annotation can be seen as a hybrid language, aiming at satisfying both legal reasoning and the needs of computer scientists.