- Convenors:
- Juana Catalina Becerra Sandoval (IBM Research)
- Edward B. Kang (New York University)
- Format:
- Traditional Open Panel
- Location:
- NU-5A47
- Sessions:
- Wednesday 17 July, -
Time zone: Europe/Amsterdam
Short Abstract:
Machine listening AI systems are increasingly being used across medical, financial, and security infrastructures. This panel explores the epistemic question of what it means to listen, and more specifically, how listening is transformed through the essentialist logics of artificial intelligence.
Long Abstract:
Listening through and with machines has a centuries-long history in the form of technologies like the stethoscope, sound spectrograph, and telephone, among others. The more recent development of artificial intelligence (AI) technologies that extract, collect, quantify, and parametrize sounds on an unprecedented scale to manage information, make predictions, and generate artificial media, however, has positioned the intersection of AI and sound as “the ‘next frontier’ of AI/ML” (Kang 2023). Referred to as machine listening systems, these technologies are embedded into medical, financial, security, surveillance and workplace infrastructures with crucial implications for how society is and will be organized. In this way, machine listening systems add new valence to the epistemic question of what it means to listen, and more specifically, how listening – as a constructive epistemological process of projection, as opposed to reception – is transformed in and through the essentialist logics of artificial intelligence and machine learning (ML). Indeed, machine listening stands to reconfigure ideas around the body, identity, voice, and space, as well as complicate the relationship between ‘listening’ and ‘objectivity,’ especially in contexts such as law and science. To fill the gap in existing critical AI scholarship that has largely focused on computer vision, this panel invites Science & Technology Studies (STS) scholars interested in the relationship between AI and sound. This includes topics such as voice biometrics, acoustic gunshot detection, speech emotion recognition, accent-matching, and other forms of forensic and medical sound analysis, but also extends to machine listening systems that collect audio data for use in AI models that transform and produce music and speech.
We are especially keen on receiving submissions that engage with questions of epistemology and politics as articulated through feminist, critical race theory, crip, decolonial, and other frameworks grounded in material analyses of power.
Accepted papers:
Session 1 Wednesday 17 July, 2024, -
Short abstract:
Emerging research on vocal biomarkers for autism diagnosis mobilizes machine learning to enact the ‘autistic voice’ as an entity rooted in individual biology. How might histories of queer voices unsettle current attempts at making minoritarian identities legible through machine listening?
Long abstract:
Following technological developments in the field of artificial intelligence and ubiquitous computing, human voices are increasingly cast as a repository of rich information to be extracted through machine learning. In the medical field, the quest for ‘vocal biomarkers’ exemplifies this notion of the human voice as a correlate of, and an avenue into, a range of physio- and psychopathological states. Research on digital biomarkers, of which vocal biomarkers are a sub-type, aims at establishing a direct link between ‘the biological’ and ‘the digital’ by biologizing both the digital traces left behind by individuals, and the physio- and psychopathological conditions they are supposed to stand in for. Although no vocal biomarkers have been approved for clinical use yet, major investments are being made in their discovery, and they are being mobilized by medtech startups. This presentation asks what a voice can (be made to) do when enhanced through big data and machine learning. Specifically, it centers on research on vocal biomarkers for autism diagnosis. Targeting autism as a diagnosis troubled to this day by its lack of biomarkers, this emerging field of research mobilizes machine learning to enact the ‘autistic voice’ as an entity with stable and unchanging characteristics across individuals. Reading the history of the autistic voice, and its recent machine learning-fueled developments, in parallel to past and present attempts at biologizing queer voices, this presentation speculates on how queer theory and lavender linguistics might unsettle contemporary attempts at making minoritarian identities legible through machine listening.
Short abstract:
This paper examines the history of speech recognition technology, from pre-electronic formulations to contemporary AI systems, to understand how bias has been embedded in these systems throughout their history of development.
Long abstract:
Machine listening technologies, specifically speech recognition, have been designed to not take into account variations in speech. This area of research has been (intentionally) neglected, and the creators of this technology have known that this problem has existed since the technology's earliest formulations.
Alexander Graham Bell, along with his father, devised a method of visualizing speech in order to teach the Deaf and hard of hearing to speak. This project of oralism is the starting point for disciplining bodies to conform to a type of regularized speech production, where one must perform speech for a hearing body (biological or technological) in order to access services, society, and recognition of one’s identity, personhood, culture, and humanity.
This paper traces the early origins of producing a taxonomy of speech (Visible Speech), the development of theories of how speech is produced and how meaning is carried through vocal signal (Dudley and information theory), through the invention of electronic and digital technologies of speech recognition (at Bell Labs and elsewhere), and finally to our modern-day technologies powered by AI systems, constructed with the logic of machine learning in mind. Interwoven with this history is an account of how we as a society have evolved our thinking around speech and its recognition in popular culture, and how this has shifted how we relate to these technologies, our expectations of how they function, and how we might internalize a sense of how we ought to speak in order to be heard by them.
Short abstract:
This article explores audio event detection algorithms for human-machine interaction, analyzing the philosophical and mathematical challenges of audio similarity through both historical and media theoretical perspectives.
Long abstract:
Audio event detection (AED) is becoming an increasingly significant cog in contemporary human-machine interaction (Sterne, Mehak, Kalfou: 2022, Kang: 2022). Such algorithms are dependent on the metrical determination of similitude (Mackenzie: 2017). However, the task of determining similarity is never trivial, and when adapted to the framework of acoustics, it conceals a range of philosophical problems, from the objective qualities of the sound object to the mathematical problem of geometric distance. To start unpacking the assumptions behind audio similarity metrics, this article examines and historicizes a sample of popular applications.
The analysis focuses on a set of contemporary models trained on the state-of-the-art data set AudioSet (Parker & Dockray: 2023). Close analysis of the data processing involved in these algorithms can reveal the key metrical interventions in producing similarity scores for audio files. This entails a critical reading of how audio fares under the condition of, for example, cosine distance and K-means methods. After considering the implications of these components for the classification of sound, the analysis proceeds to trace the historical practices of these metrics, discussing the enduring influence of these mathematical trajectories on today's technological practices. In combining media theory and the history of applied mathematics, the aim is to contribute to our understanding of the cultural implications of sonic similarity in our contemporary digital landscape.
Short abstract:
Audio compression algorithms operate by identifying & eliminating excess sound, thereby reproducing political partitions between voice/noise. This paper situates these partitions in the history of political thought as products of struggles over definitions of logos and the political subject.
Long abstract:
This paper explores the political and economic rationalities that are (re)produced by the compression algorithms involved in audio data transmission and formatting. While economic logics of compression are largely expressed through the terms of an empiricist paradigm (e.g., where a message is said to be compressed when all “excess” has been eliminated and its “essence” has been preserved, thereby producing surplus bandwidth), the operation of identifying and eliminating excess sound exists within a pre-existing realm of the sensible, one that is entangled with political partitions of silence/sound and voice/noise that have already been drawn in advance. For instance, a number of political assumptions underwrite the construction of so-called “redundant” and “unnecessary” audio data; these are conceptions of legibility and intelligibility that are situated in a broader history of political thought, one that begins with Plato and Aristotle’s distinction between logos and phoné, or ‘reasoned speech’ and ‘noise’, which functioned to delineate the political subject from the non-subject. This paper seeks to surface the ways that audio data compression algorithms participate in this legacy through the construction of models that trace these distributions of sense and subjects. Then, drawing from a rich critical tradition invested in the deconstruction of such notions of the subject and of logos (in particular, from the work of Jacques Derrida and Jacques Rancière), AI models can be understood to always leave behind a residue that cannot be contained by orders of the sensible.
Short abstract:
This paper traces the application of transformer network architectures to the domain of speech emotion recognition (SER). I aim to highlight the limitations of achieving a 'general purpose model' that can be applied to the 'whole wide world' when confronted with the human mind.
Long abstract:
This paper traces the application of transformer network architectures to the domain of speech emotion recognition (SER). While there is much literature on computational linguistics and image recognition, the study of the material-semiotic specificity of transformer architectures within the audio domain is limited. How does a tool for knowing the world through spatial objects in the visual realm become a tool for knowing unstable objects, such as emotions (e.g. inner mental states), through the sonic realm? And what does this type of ‘transfer learning’ say about a brain-inspired connectionist AI paradigm? I answer these questions through an empirical study of a European research community that develops speech emotion recognition systems for the private and public sector. The principle that underlies the transformer is the fact that it is emptied out of theory. In other words, practitioners try to strip the model of any domain knowledge to circumvent the issue of applying stable labels to unstable objects. This phenomenon becomes problematic in the context of SER that quantifies the human mind through vocal signals. This paper, therefore, highlights the limitations of achieving a 'general purpose model' that can be applied to the 'whole wide world' when confronted with the human mind, and the problematic fact that these systems try to include inner mental states as a formal category.