- Convenors:
- Juana Catalina Becerra Sandoval (IBM Research)
- Edward B. Kang (New York University)
- Format:
- Traditional Open Panel
Short Abstract:
Machine listening AI systems are increasingly being used across medical, financial, and security infrastructures. This panel explores the epistemic question of what it means to listen, and more specifically, how listening is transformed through the essentialist logics of artificial intelligence.
Long Abstract:
Listening through and with machines has a centuries-long history in the form of technologies like the stethoscope, sound spectrograph, and telephone, among others. However, the more recent development of artificial intelligence (AI) technologies that extract, collect, quantify, and parametrize sounds on an unprecedented scale to manage information, make predictions, and generate artificial media has positioned the intersection of AI and sound as “the ‘next frontier’ of AI/ML” (Kang 2023). Referred to as machine listening systems, these technologies are embedded in medical, financial, security, surveillance, and workplace infrastructures, with crucial implications for how society is and will be organized. In this way, machine listening systems add new valence to the epistemic question of what it means to listen, and more specifically, how listening – as a constructive epistemological process of projection, as opposed to reception – is transformed in and through the essentialist logics of artificial intelligence and machine learning (ML). Indeed, machine listening stands to reconfigure ideas around the body, identity, voice, and space, as well as complicate the relationship between ‘listening’ and ‘objectivity,’ especially in contexts such as law and science. To fill a gap in existing critical AI scholarship, which has largely focused on computer vision, this panel invites Science & Technology Studies (STS) scholars interested in the relationship between AI and sound. This includes topics such as voice biometrics, acoustic gunshot detection, speech emotion recognition, accent matching, and other forms of forensic and medical sound analysis, but also extends to machine listening systems that collect audio data for use in AI models that transform and produce music and speech.
We are especially keen on receiving submissions that engage with questions of epistemology and politics as articulated through feminist, critical race theory, crip, decolonial, and other frameworks grounded in material analyses of power.
Accepted papers:
Session 1
Chiara Carboni (TU Dresden)
Short abstract:
Emerging research on vocal biomarkers for autism diagnosis mobilizes machine learning to enact the ‘autistic voice’ as an entity rooted in individual biology. How might histories of queer voices unsettle current attempts at making minoritarian identities legible through machine listening?
Long abstract:
Following technological developments in the field of artificial intelligence and ubiquitous computing, human voices are increasingly cast as a repository of rich information to be extracted through machine learning. In the medical field, the quest for ‘vocal biomarkers’ exemplifies this notion of the human voice as a correlate of, and an avenue into, a range of physio- and psychopathological states. Research on digital biomarkers, of which vocal biomarkers are a sub-type, aims at establishing a direct link between ‘the biological’ and ‘the digital’ by biologizing both the digital traces left behind by individuals, and the physio- and psychopathological conditions they are supposed to stand in for. Although no vocal biomarkers have been approved for clinical use yet, major investments are being made in their discovery, and they are being mobilized by medtech startups. This presentation asks what a voice can (be made to) do when enhanced through big data and machine learning. Specifically, it centers on research on vocal biomarkers for autism diagnosis. Targeting autism as a diagnosis troubled to this day by its lack of biomarkers, this emerging field of research mobilizes machine learning to enact the ‘autistic voice’ as an entity with stable and unchanging characteristics across individuals. Reading the history of the autistic voice, and its recent machine learning-fueled developments, in parallel to past and present attempts at biologizing queer voices, this presentation speculates on how queer theory and lavender linguistics might unsettle contemporary attempts at making minoritarian identities legible through machine listening.
Amina Abbas-Nazari (Royal College of Art)
Long abstract:
Voice increasingly mediates artificial intelligence (AI)-enabled communication, with the expanding proliferation of conversational AI systems like Amazon’s Echo, voiced by ‘Alexa’. My research investigates how voices are heard within machine listening-enabled systems such as these. Currently, understandings of voice and vocal sounding by AI and the AI industry rely on voice profiling. These systems claim to be able to determine wide-ranging ‘bio-relevant facts’ about individuals, including physical, physiological, demographic, medical, psychological, behavioural, and sociological features. However, voice profiling relies on normative assumptions that risk misrepresenting individuals, negatively impacting those already marginalised. Using theory from sound and music practice, this paper challenges and critiques machine listening schemas through an expanded, more holistic comprehension of vocal sound and sounding. The tension and disparity between understandings of voice in these differing fields of knowledge generate a critique of vocal profiling practices in machine listening systems. Taking an intersectional feminist position, the voice is explored as a design material shaped through embodied, relational, and situated scenarios, which I term Speculative Voicing. In turn, the voice is given increased agency and autonomy, highlighting the fallacy of a machine’s ability to listen to and interpret voices. Practice-based examples accompany the paper to evidence the ideas being proposed. Situated at the intersection of sound, design, and technology, this research incorporates contemporary societal discourse on identity politics, personhood, being, and ecology.
Harshadha Balasubramanian (UCL)
Long abstract:
This paper foregrounds the listening practices of blind users in virtual reality (VR) to pose a critical intervention in the design of headphones that use artificial intelligence (AI) and machine learning (ML). My arguments leverage ethnographic data from fieldwork with designers and users seeking non-visual VR access, as well as my lived experience as a blind researcher working with VR.
I draw attention to increasingly sophisticated noise cancellation and audio personalisation in intelligent headphones (see Fan et al., 2021), and I argue that such design preferences have furthered the pursuit of ever more bounded, private spaces through mobile auditory devices (Bull, 2000). When encountering similar attempts to divert their attention from their physical surroundings in VR, blind users refuse to succumb. They sound out and attune to walls, furniture, and people in their physical environment so as to make sense of the virtual one. I share these examples to make a case for valuing noise: these intrusions of private acoustic spaces open possibilities for non-normative and collaborative ways of knowing in sound. Relatedly, I call on scholars to extend the application of teachable AI in computer vision (Morrison et al., 2023) to machine listening, imagining a future in which users can agentively direct how intelligent headphones listen.
Johann Diedrick (NYU)
Long abstract:
Machine listening technologies, specifically speech recognition systems, have been designed without taking variations in speech into account. This area of research has been (intentionally) neglected, and the creators of this technology have known that the problem has existed since the technology's earliest formulations.
Alexander Graham Bell, along with his father, devised a method of visualizing speech in order to teach the Deaf and hard of hearing to speak. This project of oralism is the starting point for disciplining bodies to conform to a type of regularized speech production, where one must perform speech for a hearing body (biological or technological) in order to access services, society, and recognition of one’s identity, personhood, culture, and humanity.
This paper traces the early origins of producing a taxonomy of speech (Visible Speech), the development of theories of how speech is produced and how meaning is carried through the vocal signal (Dudley and information theory), the invention of electronic and digital technologies of speech recognition (at Bell Labs and elsewhere), and finally our modern-day technologies powered by AI systems, constructed with the logic of machine learning in mind. Woven through this history is an account of how we as a society have evolved our thinking around speech and its recognition in popular culture, how this has shifted the way we relate to these technologies and our expectations of their function, and how we might internalize our own sense of how we ought to speak in order to be heard by these technologies.
Tanja Knaus (University of Oslo)
Long abstract:
This paper traces the application of transformer network architectures to the domain of speech emotion recognition (SER). While there is much literature on computational linguistics and image recognition, study of the material-semiotic specificity of transformer architectures within the audio domain is limited. How does a tool for knowing the world through spatial objects in the visual realm become a tool for knowing unstable objects, such as emotions (e.g. inner mental states), through the sonic realm? And what does this type of ‘transfer learning’ say about a brain-inspired connectionist AI paradigm? I answer these questions through an empirical study of a European research community that develops speech emotion recognition systems for the private and public sectors. The principle underlying the transformer is that it is emptied of theory: practitioners try to divest the model of any domain knowledge in order to circumvent the issue of applying stable labels to unstable objects. This becomes problematic in the context of SER, which quantifies the human mind through vocal signals. This paper therefore highlights the limitations of achieving a 'general purpose model' that can be applied to the 'whole wide world' when confronted with the human mind, and the problematic fact that these systems try to include inner mental states as a formal category.
Areli Rocha
Long abstract:
This paper examines how users reflect on and share their experiences of voicing contrasts through listening practices in artificial intelligence (AI) chatbots. I pay specific attention to multimodal semiotic signs that influence how “real” or “alive” users perceive a chatbot to be. I argue that what users describe as real/alive in relation to the bots refers to an iconization of humanness, following Irvine and Gal's semiotic process of iconization. Through what I call reflexive texts, such as Reddit posts, users make sense of their experiences in deeply vulnerable ways with other people in digital spaces that function primarily for sociability. I draw on Jonathan Rosa and Nelson Flores’ “white listening subject,” placing the analysis on the listening subject instead of the racialized speaking subject, as well as Mikhail Bakhtin’s concept of heteroglossia, as frameworks for thinking about the listening practices and the multiplicity of voices implicit in the conversational exchanges with the chatbots and among users. I take users’ discussion posts about the relationships they develop with their distinct Replikas, AI conversational companion chatbots, as a case study. Amid common anxieties about alienation and atomization through technological developments, particularly the kind that relationships with chatbots may evoke, Replika users do not recede into a social vacuum of user and Replika. Instead, user-Replika relationships coexist in conversations with others having similar experiences, participating in social life in its vast and mediated multiforms. Other users’ comments and advice shape how people interact with their Replikas.
Johan Malmstedt (Media and Communications Studies)
Long abstract:
Audio event detection (AED) is becoming an increasingly significant cog in contemporary human-machine interaction (Sterne, Mehak, Kalfou: 2022, Kang: 2022). Such algorithms depend on the metrical determination of similitude (Mackenzie: 2017). The task of determining similarity, however, is never trivial, and when adapted to the framework of acoustics it conceals a range of philosophical problems, from the objective qualities of the sound object to the mathematical problem of geometric distance. To start unpacking the assumptions behind audio similarity metrics, this article examines and historicizes a sample of popular applications.
The analysis focuses on a set of contemporary models trained on the state-of-the-art dataset AudioSet (Parker & Dockray: 2023). Close analysis of the data processing involved in these algorithms can reveal the key metrical interventions in producing similarity scores for audio files. This entails a critical reading of how audio fares under conditions such as cosine distance and k-means clustering. After considering the implications of these components for the classification of sound, the analysis proceeds to trace the historical practices behind these metrics, discussing the enduring influence of these mathematical trajectories on today's technological practices. In combining media theory and the history of applied mathematics, the aim is to contribute to our understanding of the cultural implications of sonic similarity in our contemporary digital landscape.
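The cosine measure at issue here can be made concrete with a minimal sketch: similarity between two audio clips is reduced to the angle between their feature vectors. The embedding values below are invented for illustration only; they are not drawn from AudioSet or from any of the models under discussion.

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity: dot(a, b) / (|a| * |b|).
    # 1.0 means the vectors point the same way; 0.0 means orthogonal,
    # i.e. "nothing in common" under this geometric reading of sound.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical 4-dimensional embeddings for two audio clips.
clip_a = [0.2, 0.8, 0.1, 0.4]
clip_b = [0.25, 0.75, 0.05, 0.5]
score = cosine_similarity(clip_a, clip_b)
```

Everything that makes two sounds "similar" in such a system is fixed upstream, in whatever pipeline produced the embedding values; the metric itself only compares geometry, which is precisely the kind of metrical intervention the abstract proposes to read critically.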
Felicia Jing (Johns Hopkins University)
Long abstract:
This paper explores the political and economic rationalities that are (re)produced by the compression algorithms involved in audio data transmission and formatting. While economic logics of compression are largely expressed through the terms of an empiricist paradigm—e.g., where a message is said to be compressed when all “excess” has been eliminated and its “essence” has been preserved (thereby producing surplus bandwidth)—the operation of identifying and eliminating excess sound exists within a pre-existing realm of the sensible, one that is entangled with political partitions of silence/sound and voice/noise that have already been drawn in advance. For instance, a number of political assumptions underwrite the construction of so-called “redundant” and “unnecessary” audio data—these are conceptions of legibility and intelligibility that are situated in a broader history of political thought, one that begins with Plato and Aristotle’s distinction between logos and phoné, or ‘reasoned speech’ and ‘noise’, which functioned to delineate the political subject from the non-subject. This paper seeks to surface the ways that audio data compression algorithms participate in this legacy through the construction of models that trace these distributions of sense and subjects. Then, drawing from a rich critical tradition invested in the deconstruction of such notions of the subject and of logos (in particular, the work of Jacques Derrida and Jacques Rancière), AI models can be understood to always leave behind a residue that cannot be contained by orders of the sensible.