Accepted Paper

The Price of Gaining Access: NSFW Pipeline Analysis  
Ella Streefkerk (Goethe University Frankfurt), Paula Helm (Goethe University Frankfurt)

Paper short abstract

NSFW (not-safe-for-work) data has repeatedly been flagged in ML training datasets. Rather than treating this solely as a dataset issue, we discuss a mixed-methods approach that traces NSFW platform taxonomies across dataset acquisition, model training, and system outputs, while reflecting on the ethical tensions this research entails.

Paper long abstract

Despite filtering efforts, not-safe-for-work (NSFW) data has recurrently been identified in large-scale multimodal training datasets, largely due to web-scraping (Birhane et al. 2021; Thiel 2023). Such content is commonly framed as a dataset problem focused on explicit content detection. With the development of foundation models, emergent video AI systems, and decreasing insistence on guardrails, the societal implications of NSFW data have become a matter of public concern (Birhane et al. 2024).

Existing approaches focused on dataset cleaning are insufficient: treating NSFW content solely as a dataset issue overlooks the broader taxonomies scraped directly from NSFW platforms. These taxonomies structure subjects through intersectional racialised and sexualised classifications, which shape downstream ML systems. Platforms thus function as large-scale annotation infrastructures whose logics travel beyond their original context. To attend to these sidelined dimensions of NSFW data, we discuss a pipeline approach tracing NSFW content across platforms, datasets, ML processes, and model outputs. This requires a mixed-methods approach, ranging from quantitative analysis of platform taxonomies to investigations of development practices and qualitative interpretation of system outputs.

Access is usually restricted in studying the more opaque stages, such as dataset curation and model training. This introduces ethical tensions, as researchers may need to ally with projects whose goals they do not embrace in order to acquire necessary insider information. Furthermore, sustained exposure to explicit content and taxonomies can impact researchers' well-being. Addressing these tensions requires reflexive methodological practices that situate the auditing of NSFW infrastructures within broader debates on research ethics and data labour.

Traditional Open Panel P043
The matter of method in researching AI: elusiveness, scale, opacity
  Session 2