Illusion of Reproducibility in Medical Machine Learning
Veronika Cheplygina
(IT University of Copenhagen)
Théo Sourget
(IT University of Copenhagen)
Amelia Jiménez-Sánchez
(IT University of Copenhagen)
Paper Short Abstract
Machine learning (ML) diagnosis of medical images attracts a lot of attention, yet progress in clinical practice has not been proportional. Despite larger datasets and open-source tools, reproducibility remains a problem. In this talk we dive into these problems and discuss what changes are needed in the future.
Paper Abstract
Machine learning (ML) diagnosis of medical images has attracted a lot of attention recently, with some claims of surpassing expert-level performance, yet progress in clinical practice has not been proportional.
The increased popularity is often explained by two developments. First, there are several large publicly available datasets. Second, open-source ML toolboxes allow algorithms to be developed with less domain expertise.
Despite these seemingly ideal conditions for reproducibility, several issues remain; in this talk we will highlight two. One issue is that large sample sizes are not a panacea. There is a tendency to expect that a clinical task can be “solved” if the dataset is large enough. However, not all clinical tasks translate neatly into ML tasks, and creating larger datasets can come at the expense of data quality.
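As a toy illustration of the data-quality point (a minimal sketch of our own, not material from the talk; the 10% noise rate and sample sizes are assumptions), even a perfect model cannot appear to score above the label quality of the test set, no matter how large that set grows:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed noise rate, e.g. labels mined automatically from radiology reports.
noise_rate = 0.10

for n in (1_000, 10_000, 100_000):
    y_true = rng.integers(0, 2, size=n)            # ground-truth disease status
    flip = rng.random(n) < noise_rate              # which cases are mislabelled
    y_label = np.where(flip, 1 - y_true, y_true)   # what the dataset records
    apparent_acc = (y_true == y_label).mean()      # an oracle that predicts y_true
    print(f"n={n:>7}: apparent accuracy of a perfect model = {apparent_acc:.3f}")
```

Growing n only tightens the estimate around roughly 1 − noise_rate; it never lifts the ceiling. Only better labels can.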
The other issue is the availability of data and code, combined with the option to run experiments “infinitely” (with different subsets of data, different parameters, etc.), which creates an illusion of reproducibility. Even if a code repository is available, it might not be clear (1) what data was used, as the data might not be cited or may be a derivative of a public dataset, or (2) how many other experiments were run but are not included in the repository.
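To make the second issue concrete, here is a minimal, hypothetical sketch (our own construction, with all numbers assumed): a classifier that is no better than chance, re-evaluated across many reruns on a small test set, will eventually produce a run that looks clearly above chance, and that is the run most likely to be reported:

```python
import numpy as np

rng = np.random.default_rng(0)

n_test = 100         # small test sets are common in medical imaging
true_accuracy = 0.5  # the classifier has no real signal, by construction

# One honest evaluation vs. the best of 50 reruns (different seeds,
# data subsets, hyperparameters, ...) of the same chance-level model.
one_run = rng.binomial(n_test, true_accuracy) / n_test
best_of_50 = max(rng.binomial(n_test, true_accuracy) / n_test
                 for _ in range(50))

print(f"one honest run:    {one_run:.2f}")     # close to 0.50
print(f"best of 50 reruns: {best_of_50:.2f}")  # typically around 0.60
```

Releasing the code for the winning run reproduces its number exactly, yet reveals nothing about the 49 runs that were discarded.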
In this talk we dive deeper into these problems and hopefully, with the help of the audience, also explore some solutions. We will also touch upon various incentives in ML and academia that interact with these findings.
Accepted Poster
Poster session
Session 1: Tuesday 1 July, 2025