Increasing the Reproducibility and Reuse of Partially Shared Data: Limiting the Impacts of Post-Analytic Attrition via Synthetic Data Recreation
Christine White
Jeffrey Shero
(Vanderbilt University)
Stephanie Estrera
(Florida State University)
Paper Short Abstract
Shared data allows research to be reproduced and new insights to be generated, but participants opting out of having their data shared can significantly limit this potential. As such, this study explores techniques to synthetically recreate non-shared data, ensuring better reproducibility and reuse.
Paper Abstract
Growing momentum in open science and data sharing has led many granting agencies to require that data be publicly shared at the time results are published. This, in theory, leads to greater reproducibility of study results and allows for novel insights through secondary data analysis. However, the prospect of publicly shared data may exacerbate existing concerns about participant privacy, and participants included in the original analyses may not consent to having their data shared publicly, a circumstance which we refer to as post-analytic attrition (PAA). Sharing partial data sets due to PAA significantly limits the potential for reproducibility and reuse; however, there are few other options for researchers who must respect the wishes of both their granting agencies and their participants. The present study introduces three techniques to supplement partially shared data sets and examines the extent to which each approach improves reproducibility and potential for reuse. Publicly available data were repeatedly sampled to represent varying levels of PAA and rates of differential PAA by participant demographics. Covariance-based simulation, model-based simulation, and machine learning were used to recreate the non-shared data as a supplement to the partially shared data set. Using these synthetically generated data, we then attempted to reproduce the original study findings and recover the original data structure. Results were compared across the simulation techniques and levels of PAA/differential PAA. Future directions are also discussed, with particular focus on balancing participant privacy and data reuse, as well as prioritizing reproducibility versus general reuse.
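As an illustration only, the covariance-based approach named in the abstract might look roughly like the sketch below. It is not the authors' implementation: the function names (`apply_paa`, `recreate_from_covariance`), the toy two-variable data, the 30% PAA rate, and the multivariate-normal assumption are all hypothetical choices made for this example, and PAA is applied uniformly at random rather than differentially by demographics.

```python
import numpy as np


def apply_paa(data, paa_rate, rng):
    """Withhold a random fraction of rows to mimic post-analytic attrition.

    Assumption: attrition is uniform at random (no differential PAA).
    """
    n = data.shape[0]
    n_withheld = int(round(paa_rate * n))
    withheld_idx = rng.choice(n, size=n_withheld, replace=False)
    keep = np.ones(n, dtype=bool)
    keep[withheld_idx] = False
    return data[keep], n_withheld


def recreate_from_covariance(shared, n_withheld, rng):
    """Draw synthetic rows from a multivariate normal fitted to the shared rows.

    Assumption: the variables are continuous and roughly jointly normal.
    """
    mean = shared.mean(axis=0)
    cov = np.cov(shared, rowvar=False)
    return rng.multivariate_normal(mean, cov, size=n_withheld)


rng = np.random.default_rng(0)

# Toy "full" data set standing in for the original study data (hypothetical).
true_cov = np.array([[1.0, 0.5], [0.5, 1.0]])
full = rng.multivariate_normal([0.0, 0.0], true_cov, size=1000)

# Simulate 30% PAA, then supplement the shared rows with synthetic ones.
shared, n_withheld = apply_paa(full, 0.3, rng)
synthetic = recreate_from_covariance(shared, n_withheld, rng)
supplemented = np.vstack([shared, synthetic])
```

In a sketch like this, reproducibility could be checked by re-running the original analysis on `supplemented` and comparing the estimates with those from `full`; repeating the procedure over many random samples and PAA rates would mirror the repeated-sampling design described above.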
Accepted Poster
Poster session
Session 1 Tuesday 1 July, 2025