Accepted Contribution

Toward Robust Evaluation of Synthetic Data: Lessons from LLM Bias and Temporal Representation  
Vera Danilova (Uppsala University), Julia Reed

Short abstract

Synthetic data is used for training and benchmarking AI models. However, research on LLM biases and temporal representation suggests that systematic distortions can affect generated data. We integrate insights from previous research to advance systematic evaluation practices for synthetic datasets.

Long abstract

AI training pipelines often face shortages of high-quality, diverse, and ethically shareable data. To address this limitation, synthetic data is increasingly used for training and benchmarking, mitigating data scarcity and sidestepping privacy concerns. Alongside large language models (LLMs), smaller models are also used to generate and augment textual, numeric, and mixed datasets.

At the same time, a growing body of research shows that LLMs exhibit systematic biases regardless of model size. Some biases resemble human cognitive or social patterns, while others—such as centrality bias—appear to be artifacts of model training. When LLM-generated synthetic data is reused in training and evaluation pipelines, these distortions risk propagating through the entire data cycle of generation, training, and benchmarking.

Post-training methods such as instruction tuning with human feedback can improve a model’s ability to produce structured outputs, including tabular data. However, these techniques may also intensify certain preference biases.

Temporal biases are particularly important because real-world datasets are inherently temporal and contain complex relationships between events and time. LLMs often struggle to capture these relationships reliably, instead reproducing learned associations that may reflect spurious correlations.

In our recent work, we used synthetic data generation to examine the interaction between inherent LLM biases and representations of historical reality in the extremely low-resource domain of historical medical periodicals. Building on this case study, we propose integrating research on synthetic data evaluation with studies of LLM biases and temporal distortions to support more robust evaluation practices.

Combined Format Open Panel CB027
Synthetic data and representation: The politics of AI generated computational practices
Session 1