Evidence synthesis is often placed at the top of the hierarchy of scientific evidence, and the number of meta-analyses and systematic reviews produced each year keeps growing. This session will focus on new tools, methods, and the credibility of evidence synthesis.
Long Abstract
This session highlights innovations and challenges in evidence synthesis, meta-analysis, and reproducibility tools. Robert Emprechtinger introduces metaHelper, an R package and web app that streamlines statistical transformations in meta-analysis, featuring effect size conversions and an evaluation via a randomized controlled trial. Sean Smith presents SOLES—a Systematic Online Living Evidence Summary—leveraging machine-assisted screening, AI-driven annotations, and community input to track interventions for improving reproducibility. Kristen Scotti presents a large-scale review of 2,253 systematic reviews that reveals a sharp rise in machine learning use for screening (from 0.6% in 2018 to 12.8% in 2024), though broader ML applications remain rare and underreported. Thomas Starck reports on a meta-research study of 6,294 Cochrane reviews showing that only 7% of evidence gradings were rated high quality, with no improvement over 15 years. Kinga Bierwiaczonek examines heterogeneity reporting in 1,207 psychological meta-analyses, finding that 22–41% omit heterogeneity entirely and that, when reported, it is often ignored in conclusions. Finally, Maximilian Frank proposes a new metadata standard for scientific publishing that would embed key research elements—such as hypotheses and test statistics—in machine-readable formats to support automated synthesis and transparency.
metaHelper is an R package and web application that simplifies statistical transformations in meta-analysis, making effect size conversions and related calculations more accessible. This presentation covers its features, practical applications, and evaluation in a randomized controlled trial (RCT).
Long abstract
metaHelper is an R package and web application designed to simplify statistical transformations in meta-analysis. It provides user-friendly tools for converting between common effect sizes, calculating standard errors, and handling transformations required for meta-analytic workflows. The web application offers an intuitive interface, making these methods accessible to researchers without advanced programming skills, while the R package allows direct integration into analysis pipelines.
metaHelper supports key effect size measures, including odds ratios, standardized mean differences, and correlation coefficients, ensuring compatibility with various meta-analytic approaches. The tool addresses common challenges in meta-analysis by reducing errors and streamlining calculations.
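As an illustration of the kind of transformation involved, the standard logit-based conversion turns an odds ratio and its standard error into a standardized mean difference (whether metaHelper implements exactly this formula is an assumption here, not a claim about its internals):

\[
d = \ln(\mathrm{OR}) \cdot \frac{\sqrt{3}}{\pi},
\qquad
SE_d = SE_{\ln \mathrm{OR}} \cdot \frac{\sqrt{3}}{\pi}.
\]

For example, OR = 2.0 with a standard error of ln(OR) of 0.30 corresponds to d ≈ 0.38 with SE_d ≈ 0.17, the sort of calculation the tool is meant to automate.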
To assess its usability and impact, a randomized controlled trial (RCT) evaluated metaHelper’s effectiveness in improving accuracy and efficiency compared to traditional methods. The metaHelper group had a higher probability of providing correct answers (85 percent) than the control group (31 percent). Additionally, the metaHelper group completed each task an average of 133 seconds faster (95% CrI: 83 to 180).
We will showcase the core features of metaHelper, demonstrate its practical applications, and discuss insights from its evaluation. By providing an accessible solution for statistical transformations, metaHelper aims to support researchers in conducting more reliable and efficient meta-analyses.
To address the uncertainty around effective interventions for improving reproducibility in science, we developed SOLES - a Systematic Online Living Evidence Summary - using machine-assisted screening and AI-driven annotations, aided by community feedback, presented in an interactive web dashboard.
Long abstract
Reproducibility is fundamental to scientific progress. Multiple interventions to improve reproducibility have been proposed and/or tested, yet it remains unclear which strategies are most effective. As part of the iRISE (improving Reproducibility in SciencE) project, we have created a Systematic Online Living Evidence Summary (SOLES) to identify, curate, and visualise the entire literature base, aided by artificial intelligence (AI).
We systematically identified published articles describing interventions to improve reproducibility (n = 16,832 included). After dual-screening a subset (n = 5,000) for relevance, we trained a machine learning classifier to identify all relevant articles. We annotated 138 articles for predefined attributes, including scientific discipline, intervention, outcome, participants, and location. Using these annotations, we designed prompts and evaluated the annotation capabilities of different large language models. The best-performing approach was then applied across the included studies.
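For readers unfamiliar with machine-assisted screening, the sketch below shows the general shape of such a classifier: TF-IDF features feeding a logistic regression trained on human screening decisions. It is a minimal illustration on toy data, not the iRISE-SOLES code, and the actual models, features, and decision thresholds used in the project may differ.

```python
# Minimal sketch of machine-assisted title/abstract screening.
# Illustration only; not the iRISE-SOLES pipeline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-in for the dual-screened training subset
# (1 = describes a reproducibility intervention, 0 = not relevant).
train_texts = [
    "Registered reports as an intervention to improve replicability",
    "Effect of open data badges on data sharing rates",
    "Preregistration and analytic flexibility in psychology",
    "A phase II trial of a novel oncology drug",
    "Crop yield responses to nitrogen fertiliser",
    "Deep learning for image segmentation in radiology",
]
train_labels = [1, 1, 1, 0, 0, 0]

# TF-IDF features feeding a regularised logistic regression classifier.
classifier = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
classifier.fit(train_texts, train_labels)

# Score new, unscreened records. Screening usually favours sensitivity,
# so a low inclusion threshold plus human checks is typical.
new_texts = ["Do badges for open materials change researcher behaviour?"]
print(classifier.predict_proba(new_texts)[:, 1])
```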
All outputs are presented on an interactive web dashboard, where users can interrogate the latest evidence (https://camarades.shinyapps.io/irise-soles/). To maintain a “living” evidence base, we automated the process, allowing for weekly updates. To ensure comprehensive coverage, we recently integrated grey literature sources, including preprints and conference abstracts. Our recently implemented feedback loop allows users to suggest corrections to automated screening decisions and annotations. These corrections can help us make continuous, community-driven improvements to the accuracy of our AI-driven screening and annotation process.
iRISE-SOLES provides the scientific community with a comprehensive, multi-disciplinary, up-to-date summary of interventions to improve reproducibility. The dashboard will allow researchers, policymakers, and other stakeholders to make informed, evidence-based decisions on activities they undertake to improve reproducibility.
Machine learning use in evidence synthesis is growing, rising from 0.6% (2018) to 12.8% (2024). Of 2,253 reviews, ~5% reported ML, mostly for screening (~95%), with underreporting concerns. Few studies applied ML beyond screening. Standardized reporting is needed for transparency and rigor.
Long abstract
Evidence synthesis (ES) aggregates and evaluates research to enhance applicability, inform evidence-based practices, identify knowledge gaps, and guide policy. It supports decision-making and advances scientific consensus across disciplines but typically requires significant human effort. The growing volume of research has compounded these demands, prompting interest in integrating machine learning (ML) to improve efficiency in ES tasks. This study examines reported use of ML in evidence syntheses published in the Cochrane Database of Systematic Reviews, Campbell Systematic Reviews, and Environmental Evidence from 2017 to 2024. Of 2,253 studies analyzed, ~89% were from Cochrane, ~7% from Campbell, and ~4% from Environmental Evidence. The use of ML was explicitly reported in only ~5% of studies, primarily for screening (~95%). Few studies applied ML to other review stages, with four reporting it for search and one each for data extraction and analysis. Only one study reported ML use across multiple stages (search and screening). The first reported ML usage appeared in 2018 (~0.6% of studies), rising to 12.8% in 2024, representing a 2033% increase over six years. While 642 studies (~28%) reported use of ML-enabled tools for screening, only ~18% of those explicitly reported the use of ML functionalities, raising concerns about underreporting. Additionally, only ~6% of ML-reporting studies noted potential biases or limitations inherent to ML techniques. These findings highlight the need for standardized reporting guidelines to ensure transparency and reproducibility in ML-assisted evidence synthesis. Reducing time and effort while maintaining methodological rigor is essential for integrating ML into ES workflows.
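For reference, the reported relative increase follows directly from the two prevalence figures:

\[
\frac{12.8\% - 0.6\%}{0.6\%} \times 100\% \approx 2033\%.
\]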
Medical systematic reviews follow a rigorous process and assess the quality of evidence using the GRADE framework. We gathered 60,000 gradings of evidence from 6,300 Cochrane systematic reviews published over 15 years. Overall, the level of evidence was high in only 7%, with no improvement over time.
Long abstract
Context: Systematic reviews and meta-analyses are essential to support decision-making. In the context of health research, the Cochrane Database of Systematic Reviews (CDSR) is renowned for its rigorous synthesis of evidence and its updating process as new evidence emerges. Cochrane implements the GRADE framework and reports the certainty of evidence—high, moderate, low, or very low—for all prespecified primary outcomes in summary of findings tables. Our study aims to advance metascience by elucidating trends in evidence quality using large-scale systematic review data.
Methods: We identified all Cochrane systematic reviews indexed in the CDSR from 2010 to 2025 that reported a summary of findings table, and extracted certainty ratings using GPT-4o mini. A quality assurance process for the data extraction is ongoing. All data and code will be shared on GitHub and in an associated Zenodo repository with a DOI.
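A minimal sketch of what such LLM-assisted extraction can look like is shown below. The prompt, input handling, and output format are assumptions made for illustration, not the pipeline used in the study; the sketch assumes the OpenAI Python client and the gpt-4o-mini model identifier.

```python
# Minimal sketch of LLM-assisted extraction of GRADE certainty ratings
# from a summary of findings table. Illustration only; the prompt and
# output format are assumptions, not the study's actual pipeline.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_PROMPT = (
    "You extract GRADE certainty ratings from Cochrane summary of findings "
    "tables. For each outcome, return one line formatted as "
    "'outcome | certainty', where certainty is one of: high, moderate, "
    "low, very low. Return nothing else."
)

def extract_certainty(sof_table_text: str) -> str:
    """Send one summary of findings table to the model, return raw lines."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,  # keep extraction output as deterministic as possible
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": sof_table_text},
        ],
    )
    return response.choices[0].message.content

# Hypothetical usage on one review's table text:
# print(extract_certainty(open("review_sof_table.txt").read()))
```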
Results: We identified 6,346 systematic reviews, yielding data on 59,979 graded medical outcomes. Our preliminary analyses reveal the following distribution of evidence certainty: 7% high, 24% moderate, 37% low, and 33% very low. Although preliminary results suggest a trend toward worsening certainty over time, these findings may be subject to residual confounding. To mitigate such effects, we will specifically assess the evolution of certainty in updates of reviews addressing the same research questions.
Conclusion: Our results call into question how primary research is planned and conducted, and prompt reflection on the role of systematic reviews in improving primary research.
Heterogeneity of meta-analytical effects is often overlooked. We review three sets of psychological meta-analyses (total 1,207 meta-analyses). 22%-41% of them do not report heterogeneity at all. When reported, heterogeneity is high but rarely considered in the authors’ conclusions.
Long abstract
In psychology, the perception that meta-analyses represent conclusive evidence is widespread. Yet recent findings contradicting some of the most prominent meta-analyses of the discipline indicate that meta-analytical evidence may be largely overstated, distorting research results and leading practitioners astray. Part of the reason might be that meta-analysts often base their conclusions on statistical significance (p-values) and the size of the average effect, overlooking other crucial information, such as the heterogeneity of effects. Here, we review three datasets: a pool of the 100 most cited meta-analyses across five subfields of psychology (applied, clinical, developmental, educational, social), a pool of 714 meta-analyses published in social-psychology journals, and a pool of 393 meta-analyses from the PSYNDEX/PsychOpen CAMA database. Preliminary results indicate that 21.50% of the most cited meta-analyses and 40.99% of meta-analyses from social-psychology journals do not report any information about heterogeneity at all. When heterogeneity is reported, it tends to be high, with average I^2 values of 69.00%, 70.20%, and 80.93% in the three datasets, and average tau values of .10, .20, and .21, accompanying average effects of r = .27, .21, and .20, respectively. This elevated heterogeneity, combined with small average effects, complicates the interpretation of results, yet this is rarely considered in the conclusions formulated by the authors of the meta-analyses. These omissions might contribute to the low reliability of meta-analytical results in psychology.
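To make the interpretive problem concrete, a back-of-the-envelope calculation (not reported in the study, assuming tau is on the same scale as r and ignoring estimation uncertainty) gives an approximate 95% prediction interval for the second dataset:

\[
r \pm 1.96\,\tau = .21 \pm 1.96 \times .20 \approx [-.18,\ .60],
\]

an interval running from a moderate negative to a large positive correlation, which is why conclusions based only on the size and significance of the average effect can be misleading.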
We argue for the importance of a new metadata standard for scientific publishing that embeds key research elements such as hypotheses, test statistics, and focal results in machine-readable format. Such a standard aims to enhance automated synthesis, meta-research, and transparency.
Long abstract
With the accelerating growth of the scientific literature, endeavours to assess its quality and synthesize knowledge (e.g., systematic literature reviews, meta-analyses, and p-curve analyses) have become ever more relevant, and the development of metadata databases (e.g., Scopus, CrossRef, OpenAlex) has hugely facilitated the identification of relevant papers for these purposes. However, access restrictions and a lack of interoperability between databases hinder this progress: especially for non-open-access publications, only titles, abstracts, and keywords are typically available for automated analysis, severely constraining investigations of large text corpora. This forces scientists to resort to laborious and inefficient text-mining approaches in which they have to acquire document full texts, convert them to machine-readable formats, and extract the relevant information.
We propose developing a new metadata standard for scientific publishing that embeds key research elements—formalized hypotheses, test statistics, and results—into a structured, machine-readable format. By extending metadata beyond abstracts, this standard would enhance automated research synthesis, enable large-scale metascientific investigations, and improve transparency and reproducibility.
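A sketch of what one such machine-readable record could contain is shown below; the field names and values are purely illustrative assumptions, not a proposed specification.

```python
# Hypothetical sketch of a machine-readable research metadata record.
# Field names and values are illustrative only, not a specification.
import json

record = {
    "doi": "10.xxxx/example",          # placeholder identifier
    "hypotheses": [
        {
            "id": "H1",
            "statement": "Intervention X increases outcome Y vs. control",
            "preregistered": True,
        }
    ],
    "results": [
        {
            "hypothesis_id": "H1",
            "test": "Welch t-test",
            "statistic": {"t": 2.41, "df": 87.3},
            "p_value": 0.018,
            "effect_size": {"type": "Cohen_d", "value": 0.51,
                            "ci95": [0.09, 0.93]},
            "focal": True,             # flags the paper's focal result
        }
    ],
}

# Such a record could be serialised alongside existing metadata fields.
print(json.dumps(record, indent=2))
```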
Metascience 2025, with its unique intersection of researchers and other stakeholders in the scientific process, seems to us an ideal opportunity to advance this topic. We will explore challenges in implementation, potential incentives for adoption, and pathways for integration within existing infrastructures. By discussing with, and fostering collaboration across, disciplines, institutions, publishers, librarians, and research infrastructure providers, we seek to establish a roadmap for a metadata standard that can transform metascientific research at scale.