A fully AI-generated paper just passed peer review; notes from our evidence synthesis workshop / Mar 2025
Access to reliable and timely scientific evidence is vital for the practice of responsible policymaking, especially given the turmoil in the world these days. At the same time, the evidence base on which we make these decisions is rapidly morphing under our feet; the first entirely AI-generated paper passed peer review at an ICLR workshop today. We held a workshop on this topic of AI and evidence synthesis last week, to understand both the opportunities for the use of AI here and the challenges it raises for established practice.
(The following notes are adapted from jottings taken during the workshop discussions.)
We invited a range of participants to the workshop and held it at Pembroke College (the choice of the centuries-old location felt appropriate).

Evidence synthesis at scale
Evidence synthesis is a vital tool to connect scientific knowledge to areas of demand for actionable insights. It helps build supply chains of ideas that connect research to practice in ways that can deliver meaningful improvements in policy development and implementation. Its value can be seen across sectors: aviation safety benefitted from systematic incident analysis; medical care has advanced through clinical trials and systematic reviews; engineering is enhanced through evidence-based design standards. When done well, evidence synthesis can transform how fields operate. However, for every field where evidence synthesis is embedded in standard operating practices, there are others relying on untested assumptions or outdated guidance. --
Jessica Montgomery, AI@Cam
One such field that benefits from evidence synthesis is conservation.
Scale poses a fundamental challenge to traditional approaches to evidence synthesis. Comprehensive reviews take substantial resources and time. By the time they are complete – or reach a policy audience – the window for action may have closed. The Conservation Evidence project at the University of Cambridge offers an example of how researchers can tackle this challenge. The Conservation Evidence team has analysed over 1.3M journal articles across 17 languages and built a website enabling access to this evidence base. To help users interrogate this evidence base, the team has compiled a metadataset that allows users to explore the literature based on a question of interest, for example looking at what conservation actions have been effective in managing a particular invasive species in a specified geographic area. --
Jessica Montgomery, AI@Cam
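To make that kind of query concrete, here is a minimal sketch in Python of how such a metadataset might be interrogated. The column names, the CSV file, and the species/region values are purely illustrative assumptions, not the actual Conservation Evidence schema.

```python
# A hypothetical sketch of querying an evidence metadataset.
# Column names and the CSV file are illustrative assumptions only;
# they are not the actual Conservation Evidence data model.
import pandas as pd

# Each row summarises one study: what was tried, on what, where,
# and how effective it was judged to be.
studies = pd.read_csv("conservation_metadataset.csv")

def actions_for(species: str, region: str) -> pd.DataFrame:
    """Return interventions trialled against `species` in `region`,
    ranked by the number of supporting studies."""
    subset = studies[
        studies["target_species"].str.contains(species, case=False)
        & (studies["region"] == region)
    ]
    return (
        subset.groupby("intervention")
        .agg(n_studies=("study_id", "nunique"),
             mean_effectiveness=("effectiveness", "mean"))
        .sort_values("n_studies", ascending=False)
    )

print(actions_for("American mink", "Western Europe"))
```

The point of structuring the metadata this way is that "what has worked, where" becomes a filter-and-rank operation over an existing evidence base rather than a fresh literature trawl.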
The AI for evidence synthesis landscape is changing very rapidly, with a variety of specialised tools now being promoted in this space. These range from commercial tools such as Gemini Deep Research and OpenAI's Deep Research, to research-focused systems such as Elicit, DistillerSR, and RobotReviewer. These tools vary in their approach, capabilities, and target users, raising questions about which will best serve different user needs. RobotReviewer, for example, notes that:
[...] the machine learning works well, but is not a substitute for human systematic reviewers. We recommend the use of our demo as an assistant to human reviewers, who can validate the machine learning suggestions, and correct them as needed. Machine learning used this way is often described as semi-automation. -- About RobotReviewer
The problem, of course, is that these guidelines will often be ignored by reviewers under time pressure, and so the well-established protocols for systematic reviews are under some threat.

How do we make AI-driven systematic reviews more systematic?
- Traceability: Users should see which information sources informed the evidence review system and why any specific evidence was included or excluded.
- Transparency: Open-source computational code, the use of open-weights models, ethically sourced training data, and clear documentation of methods mean users can scrutinise how the system works.
- Dynamism: The evidence outputs should be continuously updated to refine the evidence base, by adding new evidence and flagging retracted papers; a sketch of what such record-keeping might look like follows this list.
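To make those three properties a little more concrete, here is a minimal sketch of the kind of provenance record an evidence review system might keep. The field names and the retraction check are assumptions for illustration, not any particular tool's data model.

```python
# A hypothetical sketch of a provenance record supporting traceability
# (why a source was included or excluded) and dynamism (re-checking
# retraction status as the evidence base changes). Field names are
# illustrative, not taken from any existing tool.
from dataclasses import dataclass
from datetime import date

@dataclass
class EvidenceRecord:
    doi: str
    title: str
    included: bool
    rationale: str          # human- or model-readable reason for the decision
    screened_on: date
    screened_by: str        # e.g. "human", "model:<name>", or both
    retracted: bool = False

def refresh_retractions(records: list[EvidenceRecord],
                        retracted_dois: set[str]) -> list[EvidenceRecord]:
    """Flag any evidence whose DOI now appears in a retraction list
    (e.g. a periodically refreshed dataset of retracted papers)."""
    for record in records:
        if record.doi in retracted_dois:
            record.retracted = True
            record.rationale += " [flagged: source retracted]"
    return records
```

Storing inclusion decisions, their rationale, and retraction status as first-class data is what allows the synthesis to be re-run and audited as the underlying literature changes.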
One way these questions are being put to the test is through the Institute for Replication's "replication games", which compare teams working with and without LLM assistance:

Researchers will be randomly assigned to one of three teams: Machine, Cyborg or Human. Machine and Cyborg teams will have access to (commercially available) LLM models to conduct their work; Human teams of course rely only on unaugmented human skills. Each team consists of 3 members with similar research interests and varying skill levels. Teams will be asked to check for coding errors and conduct a robustness reproduction, which is the ability to duplicate the results of a prior study using the same data but different procedures as were used by the original investigator. -- Institute for Replication
These replication games are happening on the outputs of evidence synthesis, but the inputs are also rapidly changing, with today's announcement of a fully AI-generated paper passing peer review. It's hopefully now clear that AI is a hugely disruptive factor in evidence synthesis.

The opportunity ahead of us for public policy
We first discussed how AI could help in enhancing systematic reviews. AI-enabled analysis can accelerate literature screening and data extraction, helping make reviews more timely and comprehensive. The opportunity ahead of us is to democratise access to knowledge synthesis by making it available to those without specialised training or institutional resources, enabling wider deployment in countries and organisations that lack the resources to commission traditional reviews.
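As a hedged illustration of what accelerated screening could look like in practice, the sketch below asks an LLM to suggest an include/exclude decision for an abstract against the review's inclusion criteria, leaving the final call with a human reviewer, in line with the semi-automation pattern quoted from RobotReviewer above. The OpenAI client, model name, prompt wording, and criteria here are all assumptions; any comparable model would do.

```python
# A sketch of LLM-assisted abstract screening with a human making the
# final call. Model name, prompt, and criteria are illustrative
# assumptions, not a recommended review protocol.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

CRITERIA = """Include only randomised or controlled studies that
evaluate an intervention for managing an invasive species."""

def suggest_screening_decision(abstract: str) -> str:
    """Return the model's suggested INCLUDE/EXCLUDE label plus a short
    justification, to be validated by a human reviewer."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "You screen abstracts for a systematic review. "
                        "Answer INCLUDE or EXCLUDE, then one sentence of reasoning."},
            {"role": "user",
             "content": f"Criteria:\n{CRITERIA}\n\nAbstract:\n{abstract}"},
        ],
    )
    return response.choices[0].message.content

# The suggestion is logged alongside the human decision, never applied
# automatically: "semi-automation" in RobotReviewer's terms.
```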
However, there are big challenges remaining in how reliably these AI systems can actually perform such syntheses.
The brilliant
Policymakers therefore need realistic expectations about what AI can and cannot do in evidence synthesis.
The energy requirements for training and running these large-scale AI models are significant as well, of course, raising questions about the long-term maintenance costs of these tools and their environmental footprint. There was wide consensus that the UK should develop its own AI models to ensure resilience and sovereignty, but also to make sure that the regional fine-tuning needed to maximise positive outcomes is under clear local control and not outsourced geopolitically. By providing a single model that combines

Thanks
References
- Madhavapeddy (2025). Fake papers abound in the literature. 10.59350/qmsqz-ark89
- Madhavapeddy (2025). Thoughts on the National Data Library and private research data. 10.59350/fk6vy-5q841
- Madhavapeddy (2025). The AIETF arrives, and not a moment too soon. 10.59350/agfta-8wk09
- Iyer et al (2025). Careful design of Large Language Model pipelines enables expert-level retrieval of evidence-based information from syntheses and databases. 10.1371/journal.pone.0323563