A fully AI-generated paper just passed peer review; notes from our evidence synthesis workshop / Mar 2025
Access to reliable and timely scientific evidence is vital for the practice of responsible policymaking, especially given the turmoil in the world these days. At the same time, the evidence base we use to make these decisions is rapidly shifting under our feet; the first entirely AI-generated paper passed peer review at an ICLR workshop today. We held a workshop on AI and evidence synthesis at Pembroke College last week, to understand the opportunities for AI here, the strengths and limitations of current tools, and areas of progress, and also simply to chat with policymakers from DSIT and thinktanks about how to approach this rapidly moving area.
(The following notes are adapted from jottings from Jessica Montgomery, Sam Reynolds, Annabelle Scott and myself. They are not at all complete, but hopefully useful!)
We invited a range of participants to the workshop and held it at Pembroke College (the choice of the centuries-old location felt appropriate). Jessica Montgomery and Neil Lawrence expertly emceed the day, with Bill Sutherland, Sadiq Jaffer and Sam Reynolds also presenting provocations to get the conversation going.
Evidence synthesis at scale
Jessica Montgomery described the purpose of the workshop as follows:
Evidence synthesis is a vital tool to connect scientific knowledge to areas of demand for actionable insights. It helps build supply chains of ideas, that connect research to practice in ways that can deliver meaningful improvements in policy development and implementation. Its value can be seen across sectors: aviation safety benefitted from systematic incident analysis; medical care has advanced through clinical trials and systematic reviews; engineering is enhanced through evidence-based design standards. When done well, evidence synthesis can transform how fields operate. However, for every field where evidence synthesis is embedded in standard operating practices, there are others relying on untested assumptions or outdated guidance. -- Jessica Montgomery, AI@Cam
One such field that benefits from evidence is conservation, which Bill Sutherland and his team have been working on for years. Bill went on to discuss the fresh challenges and opportunities that AI brings to the field: it introduces a new element of scale that could augment relatively slow human efforts.
Scale poses a fundamental challenge to traditional approaches to evidence synthesis. Comprehensive reviews take substantial resources and time. By the time they are complete – or reach a policy audience – the window for action may have closed. The Conservation Evidence project at the University of Cambridge offers an example of how researchers can tackle this challenge. The Conservation Evidence team has analysed over 1.3M journal articles in 17 languages and built a website enabling access to this evidence base. To support users in interrogating this evidence base, the team has compiled a metadataset that allows users to explore this literature based on a question of interest, for example looking at what conservation actions have been effective in managing a particular invasive species in a specified geographic area. -- Jessica Montgomery, AI@Cam
The AI for evidence synthesis landscape is changing very rapidly, with a variety of specialised tools now being promoted in this space. These range from commercial tools such as Gemini Deep Research and OpenAI's Deep Research, to research-focused systems such as Elicit, DistillerSR, and RobotReviewer. These tools vary in their approach, capabilities, and target users, raising questions about which will best serve different user needs. RobotReviewer, for example, notes that:
[...] the machine learning works well, but is not a substitute for human systematic reviewers. We recommend the use of our demo as an assistant to human reviewers, who can validate the machine learning suggestions, and correct them as needed. Machine learning used this way is often described as semi-automation. -- About RobotReviewer
The problem, of course, is that these guidelines will often be ignored by reviewers under time pressure, and so the well-established protocols for systematic reviews are under threat.
How do we get more systematic AI-driven systematic reviews?
Sadiq Jaffer and Sam Reynolds then talked about some of the computing approaches required to build a more reliable evidence review pipeline. They identified three key principles for responsible AI integration into evidence synthesis (a rough illustrative sketch follows the list):
- Traceability: Users should see which information sources informed the evidence review system and why any specific evidence was included or excluded.
- Transparency: Open-source code, the use of open-weights models, ethically sourced training data, and clear documentation of methods mean users can scrutinise how the system is working.
- Dynamism: The evidence outputs should be continuously updated to refine the evidence base, by adding new evidence and flagging retracted papers.
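To make these principles a little more concrete, here is a minimal sketch (my own illustration, not something presented at the workshop) of how a single screening decision might be recorded so that it stays traceable, transparent, and updatable. All field names and the helper function are assumptions rather than any real tool's schema.

```python
# Hypothetical sketch of an auditable evidence record; field names are assumptions.
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class EvidenceRecord:
    # Traceability: which source this decision refers to, and why it was kept or dropped
    source_doi: str
    included: bool
    rationale: str
    # Transparency: which model and documented method produced the decision
    model_name: str        # ideally an open-weights model
    model_version: str
    method_doc_url: str    # link to the published screening protocol
    # Dynamism: when the record was last checked, and whether the source was retracted
    last_checked: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    retracted: bool = False


def flag_retraction(record: EvidenceRecord) -> None:
    """Mark a record as retracted and refresh its check timestamp."""
    record.retracted = True
    record.last_checked = datetime.now(timezone.utc)
```

The point is simply that each principle maps onto something a user can store and inspect: the source and rationale (traceability), the model and method (transparency), and whether the record is still current (dynamism).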
Alex Marcoci pointed to his recent work on AI replication games, which I found fascinating. The idea here is that:
Researchers will be randomly assigned to one of three teams: Machine, Cyborg or Human. Machine and Cyborg teams will have access to (commercially available) LLM models to conduct their work; Human teams of course rely only on unaugmented human skills. Each team consists of 3 members with similar research interests and varying skill levels. Teams will be asked to check for coding errors and conduct a robustness reproduction, which is the ability to duplicate the results of a prior study using the same data but different procedures as were used by the original investigator. -- Institute for Replication
These replication games are examining the outputs of research, but the inputs are also rapidly changing, with today's announcement of a fully AI-generated paper passing peer review. It's hopefully now clear that AI is a hugely disruptive factor in evidence synthesis.
The opportunity ahead of us for public policy
We first discussed how AI could help in enhancing systematic reviews. AI-enabled analysis can accelerate literature screening and data extraction, helping make reviews more timely and comprehensive. The opportunity ahead of us is to democratise access to knowledge synthesis by making it available to those without specialised training or institutional resources, enabling wider deployment in countries and organisations that lack the resources to commission traditional reviews.
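As a rough illustration of what "accelerating literature screening" can mean in practice, here is a minimal sketch of an LLM-assisted screening pass over abstracts. This is my own sketch rather than any tool discussed at the workshop: `call_model` is a hypothetical stand-in for whatever model client is used, and the prompt wording is an assumption.

```python
# Hypothetical semi-automated screening loop; the model call is a placeholder.
from typing import Callable, Iterable

SCREEN_PROMPT = (
    "You are screening papers for a systematic review on {topic}.\n"
    "Abstract: {abstract}\n"
    "Answer INCLUDE or EXCLUDE, then give a one-sentence reason."
)


def screen_abstracts(
    abstracts: Iterable[tuple[str, str]],   # (doi, abstract text) pairs
    topic: str,
    call_model: Callable[[str], str],       # hypothetical LLM client
) -> list[dict]:
    """Return one decision per abstract, keeping the model's rationale for human audit."""
    decisions = []
    for doi, abstract in abstracts:
        reply = call_model(SCREEN_PROMPT.format(topic=topic, abstract=abstract))
        decisions.append({
            "doi": doi,
            "include": reply.strip().upper().startswith("INCLUDE"),
            "rationale": reply,  # kept verbatim so a human reviewer can validate it
        })
    return decisions
```

In keeping with the semi-automation framing in the RobotReviewer quote above, the output here is a worklist for a human reviewer to validate, not a final inclusion decision.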
However, big challenges remain in gaining access to published research papers and datasets. Publishers have deep concerns over AI-generated evidence synthesis, and more generally about the use of generative AI on their source material. But individual publishers are already selling their content to the highest bidder as part of the data-hoarding wars, and so the spread of that work into pretrained models is not currently happening equitably or predictably. Neil Lawrence called this "competitive exclusion", and it is limiting communication and knowledge diversity.
The brilliant Jennifer Schooling then led a panel discussion about the responsible use of AI in the public sector. The panel observed that different countries are taking different approaches to the application of AI in policy research. However, every country has deep regional variation in policy application and priorities, which means that global pretrained AI models always need some localised retuning. The "one-size-fits-all" approach works particularly badly for policy, where local context is crucial to a good community outcome that minimises harm.
Policymakers therefore need realistic expectations about what AI can and cannot do in evidence synthesis. Neil Lawrence and Jennifer Schooling came up with the notion that "anticipate, test, and learn" methods must guide AI deployment in policy research; this is an extension of the "test and learn" culture being pushed by Pat McFadden as part of the Labour plan to reform the public sector this year. With AI systems, Alex Marcoci noted that we need to be working with the end users of the tools to scope what government departments need and want. These conversations need to happen before we build the tools, letting us anticipate problems before we deploy and test them in a real policy environment. Neil Lawrence noted that policy doesn't have a simple "sandbox" environment in which to test AI outcomes, unlike many other fields where simulation is practical ahead of deployment.
Lucia Reisch noted that users must maintain critical judgement when using these new AI tools; the machine interfaces must empower users towards enhancing their critical thinking and encourage reflection on what outputs are being created (and what is being left out!). Lucia also mentioned that her group helps run the "What Works" summit, which I've never been to but plan on attending next time it rolls around.
The energy requirements for training and running these large-scale AI models are significant as well, of course, raising questions about the long-term maintenance costs of these tools and their environmental footprint. There was wide consensus that the UK should develop its own AI models to ensure resilience and sovereignty, but also to make sure that the regional finetuning needed to maximise positive outcomes is under clear local control and not outsourced geopolitically. By providing a single model that combines UK national data, we would also avoid wasting energy on lots of smaller training efforts across the four nations.
Thanks to Annabelle Scott for a stellar organising job, to Pembroke for hosting, and to everyone for attending. Please do continue the discussion about this on LinkedIn if you are so inclined.
Related News
- The AIETF arrives, and not a moment too soon / Feb 2025
- Thoughts on the National Data Library and private research data / Feb 2025
- Fake papers abound in the literature (via The Conversation) / Feb 2025
- Careful design of Large Language Model pipelines enables expert-level retrieval of evidence-based information from conservation syntheses / Jan 2025
- Conservation Evidence Copilots / Jan 2024