Out-of-the-box LLMs are not ready for conservation decision making / May 2025
Our paper finds that, with careful domain-specific design, LLMs could be powerful tools for enabling expert-level use of evidence syntheses and databases. However, general LLMs used "out-of-the-box" are likely to perform poorly and misinform decision-makers. Having established that LLMs can match human synthesis experts when providing restricted responses to queries of evidence syntheses and databases, future work can build on our approach to quantify LLM performance on open-ended responses.
In a nutshell, we tested 10 LLMs with six different retrieval strategies on their ability to answer conservation-related questions, benchmarked against the Conservation Evidence database.
We found that the models gave highly variable results when relying only on their pretrained knowledge, and were particularly poor at answering questions about reptile conservation. However, when given access to the CE database through retrieval-augmented generation (RAG), their performance improved dramatically. When we put these models head to head with human experts from the Conservation Evidence team, on the same set of questions and with RAG access to the database, the models were just as good as our experts but answered the questions far faster (near instantly).
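For readers curious what such a retrieval-augmented setup looks like in practice, here is a minimal Python sketch. Everything in it is illustrative: the evidence snippets are invented stand-ins (not the Conservation Evidence corpus), TF-IDF similarity stands in for whatever retrieval strategies the paper compared, and the final LLM call is left as a placeholder.

```python
# Minimal sketch of retrieval-augmented question answering over an evidence
# database. All content and design choices here are illustrative assumptions,
# not the paper's pipeline or the real Conservation Evidence corpus.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-ins for evidence summaries (hypothetical text).
EVIDENCE = [
    "Installing artificial nest boxes increased occupancy by target bird species.",
    "Creating ponds for amphibians: most studies found increased breeding success.",
    "Translocating reptiles: evidence of effectiveness is limited and mixed.",
]

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the query (TF-IDF cosine similarity)."""
    vectorizer = TfidfVectorizer().fit(docs + [query])
    scores = cosine_similarity(vectorizer.transform([query]), vectorizer.transform(docs))[0]
    return [docs[i] for i in scores.argsort()[::-1][:k]]

def build_prompt(query: str, context: list[str]) -> str:
    """Restrict the model to retrieved evidence rather than its pretrained knowledge."""
    evidence_block = "\n".join(f"- {c}" for c in context)
    return (
        "Answer using ONLY the evidence below. If it is insufficient, say so.\n"
        f"Evidence:\n{evidence_block}\n\n"
        f"Question: {query}\nAnswer:"
    )

question = "Does creating ponds help amphibian conservation?"
prompt = build_prompt(question, retrieve(question, EVIDENCE))
print(prompt)  # In a real pipeline this prompt would be sent to an LLM.
```

The key design point is that the model is asked to answer from the retrieved evidence only, which is what distinguishes this setup from querying an LLM's pretrained knowledge directly.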
Essentially, LLMs used out-of-the-box, without access to the underlying evidence, are likely to perform poorly and misinform decision-makers. This is crucial when considering how to build AI infrastructure for conservation decision making.
Particular props to 
This summary is adapted from a social media summary by