Out-of-the-box LLMs are not ready for conservation decision making / May 2025
Our paper finds that, with careful domain-specific design, LLMs could be powerful tools for enabling expert-level use of evidence syntheses and databases. However, general LLMs used "out-of-the-box" are likely to perform poorly and misinform decision-makers. Having established that LLMs can match human synthesis experts when providing restricted responses to queries of evidence syntheses and databases, future work can build on our approach to quantify LLM performance on open-ended responses.
In a nutshell, we tested 10 LLMs with six different retrieval strategies on their ability to answer conservation-related questions, benchmarked against the Conservation Evidence database.
We found that the models gave highly variable results when relying only on their pretrained knowledge, and were particularly poor at answering questions about reptile conservation. However, when given access to the CE database through retrieval-augmented generation (RAG), their performance improved dramatically. When we put these models head to head with human experts from the Conservation Evidence team, on the same set of questions and with RAG access to the database, the models were just as good as our experts but answered the questions far faster (near instantly).
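For readers curious what such a retrieval-augmented setup looks like in practice, here is a minimal Python sketch. Everything in it is illustrative: the evidence snippets are invented stand-ins (not the Conservation Evidence corpus), TF-IDF similarity stands in for whatever retrieval strategies the paper compared, and the final LLM call is left as a placeholder.

```python
# Minimal sketch of retrieval-augmented question answering over an evidence
# database. All content and design choices here are illustrative assumptions,
# not the paper's pipeline or the real Conservation Evidence corpus.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-ins for evidence summaries (hypothetical text).
EVIDENCE = [
    "Installing artificial nest boxes increased occupancy by target bird species.",
    "Creating ponds for amphibians: most studies found increased breeding success.",
    "Translocating reptiles: evidence of effectiveness is limited and mixed.",
]

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the query (TF-IDF cosine similarity)."""
    vectorizer = TfidfVectorizer().fit(docs + [query])
    scores = cosine_similarity(vectorizer.transform([query]), vectorizer.transform(docs))[0]
    return [docs[i] for i in scores.argsort()[::-1][:k]]

def build_prompt(query: str, context: list[str]) -> str:
    """Restrict the model to retrieved evidence rather than its pretrained knowledge."""
    evidence_block = "\n".join(f"- {c}" for c in context)
    return (
        "Answer using ONLY the evidence below. If it is insufficient, say so.\n"
        f"Evidence:\n{evidence_block}\n\n"
        f"Question: {query}\nAnswer:"
    )

question = "Does creating ponds help amphibian conservation?"
prompt = build_prompt(question, retrieve(question, EVIDENCE))
print(prompt)  # In a real pipeline this prompt would be sent to an LLM.
```

The key design point is that the model is asked to answer from the retrieved evidence only, which is what distinguishes this setup from querying an LLM's pretrained knowledge directly.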
Essentially, LLMs used out-of-the-box, without access to the underlying evidence, are likely to perform poorly and misinform decision-makers. This is crucial when considering how to build AI infrastructure for conservation decision making.
Particular props to 
This summary is adapted from a social media summary by