Anil Madhavapeddy, Professor of Planetary Computing

Fake papers abound in the literature (via The Conversation) / Feb 2025

Sadiq Jaffer sent along this piece in The Conversation last week about the remarkable number of academic papers that are now AI-generated. The true number of these papers is probably underestimated:

These papers are absorbed into the worldwide library of research faster than they can be weeded out. About 119,000 scholarly journal articles and conference papers are published globally every week, or more than 6 million a year. Publishers estimate that, at most journals, about 2% of the papers submitted – but not necessarily published – are likely fake, although this number can be much higher at some publications. -- Frederik Joelving et al, The Conversation

What caught my eye in this article is the Problematic Paper Screener, which the good folks at Retraction Watch developed. It detects, with high precision, papers produced by grammar-based generators. They noted in another article that over 764,000 articles cite papers that could be unreliable, further illustrating how the rot spreads through citation. Sadiq Jaffer and I are planning to run this over our growing paper corpus, but I can't find the source code for their system, only the hosted version.
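One signal this kind of screener looks for is "tortured phrases": mangled paraphrases of standard technical terms left behind by automatic rewriting tools. A minimal sketch of that idea in Python follows; the detection logic here is my own illustration (the real screener is far more sophisticated), though the phrase pairs are well-known published examples.

```python
# Sketch of tortured-phrase screening: flag text containing known
# mangled paraphrases of standard technical terms. The logic is my
# own illustration, not the Problematic Paper Screener's code; the
# phrase pairs are well-known examples from the literature on this.
TORTURED_PHRASES = {
    "counterfeit consciousness": "artificial intelligence",
    "profound learning": "deep learning",
    "colossal information": "big data",
    "irregular backwoods": "random forest",
    "flag to commotion": "signal to noise",
}

def screen(text: str) -> list[tuple[str, str]]:
    """Return (tortured phrase, expected term) pairs found in text."""
    lowered = text.lower()
    return [(bad, good) for bad, good in TORTURED_PHRASES.items()
            if bad in lowered]

# Flags both "profound learning" and "counterfeit consciousness":
hits = screen("We apply profound learning and counterfeit consciousness.")
```

A phrase list like this is cheap and precise but trivially evaded, which is part of why detection alone looks like a losing arms race.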

Meanwhile, datasets are under a similar threat, risking recursive model collapse. The Wordfreq team announced in September 2024 that they would stop updating their corpus because generative AI has polluted the data, and information that used to be free has become expensive. Patrick Ferris also noted the related problem of dataset versioning becoming unreliable across science in "Uncertainty at scale: how CS hinders climate research", though for different reasons: large datasets are inherently difficult to version and reproduce (it's still quite hard to share a terabyte of data over the Internet, even in this day and age).
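One partial mitigation for the reproducibility side of this is content addressing: publish a small manifest of cryptographic hashes alongside a dataset, so a specific version can be pinned and later verified without re-transferring the data itself. A minimal sketch, with illustrative file paths and chunk size of my own choosing:

```python
# Sketch: content-addressed manifest for a large dataset, so a given
# version can be pinned and verified without re-sharing the bytes.
# The chunk size and any file names are illustrative assumptions.
import hashlib

CHUNK = 1 << 20  # read files in 1 MiB chunks to bound memory use

def digest(path: str) -> str:
    """SHA-256 of a file's contents, computed incrementally."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK):
            h.update(chunk)
    return h.hexdigest()

def manifest(paths: list[str]) -> dict[str, str]:
    """Map each file to its digest; the manifest itself is tiny
    and easy to check into ordinary version control."""
    return {p: digest(p) for p in sorted(paths)}
```

This doesn't make a terabyte easier to move, but it does make it cheap to state exactly which terabyte a result depends on.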

Another big development this week was the release of OpenAI's Deep Research feature, which goes off and really mines a literature corpus for information. I've grudgingly upgraded to their expensive Pro tier to try this out and will report my findings in a future post. The ability to generate papers has moved well beyond the grammar generators that the Problematic Paper Screener can filter out, so this arms race is unlikely to end well if we're pinning our hopes on detecting AI-generated papers. The current publish-or-perish model is already dead; at least our Cambridge promotion process is more enlightened than "just" looking at paper counts!

# 4th Feb 2025   notes evidence llms science
