Fake papers abound in the literature (via The Conversation) / Feb 2025
These papers are absorbed into the worldwide library of research faster than they can be weeded out. About 119,000 scholarly journal articles and conference papers are published globally every week, or more than 6 million a year. Publishers estimate that, at most journals, about 2% of the papers submitted – but not necessarily published – are likely fake, although this number can be much higher at some publications. -- Frederik Joelving et al, The Conversation
What caught my eye in this article is the Problematic Paper Screener, developed by the article's authors and covered by the good folks at Retraction Watch. It detects papers produced by grammar-based generators with high precision. In another article, they noted that over 764,000 papers cite work that could be unreliable, showing how that unreliability creeps through the literature.
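To make the idea concrete, here is a minimal sketch of one signal used in this space: scanning text for known "tortured phrases", the mangled synonyms (e.g. "counterfeit consciousness" for "artificial intelligence") that automated paraphrasing leaves behind. This is purely illustrative; the phrase list and `flag_suspect` function are my own inventions, not the actual Problematic Paper Screener code, which uses far larger curated lists and several other detectors.

```python
# Illustrative sketch only -- not the real Problematic Paper Screener.
# Flags text containing "tortured phrases": mangled synonyms that
# automatic paraphrasing tools tend to produce.
import re

# A tiny, hand-picked sample of documented tortured phrases and the
# terms they usually stand in for.
TORTURED_PHRASES = {
    "counterfeit consciousness": "artificial intelligence",
    "profound learning": "deep learning",
    "irregular woodland": "random forest",
    "flag to commotion": "signal to noise",
}

def flag_suspect(text: str) -> list[tuple[str, str]]:
    """Return (tortured phrase, likely original term) pairs found in text."""
    lowered = text.lower()
    hits = []
    for phrase, original in TORTURED_PHRASES.items():
        if re.search(r"\b" + re.escape(phrase) + r"\b", lowered):
            hits.append((phrase, original))
    return hits

if __name__ == "__main__":
    abstract = ("We apply profound learning and an irregular woodland "
                "classifier to improve the flag to commotion ratio.")
    for phrase, original in flag_suspect(abstract):
        print(f"suspect: '{phrase}' (probably '{original}')")
```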
Meanwhile, datasets face a similar threat, raising the risk of recursive model collapse. The wordfreq team announced in September 2024 that they would stop updating their corpus because generative AI has polluted the web data it draws on, and information that used to be free has become expensive.
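For anyone who hasn't used it, wordfreq is a small Python package that answers one question well: how common is a word in a given language? A quick sketch of typical usage, with the caveat that the underlying corpus is now frozen at its last pre-announcement snapshot:

```python
# Querying wordfreq's (now frozen) word-frequency data.
# pip install wordfreq
from wordfreq import word_frequency, zipf_frequency

# Plain frequency: the estimated proportion of running text that is "cat".
print(word_frequency("cat", "en"))        # a small number, on the order of 1e-4

# Zipf scale: log10 of frequency per billion words, roughly 1 (rare) to 7 (very common).
print(zipf_frequency("cat", "en"))        # mid-range, a common word
print(zipf_frequency("recursion", "en"))  # noticeably lower, a technical term
```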

Another big development this week was the release of OpenAI's Deep Research feature, which goes off and genuinely mines a literature corpus for information. I've grudgingly upgraded to their expensive Pro tier to try it out and will report my findings in a future post. The ability to generate papers has moved well beyond the grammar-based generators that the Problematic Paper Screener can filter out, so this arms race is unlikely to end well if we're pinning our hopes on detecting AI-generated papers. The publish-or-perish model is already dead; at least our Cambridge promotion process is more enlightened than "just" counting papers!