The AIETF arrives, and not a moment too soon / Feb 2025
The IETF announced their new AI Preferences Working Group (AIPREF), which will "work on standardizing building blocks that allow for the expression of preferences about how content is collected and processed for Artificial Intelligence models". This is quite well timed; the IETF tries not to standardise too early, before there is running code, but also needs to move before it's too late and a bad de facto standard is chosen. The AI world seems to be at that nexus point right about now, with GPT 4.5 seemingly hitting a scaling wall and possibly triggering the start of a renewed data scraping frenzy.
How do websites interact with AI crawlers right now?
I've found, while developing my own website, that there are a number of approaches to interacting with automated data crawlers. For the record, over 90% of the traffic to this site comes from automated sources, so this is a material concern for self-hosting infrastructure.
1. Ban all bots; humans only plz: I don't want to do this, as I'd like to opt my writing into training the next generation of foundation models, but I would like some agency over how much I pay for them to get that data (I am covering the bandwidth costs here, after all), so I just need them to cooperate more to avoid flooding my site. If I do want to ban them, the excellent ai-robots crew maintain a useful list of bad bots.
2. Ban some bots with a robots.txt: RFC9309 lets you discriminate between web crawlers via a robots.txt file. We nowadays have not just a few big crawlers mirroring the Internet (like Googlebot and Bingbot), but seemingly thousands of variants competing in the data gold rush (or, in my case, for conservation research!). The robots.txt doesn't give us enough control to usefully rate-limit across all of these, unfortunately, and you need to regenerate the file every time there are new URLs on the site that don't fit a longest-prefix match (a sketch of what the format can and can't express follows this list). This, combined with a mega sitemaps file, is a lot of non-cacheable metadata that just adds to my serving load.
3. Add server-side throttling for specific bots: On the assumption that there are a bunch of bad bots mimicking good bots, what I really need is to start rate-throttling them all! This is where I am today; I ended up hacking together a bunch of OCaml code for this website to track all the robots' request rates and slow down over-eager ones (a minimal version of the idea is sketched after this list). The rest of the Internet is mostly just asking Cloudflare to take care of this for them, which results in a world of pain for anyone outside of their world view.
4. Just give the bots what they want, which is Markdown: Since I can't really win the throttling wars in the long term, can I just give the bots the core text without all the HTML around it? The first thing these crawlers do is tokenize the HTML anyway! The emerging llms.txt convention is aimed at exactly this. I author my website in Markdown in the first place, and then transform it into the HTML you see here. But it looks like the llms.txt guidelines call for a single page at the root of the site rather than one Markdown file per page (see the example after this list). This is probably better for reducing crawling traffic, but it would be a large page even for my humble homepage.
5. Can I just give you a tarball with my stuff so you leave me alone?: I rebuild my site regularly, so I could just provide the AI bots with a convenient tar/zip of my entire website content, but put it in a common place so I don't have to pay for the download bandwidth. This could include my images, videos, and source Markdown, which could be used not only for training but for archival as well. We don't seem to have a common protocol to map URLs to static archives right now, although there are a few web archive formats flying around.
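To make the robots.txt point concrete, here is a minimal sketch of the kind of rules RFC9309 lets you express; the crawler names and paths are illustrative. You can group user agents and allow or disallow path prefixes per group, but there is no standard way to say "no more than n requests per minute": the widely copied Crawl-delay directive is not part of the RFC and many crawlers ignore it.

```
# Let the big search crawlers mirror everything
User-agent: Googlebot
User-agent: Bingbot
Allow: /

# Opt one training crawler out of the whole site
User-agent: GPTBot
Disallow: /

# Everyone else: stay out of generated search pages
User-agent: *
Disallow: /search/
```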
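The OCaml that actually runs on this site is more involved, but the core idea behind the throttling is simple enough to sketch. The following is a minimal, hypothetical version using only the standard library plus the unix package: bucket request timestamps per user agent over a sliding window and flag clients that exceed a budget. A real handler would answer a flagged request with an HTTP 429 and a Retry-After header.

```ocaml
(* Minimal sketch of per-crawler throttling, not the code this site
   actually runs. Requires the unix library for gettimeofday. *)

let window_secs = 60.0   (* sliding window length in seconds *)
let max_requests = 30    (* illustrative per-window request budget *)

(* user agent -> timestamps of recent requests, newest first *)
let recent : (string, float list) Hashtbl.t = Hashtbl.create 97

(* Record a request from [user_agent] and report whether it should be
   slowed down (e.g. answered with HTTP 429 plus a Retry-After). *)
let should_throttle user_agent =
  let now = Unix.gettimeofday () in
  let past = try Hashtbl.find recent user_agent with Not_found -> [] in
  let live = now :: List.filter (fun t -> now -. t < window_secs) past in
  Hashtbl.replace recent user_agent live;
  List.length live > max_requests
```

Of course, this keys on the self-reported User-Agent header, which is exactly the verification gap discussed further below.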
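For reference, the llms.txt guidelines describe a single Markdown file served from the site root: an H1 title, a short blockquote summary, and then sections of links for an LLM to follow. A minimal sketch, with placeholder titles and URLs:

```markdown
# Example Homepage

> Personal site with research notes, papers and talks.

## Notes

- [The AIETF arrives](https://example.org/notes/aipref.md): thoughts on the AIPREF working group
- [Conservation crawlers](https://example.org/notes/crawlers.md)

## Optional

- [Full archive](https://example.org/archive.md)
```

Even in this compact form, a site with a few hundred posts ends up with one very long index page, which is the concern above.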
The role of the IETF is to create protocols, not mandate implementations
The IETF has a valuable role to play here to establish a consensus around what a sensible, usable protocol for exchanging data on our websites might look like, rather than mandating any specific backend technology or storage format. There is a lot of nuance around sharing content over HTTP: it supports authentication, caching, access control, rate limiting, and many other features hidden behind a seemingly simple request-response specification.
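Much of that machinery is already there in ordinary HTTP. For example (header values illustrative), a server can ask an over-eager client to back off with nothing more than:

```
HTTP/1.1 429 Too Many Requests
Retry-After: 120
Cache-Control: no-store
```

Caching policy (Cache-Control), authentication challenges (WWW-Authenticate) and conditional requests all work the same way. What's missing is an agreed vocabulary for AI-specific preferences, and a way for crawlers to identify themselves strongly enough that applying these mechanisms per-bot is worthwhile.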
I'm hoping that the AIPREF process will end up with something closer to option 5 above than option 1. I need an HTTP-based mechanism by which I can express my preferences for AI crawling, and cooperate with the crawlers so that I can ensure maximum collective benefit to both the people and the bots visiting my site, rather than withdrawing behind a gated community of humans only. However, I think this requires the establishment of a protocol to help sequence the HTTP requests together, and not just a single static file like llms.txt or sitemap.xml.
Back in the 90s, I worked on NetApp/NetCache with John Martin. Bandwidth used to be expensive, and so we deployed edge caches that could apply local modifications to common global content. Consider, for example, a local news website that might want to show mostly cached global news, but also modify the HTML to include local news content. You can do that today via JavaScript, but back then the only way was to have a protocol for modifying the static HTML in transit. The Internet Content Adaptation Protocol was the IETF's answer to creating a structured HTTP-like protocol to allow edge modifications from proxy servers:
ICAP is, in essence, a lightweight protocol for executing a "remote procedure call" on HTTP messages. It allows ICAP clients to pass HTTP messages to ICAP servers for some sort of transformation or other processing ("adaptation"). The server executes its transformation service on messages and sends back responses to the client, usually with modified messages. Typically, the adapted messages are either HTTP requests or HTTP responses. -- RFC3507, IETF
One of the coolest features of ICAP is that it didn't mandate the transformation mechanism, just the protocol. The proxies deployed at the edge networks got a vector into transforming the data stream. NetCache shipped an implementation of ICAP, and Squid still supports it. What would a similar approach look like for allowing crawlers into your site's content, while leaving the details to be delegated to the crawlers and servers?
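For a flavour of what that delegation looked like, an ICAP client (the edge cache) wraps the origin's HTTP message and hands it to an ICAP server for adaptation. A skeletal RESPMOD exchange is sketched below; the service name is a placeholder, the byte offsets in the Encapsulated header are illustrative, and the bracketed sections stand in for the wrapped HTTP message.

```
RESPMOD icap://icap.example.net/adapt ICAP/1.0
Host: icap.example.net
Encapsulated: req-hdr=0, res-hdr=137, res-body=296

[encapsulated HTTP request headers]
[encapsulated HTTP response headers]
[chunked HTTP response body for the ICAP server to transform]
```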
Challenges in an open data hoovering protocol
Antoine Fressancourt identifies the main problem facing AIPref:
Given the reports that some current LLM models have been trained on data corpus obtained illegally, I have some doubts that AIPref will be respected.
This is true for the current generation of data crawlers, but it is also where the opportunity lies for AIPref. Without a systematic way to support replication of non-public data, the situation will get even worse as custom apertures are created into data silos without any integrity underlying them.
The main reason for having a protocol-based solution is that we could support the strong authentication and identification of bots. If (for example) Googlebot supplied a token with every HTTP request to fetch my content, I could track its use and perhaps even be compensated for the bandwidth costs. The current methods of bot verification all seem quite weak; they amount to little more than IP-based checks.
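For comparison, the verification that crawler operators document today is essentially a reverse-then-forward DNS check on the client address. A minimal sketch of that check in OCaml follows; the domain suffixes and error handling are illustrative, and it assumes the unix library.

```ocaml
(* Sketch of the weak verification crawler operators document today:
   reverse-resolve the client IP, check the hostname falls under a
   domain the operator publishes, then forward-resolve that hostname
   and confirm it maps back to the same address. Suffix list is
   illustrative, not exhaustive. *)

let crawler_domains = [ ".googlebot.com"; ".search.msn.com" ]

let verified_by_dns ip =
  try
    let addr = Unix.inet_addr_of_string ip in
    let host = (Unix.gethostbyaddr addr).Unix.h_name in
    let domain_ok =
      List.exists (fun d -> Filename.check_suffix host d) crawler_domains
    in
    (* forward confirmation: the claimed hostname must resolve back to ip *)
    let forward = (Unix.gethostbyname host).Unix.h_addr_list in
    domain_ok && Array.exists (fun a -> a = addr) forward
  with Not_found | Failure _ | Unix.Unix_error _ -> false
```

A token or signature carried in the request itself would be both cheaper to check and a hook for accounting, which is where the protocol work could help.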
This would in turn open a path to disciplined, bilateral negotiation over access-controlled data between crawlers and hosters. More and more content publishers are signing various exclusive deals with AI training companies. Irrespective of your opinion on such deals, a protocol that makes it easier to authenticate bots strongly would make the establishment (and ongoing negotiation) of those mechanisms far easier to handle.
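One plausible building block already exists in the form of HTTP Message Signatures (RFC9421), where a crawler signs selected parts of each request with a key it publishes. A request along those lines is sketched below; the bot name, key identifier and signature value are placeholders, and whether AIPREF adopts anything like this is entirely an open question.

```
GET /news.html HTTP/1.1
Host: example.org
User-Agent: ExampleTrainingBot/1.0
Signature-Input: sig1=("@method" "@authority" "@path");created=1740000000;keyid="examplebot-2025"
Signature: sig1=:bW9jay1zaWduYXR1cmUtcGxhY2Vob2xkZXI=:
```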
We are also seeing rapid adoption of the Model Context Protocol released a few months ago. This establishes a JSON-RPC specification for LLM clients and data providers to talk to each other locally. It seems odd to me that we'd have a rich "local" specification for data exchange like this for RAG-like systems, but not have one in the wide area across the Internet (a sketch of such a local exchange follows below). As the chair of the AIPREF group Mark Nottingham notes, platform advantages are not just network effects, so there may be deep repercussions for the economics of AI here:
In short: there are less-recognised structural forces that push key Internet services into centralized, real-time advertising-supported platforms. Along with factors like network effects and access to data, they explain some of why the Internet landscape looks like it does. -- Mark Nottingham
Just substitute "advertising-supported" with "AI" above and the trend becomes clear. The protocol designs we choose today will form structural forces that decide what the post-advertising Internet's culture and content architecture look like. It would be a nice outcome to establish open protocols, somewhere in between the MCP clients and HTTP servers, that facilitate a more equitable outcome rather than pooling all the data with a few big players.
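To make the asymmetry concrete, this is roughly what the "local" side already looks like: an MCP client asking a server for a document over JSON-RPC, request followed by response. The method name follows the MCP resources API, while the URI and contents are placeholders; there is no equivalently rich, mutually authenticated conversation defined for a crawler talking to a web server across the Internet.

```json
{ "jsonrpc": "2.0", "id": 1, "method": "resources/read",
  "params": { "uri": "https://example.org/notes/aipref.md" } }

{ "jsonrpc": "2.0", "id": 1,
  "result": { "contents": [ {
    "uri": "https://example.org/notes/aipref.md",
    "mimeType": "text/markdown",
    "text": "...markdown source of the page..." } ] } }
```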
The other consideration here is that such an open protocol could have utility far beyond "just" managing AI training bots, by addressing the general problem that replicating datasets with access control is difficult. This would help the good folk at Archive.org to manage restricted-access datasets that might eventually want to become open. There are also geospatial datasets, such as biodiversity data, that need help managing how they are mirrored, but with access restrictions for geopolitical reasons.
Luckily, the IETF do a lot of things over email, so I've signed up to the AIPREF mailing list to learn more as it develops and hopefully participate!
Changelog. Mar 1st 2025: Thanks to Michael Dales for spotting typos, and Antoine Fressancourt for helpful clarifying questions on Bluesky.