Four Ps for Building Massive Collective Knowledge Systems / Nov 2025 / DOI: 10.59350/418q4-gng78
I've been building some big collective knowledge systems lately, and have been mulling over the design principles that should underpin them.
I found the perfect place to codify this at the ARIA Workshop on Collective Flourishing that I attended recently.
Will building these collective knowledge systems be a transformative capability for human society? Hot on the heels of COP30 concluding indecisively, I've been getting excited about the prospect of biodiversity decision making taking a more positive path via IPBES. We could empower decision makers at all scales (local, national, international) to move five times faster on actions around global species extinctions, unsustainable wildlife trade and food security, while rapidly assimilating extraordinarily complex evidence chains. I'll return to this while explaining the principles...
This post is split up into a few parts. First, let's introduce ARIA's convening role in this. Then I'll introduce the principles of permanence, provenance, permission, and placement. The post does assume some knowledge of Internet protocols; I'll write a more accessible one for a general audience at a future date when the ideas are more baked!
Collective Flourishing at ARIA
The ARIA workshop was held in lovely Birmingham, hosted by programme manager Nicole Wheeler. It explored four core beliefs that ARIA had published about this opportunity space in collective flourishing:
- Navigating towards a better future requires clarity on direction and path → we need the capability to make systemic complexity legible so we can envision and deliberate over radically different futures.
- Simply defining our intent for the future is not enough → we need a means of negotiating our fragmented values into shared, actionable plans for collective progress.
- Our current cognitive, emotional, and social characteristics are not immutable constants → human capacity can and will change over time, and we need tools to figure out together how we navigate this change.
- Capabilities that augment our vision, action, and capacity are powerful and can have unintended consequences → we must balance the pressing need for these tools with the immense responsibility they entail. -- ARIA Collective Flourishing Opportunity Space, 2025
I agree with these values, and translating these into concrete systems concepts seems a useful exercise. The workshop was under Chatham House rules, which hamstrings my ability to credit individuals, but the gathering was a useful and eclectic mix of social scientists and technologists. There was also a real sense of collective purpose: a desire to reignite UK growth and decrease inequality.

The Four Ps for Collective Knowledge Systems
First, why come up with these system design principles at all? I believe strongly both in building systems from the ground up, and also in eating my own dogfood by using whatever I build. I also define knowledge broadly: not just academic papers, but also geospatial datasets, blogs and other more conventionally "informal" knowledge sources that are increasingly complementing scholarly publishing as a source of timely knowledge. As I write this, the UK's budget announcement was released an hour early by the OBR, throwing markets into real-time uncertainty.
Towards this, several of my colleagues also maintain their own personal sites and blogs as first-class knowledge outputs. To get to the next level of collective meshing (not just for these personal sites but for the wider web of papers, datasets and informal posts), we need some shared design principles.
I'll now dive into the four principles: (i) permanence; (ii) provenance; (iii) permission; and (iv) placement.
P1: Permanence (aka DOIs for all with the Rogue Scholar!)
Firstly, knowledge that is spread around the world needs a way to be retrieved reliably. Scholarly publications, especially open-access ones, are distributed both digitally and physically and often replicated. While papers are "big" enough pieces of work to warrant this effort, what about all the other outputs we have such as (micro)blogs, social media posts, and datasets? A reliable addressing system is essential to retrieve these too, and we can build one via standard Internet protocols such as HTTP and DNS.
The eagle-eyed among you might notice that my site now has a unique DOI for most posts. A Digital Object Identifier is something you more conventionally see associated with academic papers, but thanks to the hard work of Martin Fenner we can now have them for other forms of content! Martin set up the Rogue Scholar, which lets any standards-compliant site with an Atom or JSONFeed have DOIs assigned automatically.
This post, for example, has been assigned the DOI 10.59350/418q4-gng78. This can be resolved to the post's real location by retrieving the DOI URL, which issues an HTTP redirect:
$ curl -I https://doi.org/10.59350/418q4-gng78
HTTP/2 302
date: Wed, 26 Nov 2025 12:48:43 GMT
location: https://anil.recoil.org/notes/fourps-for-collective-knowledge
Crucially, this DOI URL is not the only identifier for this post, since you can still use my original homepage URL. However, it's an identifier that can be redirected to a new location if the content moves, and it also has extra metadata associated with it that helps with keeping track of networks of knowledge.
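That extra metadata can be fetched directly from the DOI infrastructure. Here is a minimal sketch, assuming the standard DOI content-negotiation behaviour, where asking doi.org for CSL JSON gets forwarded to the registration agency:

import json
import urllib.request

# Request machine-readable citation metadata (CSL JSON) for the DOI rather
# than following the redirect to the human-readable page.
req = urllib.request.Request(
    "https://doi.org/10.59350/418q4-gng78",
    headers={"Accept": "application/vnd.citationstyles.csl+json"},
)
with urllib.request.urlopen(req) as resp:
    meta = json.load(resp)

print(meta.get("title"))    # post title
print(meta.get("author"))   # structured author list, including ORCIDs where known
print(meta.get("issued"))   # publication date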

Let's look at some more details of this extra useful metadata, by peering at one of my recent posts.
Tracking authorship metadata
Firstly, the authorship information helps to identify me concretely across name variations. My ORCID forms a unique identifier for my scholarly publishing, and it is now tied to my blog posts. You can search for my ORCID to find my posts, and also find them in other indexing systems such as Crossref that index scholarly metadata. OpenAlex rewrote their codebase and released it a few weeks ago, with tens of millions of new types of works indexed.
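As a minimal sketch of what this enables, assuming the OpenAlex works endpoint with its author.orcid filter and using an illustrative ORCID rather than a real one, you can pull back every indexed work tied to an author identifier:

import json
import urllib.request

# Illustrative ORCID only; substitute the author identifier you care about.
ORCID = "0000-0002-1825-0097"

# OpenAlex exposes a works endpoint that can be filtered by an author's ORCID,
# returning papers, datasets and (increasingly) blog posts with DOIs.
url = f"https://api.openalex.org/works?filter=author.orcid:{ORCID}&per-page=5"
with urllib.request.urlopen(url) as resp:
    works = json.load(resp)["results"]

for w in works:
    print(w.get("publication_year"), w.get("doi"), "-", w.get("display_name"))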
Curating databases that are this large across decades inevitably leads to some inconsistencies as people move jobs and change their circumstances. Identifying "who" has done something is therefore a surprisingly tricky metadata problem, and one of the areas where persistent identifiers like ORCID really earn their keep.

Forming a reference mesh
Secondly, references from links within this post are extracted out and linked
to other DOIs. I do this by generating a structured JSONFeed
which breaks out metadata for each post by scanning the links within my source
Markdown. For example, here is an excerpt from the "references" field of this post:
{ "url": "https://doi.org/10.33774/coe-2025-rmsqf",
"doi": "10.33774/coe-2025-rmsqf",
"cito": [ "citesAsSourceDocument" ] }, {
"url": "https://doi.org/10.59350/hasmq-vj807",
"doi": "10.59350/hasmq-vj807",
"cito": [ "citesAsRelated" ] },
This structured list of references also uses CiTO (Citation Typing Ontology) conventions to indicate how each citation should be interpreted, which may be useful input to LLMs that are interpreting a document. I've published an OCaml JSONFeed library that conveniently lists all the possible citation structures. This reference metadata is hoovered up by databases such as Crossref, which use it to maintain the giant graph databases that associate posts, papers and anything else with a DOI with each other.
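As a minimal sketch of how a crawler might consume this, assuming a hypothetical feed URL and the per-item "references" structure shown in the excerpt above, extracting the typed citation graph is just a matter of walking the JSONFeed:

import json
import urllib.request

# Hypothetical feed URL; any JSONFeed that carries per-item "references" works.
FEED = "https://example.org/feed.json"

with urllib.request.urlopen(FEED) as resp:
    feed = json.load(resp)

# Each item may carry a list of references, each with a DOI and CiTO intents.
for item in feed.get("items", []):
    for ref in item.get("references", []):
        cito = ",".join(ref.get("cito", []))
        print(f'{item.get("url")} -> {ref.get("doi")} [{cito}]')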
To make this as easy as possible to do with any blog content online, Rogue Scholar has augmented how it scans posts so that just adding a "References" header to your content is enough to make this just work. We now have an interconnected mesh of links between diverse blogs and papers and datasets, all using simple URLs!
Archiving and versioning posts
Thirdly, the metadata and Atom feeds are used to archive the contents of posts via the Internet Archive's Archive-It service. This is also not as straightforward as you might expect: the problem with archiving HTML straight from the source is that the web pages you read are usually a mess of JavaScript and display logic, with the essence of the page buried inside it.
For example, look at the archive.org version of one of my posts versus the Rogue Scholar version of the same post. The latter is significantly cleaner, since the "archival" version actually uses my blog feed instead of the original HTML. The feed version strips out all the unnecessary display gunk so that it can be read by clients like NetNewsWire or Thunderbird. Some work does need to happen on the Atom feed generation side to really make this clean; for example, I learnt how to lay out footnotes so that they are feed-reader friendly.
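As a minimal sketch of why the feed route is cleaner, assuming a hypothetical Atom feed URL, an archiver can pull the full post content straight out of the feed entries without touching the rendered HTML at all:

import urllib.request
import xml.etree.ElementTree as ET

# Hypothetical feed URL; any Atom feed that embeds full post content will do.
FEED = "https://example.org/atom.xml"
ATOM = "{http://www.w3.org/2005/Atom}"

with urllib.request.urlopen(FEED) as resp:
    tree = ET.parse(resp)

# Each Atom entry carries a stable id, a title and the full content to archive,
# free of the JavaScript and display logic wrapped around the live page.
for entry in tree.getroot().findall(f"{ATOM}entry"):
    entry_id = entry.findtext(f"{ATOM}id")
    title = entry.findtext(f"{ATOM}title")
    content = entry.findtext(f"{ATOM}content") or ""
    print(entry_id, title, f"({len(content)} chars archived)")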

To wrap up the first P of Permanence, we've seen that it's a bit more involved than "simply archiving it". Some metadata curation and formatting flexibility really helps to clean up the connections. If you have your own blog, you should sign up to Rogue Scholar. Martin has just incorporated it as a German non-profit organisation, showing he's thinking about the long-term sustainability of such ventures as well.
P2: Provenance (is it AI poison or rare literature?)
The enormous problem we're facing with collective intelligence right now is that the Internet is getting flooded by AI-generated slop. While there are obvious dangers to our collective sanity and attention spans, there's also the pragmatic problem that recursive training causes model collapse. If we just feed our models the output of other language models, we greatly dilute the quality of the resulting LLMs and, with them, of our collective knowledge.
We observed the societal implications of this in our recent work on evidence synthesis:
The publication of ever-larger numbers of problematic papers, including fake ones generated by artificial intelligence, represents an existential crisis for the established way of doing evidence synthesis. But with a new approach, AI might also save the day. -- Will AI speed up literature reviews or derail them entirely?, 2025
We urgently need to build accurate provenance information into our collective knowledge networks to distinguish where some piece of knowledge came from. Efforts like Rogue Scholar and Kagi Small Web do this by human judgement: a community keeps an eye on the feeds and filters out the obviously bad actors.
Luckily though, we do have some partial solutions already for keeping track of provenance:
- Code can be versioned through Git, now widely adopted, but also federated via Tangled.
- Data can be traced through services like Zenodo and even given DOIs, just as Rogue Scholar has been doing. This is not perfect yet since it's difficult to continuously update large datasets, but technology is steadily advancing here.
- Code and data can be versioned through dataflow systems, of which there are many out there, including several we discussed at PROPL 2025, such as Aadi Seth's dynamic STACs, our own OCurrent, or Nature+CodeOcean for scientific computation.
- Rogue Scholar supports DOI versioning of posts to allow intentional edits of the same content.
What's missing is a provenance protocol by which each of these "islands of provenance" can interoperate across each other's boundaries. Almost every project runs its own CI systems that never share the details of how they got their data and code. Security organisations are now recommending that a Software Bill of Materials be generated for all software, and Docker Hardened Images are acting as an anchor for wider efforts in this space. The IETF is moving to advance provenance standards, but perhaps too slowly and conservatively given the speed at which generated content is flooding the Internet.
An area I'm going to investigate in the future is how HTTP-based provenance headers might help glue these together, so that a collective knowledge crawler doesn't need to build a global provenance graph (which would be overwhelmingly massive) in order to filter out untrusted primary content.
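To make that concrete, here is a purely illustrative sketch; the header name, payload fields and trust list below are all hypothetical rather than any existing standard, but they show how a crawler might filter fetched pages on a provenance claim before ingesting them:

import json
import urllib.request

# Hypothetical header and trusted registrars; nothing here is a real standard yet.
PROVENANCE_HEADER = "X-Provenance"
TRUSTED_REGISTRARS = {"rogue-scholar.org", "zenodo.org", "doi.org"}

def fetch_if_provenanced(url):
    """Fetch a page, but only keep it if it declares a trusted provenance."""
    with urllib.request.urlopen(url) as resp:
        header = resp.headers.get(PROVENANCE_HEADER)
        if header is None:
            return None  # no provenance claim: skip rather than ingest blindly
        claim = json.loads(header)  # e.g. {"registrar": "...", "doi": "..."}
        if claim.get("registrar") not in TRUSTED_REGISTRARS:
            return None
        return resp.read()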

P3: Permission (not everything needs to go into the Borg)
The Internet is pretty good at building giant public databases, and it's also pretty good at storing secret data. However, it's terrible at supporting semi-private access to remote sites.
Consider a really obvious collective knowledge case: I want to share the draft papers I'm working on with a diverse group of people. I collaborate with dozens of people all over the world, and so want to selectively grant them access to my works-in-progress. Why is this so difficult to do?
It's currently easy to use individual services to grant access; for example, I might share my Overleaf or my Google Drive for a project, but propagating those access rights across services is near impossible as soon as you cross a project or API boundary. There are a few directions we could go to break this problem down into easier to solve chunks:
- If we make it easier to self-host services, for example via initiatives like Eilean, then having direct access to the databases makes it much easier to take nuanced decisions about which bits of the data to grant access to. I run, for example, three separate video hosting sites: one for OCaml, one for the EEG, and another personally. Each of these federates with the others via ActivityPub, but still supports private videos.
- There was research into distributed permission protocols like Macaroons at the height of the cloud boom a decade ago, but they've all been swallowed up into the bottomless pit of pain that is OAuth. It's high time we resurrected some of that more nuanced work on fine-grained authorisation that doesn't give access to absolutely everything and/or SMS you at 2am requesting a verification code (see the sketch after this list).
- Rather than 'yes/no' decisions, we could also share different views of the data depending on who's asking. This used to be difficult due to the combinatorics involved, but you could imagine nowadays applying a local LLM to figure out the rich context. The DeepMind Concordia project takes this idea even further with social simulations based on the same principles.
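As a minimal sketch of the Macaroons idea, using the pymacaroons library with a made-up secret key and caveats standing in for a real draft-paper sharing service, a server can mint a token whose scope is narrowed by caveats and later verify that a presented token satisfies exactly those constraints:

from pymacaroons import Macaroon, Verifier

# Secret known only to the service holding the drafts (made up for illustration).
key = "a long random secret held by the draft-paper server"

# Mint a macaroon and attenuate it: this token only grants read access to one draft.
m = Macaroon(location="https://drafts.example.org",
             identifier="collaborator-token-42",
             key=key)
m.add_first_party_caveat("paper = fourps-draft")
m.add_first_party_caveat("action = read")

token = m.serialize()  # hand this string to a collaborator

# On each request, verify the presented token against the caveats we require.
v = Verifier()
v.satisfy_exact("paper = fourps-draft")
v.satisfy_exact("action = read")
presented = Macaroon.deserialize(token)
print(v.verify(presented, key))  # True only if every caveat is satisfied

The nice property is that whoever holds the token can further attenuate it with extra caveats before passing it on, without ever contacting the issuing server.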
When we look at the data underpinning global environmental decision making, the need for this kind of nuanced permissioning becomes very concrete.
Biodiversity needs spatial permissioning
Zooming out to a global use case, biodiversity data is a prime example of where
everything can't be open. Economically motivated rational actors (i.e.
poachers) are highly incentivised to use all available data to figure out where
to snarf a rare species, and so some of this presence data is vital to keep
tight control over. But the pendulum swings both ways: without robust permission mechanisms to share the data with well-intentioned actors, we cannot make evidence-driven planning decisions on global topics such as food security and species extinction.
I asked Neil Burgess at UNEP-WCMC for a sense of the tools and systems they run, and here is an excerpt of his list:
- Protected Planet, 10s of thousands of users. The world's most trusted, up-to-date, and complete source of information on protected areas and other effective area-based conservation measures (OECMs). Includes effectiveness of protected and conserved area management; updated monthly with submissions from governments, NGOs, landowners, and communities.
- UN Biodiversity Lab, 1000s of governments, NGO users. A geospatial platform with 400+ of the world’s best data layers on nature, climate change, and sustainable development. Supports country-led efforts for planning, monitoring, and reporting; linked to the Convention on Biological Diversity's global nature agreements.
- CITES Wildlife Trade View, 1000s of government and NGO users. Visualizes legal wildlife trade globally, by species and by country.
- CITES Wildlife Trade Database, 10000s of government, NGO, research users. Contains records of all legal wildlife trade under CITES (>40,900 species globally).
- IBAT, 1000s of businesses. Spatial tool for businesses to calculate potential impacts on nature. IBAT is an alliance of BirdLife International, Conservation International, IUCN, and UNEP-WCMC. -- An excerpt of UNEP-WCMC tools and systems (N. Burgess, personal communication, 2025)
This is just a short excerpt from the list, and many of these involve illegal activities (tracking them, not doing them!). The value in connecting them together and making them safely accessible by both humans and AI agents would be transformative to the global effort to save species from extinction, for example by carefully picking and choosing which trade agreements are signed between countries. A real-time version could even change the course of history at pivotal global biodiversity conferences, where negotiations decide the future of many species.
So I'm making the case that we must engineer robust permission protocols into the heart of how we share data, and not just for copyright and legal reasons. Some data must stay private for security, economic or geopolitical reasons, but that act of hiding knowledge currently makes it very difficult to take part in a collective knowledge network with our current training architectures. Perhaps federated learning will be one breakthrough, but I'm betting on getting the permission protocols themselves right first.
P4: Placement (data has weight, and geopolitics matters)
The final P is one that we thought we wouldn't need to worry about thanks to the cloud back in the day: placement. A lot of the digital data involved in our lives is spatial in nature (e.g. our movement data), and some of it must only be accessed from certain locations. If we don't engineer in location as a first-class element of how we treat collective knowledge, it'll never be a truly useful knowledge companion to humans.
Physical location matters a lot for knowledge queries
We explained some of these spatial ideas in our recent paper on spatial networking:
Physical containment creates a natural network hierarchy, yet we do not currently take advantage of this. Even local interactions between devices often require traversal over a wide-area network (WAN), with consequences for privacy, robustness, and latency.
Instead, devices in the same room should communicate directly, while physical barriers should require explicit networking gateways. We call this spatial networking: instead of overlaying virtual addresses over physical network connections, we use physical spaces to constrain virtual network addresses.
This lets users point at two devices and address them by their physical relationship; devices are named by their location, policies are scoped by physical boundaries, and spaces naturally compose while maintaining local autonomy. -- An Architecture for Spatial Networking, Millar et al, 2025
Think about all the times in your life that you've wanted to pull up some specific knowledge about the region you're in, and how bad our digital systems currently are at dealing with fine-grained location. I go to the gym every Sunday morning like clockwork with a friend, yet none of our devices take advantage of that regular physical co-location to share anything locally.
Similarly, if you have a group of people in a meeting room, they should be able to use their physical proximity to take advantage of that inherent trust! For example, photos weren't allowed at the ARIA workshop due to the Chatham House rules, but it would have been really useful to be able to get a copy of other people's photos as a personal record of all the amazing whiteboard brainstorming that was going on.
Protocol support for placement, combined with permissioning above, would allow us to build a personal knowledge network that actually fits into our lives based on where we physically are.
Where code and data are hosted also matters
When I started working on self-hosting my own source code and data, I quickly hit the problem that the names we use are usually welded to a particular hosting provider (think of how a GitHub URL bakes the host into the identifier).
An alternative is to decouple the names of the code and data from where it's hosted.
This is a feature explicitly supported by the AT Protocol that underpins Bluesky.
There are a growing number of services now being built on ATProto beyond Bluesky itself. The one we are using most here in my group is Tangled, which is a code hosting service that I've written about before.
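As a minimal sketch of that decoupling (using an illustrative handle; the resolveHandle XRPC call and the PLC directory are the standard public lookup endpoints for did:plc identities), you can follow a human-readable name to its DID and then to whichever Personal Data Server currently hosts the account's data:

import json
import urllib.request

# Illustrative handle; any AT Protocol handle resolves the same way.
HANDLE = "example.bsky.social"

# Step 1: resolve the handle to a stable decentralised identifier (DID).
url = f"https://bsky.social/xrpc/com.atproto.identity.resolveHandle?handle={HANDLE}"
with urllib.request.urlopen(url) as resp:
    did = json.load(resp)["did"]

# Step 2: look up the DID document, which names the server currently hosting
# the data. Moving host means updating this document, not renaming anything.
with urllib.request.urlopen(f"https://plc.directory/{did}") as resp:
    doc = json.load(resp)

for service in doc.get("service", []):
    print(did, "is hosted at", service.get("serviceEndpoint"))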
To wrap up the principle of placement: I've made a case for why explicit control over locations (of people, of code, of data, of predictions) matters a lot for collective intelligence, and should be factored into any system architecture for it. If you don't believe me, try asking your nearest LLM what the best pub near you is, and watch it hallucinate and burn!

Next Directions
I jotted down these four principles to help organise my thoughts, and they're by no means set in stone. I am reasonably convinced that the momentum building around ATProto usage worldwide makes it a compelling place to focus prototyping and research efforts, and the ATProto developers are already working on plugging gaps such as permission support. If you'd like to work on this or have pointers for me, please do let me know! I'll update this post as they come in.
(I'd like to thank many people for giving me input and ideas into this post, many of whom are cited above. In particular, Sadiq Jaffer, Shane Weisz, Michael Dales, Patrick Ferris, Cyrus Omar, Aadi Seth, Michael Coblenz, Jon Sterling, Nate Foster, Aurojit Panda, Ian Brown, Srinivasan Keshav, Jon Crowcroft, Ryan Gibb, Josh Millar, Hamed Haddadi, Sam Reynolds, Alec Christie, Bill Sutherland, Violeta Muñoz-Fuentes and Neil Burgess have all poured in relevant ideas, along with the wider ATProto and ActivityPub communities)
References
- Madhavapeddy et al (2025). Steps towards an Ecology for the Internet. Association for Computing Machinery. 10.1145/3744169.3744180
- Madhavapeddy (2025). Royal Society's Future of Scientific Publishing meeting. 10.59350/nmcab-py710
- Feng et al (2025). TESSERA: Temporal Embeddings of Surface Spectra for Earth Representation and Analysis. arXiv. 10.48550/arXiv.2506.20380
- Madhavapeddy (2025). A fully AI-generated paper just passed peer review; notes from our evidence synthesis workshop. 10.59350/k540h-6h993
- Madhavapeddy (2025). Oh my Claude, we need agentic copilot sandboxing right now. 10.59350/aecmt-k3h39
- Dales et al (2025). Yirgacheffe: A Declarative Approach to Geospatial Data. Association for Computing Machinery. 10.1145/3759536.3763806
- Madhavapeddy (2025). Thoughts on the National Data Library and private research data. 10.59350/fk6vy-5q841
- Madhavapeddy (2025). What I learnt at ICFP/SPLASH 2025 about OCaml, Hazel and FP. 10.59350/w1jvt-8qc58
- Jaffer et al (2025). AI-assisted Living Evidence Databases for Conservation Science. Cambridge Open Engage. 10.33774/coe-2025-rmsqf
- Millar et al (2025). An Architecture for Spatial Networking. arXiv. 10.48550/arXiv.2507.22687
- Reynolds et al (2025). Will AI speed up literature reviews or derail them entirely? Nature Publishing Group. 10.1038/d41586-025-02069-w
- Madhavapeddy (2025). Socially self-hosting source code with Tangled on Bluesky. 10.59350/r80vb-7b441
- Madhavapeddy (2025). Programming for the Planet at ICFP/SPLASH 2025. 10.59350/hasmq-vj807
- Madhavapeddy (2025). The AIETF arrives, and not a moment too soon. 10.59350/agfta-8wk09
- Madhavapeddy (2025). Arise Bushel, my sixth generation oxidised website. 10.59350/0r62w-c8g63
- Madhavapeddy (2025). Exploring the biodiversity impacts of what we choose to eat. 10.59350/xj427-y3q48
- Madhavapeddy (2025). GeoTessera Python library released for geospatial embeddings. 10.59350/7hy6m-1rq76
- Madhavapeddy (2025). mlgpx is the first Tangled-hosted package available on opam. 10.59350/7267y-nj702
- Madhavapeddy (2025). Using AT Proto for more than just Bluesky posts. 10.59350/32rdt-zny05
- Fenner (2025). Rogue Scholar is becoming a German Non-Profit Organization. Front Matter. 10.53731/rftfk-qv692
- Fenner (2025). Rogue Scholar starts supporting versioning. Front Matter. 10.53731/nxp08-a9947
- Gibb (2025). Eilean. Front Matter. 10.59350/s621r-eg143
- Shumailov et al (2024). AI models collapse when trained on recursively generated data. Nature. 10.1038/s41586-024-07566-y
- Laud et al (2025). STACD: STAC Extension with DAGs for Geospatial Data and Algorithm Management. 10.1145/3759536.3763803

