.plan-26-13: Oxidised, standardised, and syndicated

Anil Madhavapeddy

doi:10.59350/ddx61-wd948

#tessera #oxcaml #zarr #standards #ai #spatial #web #policy · 29 Mar 2026

Publishing the OxCaml Labs year-one review, POSSE and AI content disclosure for the web, adopting the geo-embeddings Zarr convention for TESSERA, action PROPL at PLDI, the death of the grant application, and NASA's new swathe lidar mission.

1 OxCaml Labs: year one in review

I've spent the last couple of weeks writing up a comprehensive summary of everything my little OxCaml Labs group has done over the past year (with a lot of help from the team reviewing and contributing). When we started out again after OCaml Labs concluded, we decided that regular blogging was going to be part of our collective culture, and I'm delighted to have so much content out there about our research! Academia is no longer just about publishing papers but also about more dynamic communication to keep up with the pulse of current events.

The OxCaml page therefore now covers the three pillars of our work: OCaml stewardship (compiler, CI, odoc, ppxlib, opam), live programming (browser-based OCaml, TESSERA notebooks, etc), and AI-assisted development in OCaml (benchmarks, AoAH, vibecoding). Feedback is extremely welcome!

In secret I'm actually most proud of the progress of my office plants this past year; it's kind of epic.

2 Syndicating naturally with POSSE

Related to the question of blogging, I've been thinking about how to make posts from this site flow more naturally to other networks; right now I publish Atom feeds but then manually post as appropriate to the socials. There was a good discussion on HN about POSSE ("Publish on your Own Site, Syndicate Elsewhere"). This is the indieweb principle of owning your content and cross-posting summaries to silos like LinkedIn or more open networks like Bluesky and Mastodon.

The HN thread did surface an uncomfortable tension: a lot of the time, cross-posting looks like link dumping rather than genuine community participation. But on the flip side, I don't have time to carefully figure out the nuances of each social network. I'm also not keen on using LLMs to tailor my own writing per platform (which makes POSSE more of a "write once, adapt per audience" workflow). So right now I'm keeping the Atom feed as the canonical firehose of content from this site, and I manually post to specific communities (like the OCaml Discuss or Bluesky) when I have something to say to that audience.

One technical thing I learnt is that <link rel="canonical"> should point back to the original post, which this site should be doing but doesn't yet. The question in my mind is whether Atom feeds already serve this purpose or whether I need explicit syndication links per destination. This connects to the Tangled work as well: if I'm publishing to ATProto via StandardSite, the canonical URL should really be my own domain rather than the PDS. Something to prototype when I'm back from leave. (Update: I got asked what I meant here more concretely. Consider this post being syndicated to https://leaflet.pub or https://pckt.blog, at which point all those URLs should have a canonical link back to this post. I'm also not sure if Atom feeds need to do the same thing from this original site or not.)
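To make the canonical link idea concrete, here is a minimal sketch of the tag that each syndicated copy would carry in its HTML head. The helper name and the example URL are hypothetical placeholders, not anything this site or the syndication targets actually use.

```python
from html import escape


def canonical_link(original_url: str) -> str:
    """Build the <link rel="canonical"> tag pointing back at the original post."""
    # escape(..., quote=True) keeps characters such as & and " safe inside the attribute.
    return f'<link rel="canonical" href="{escape(original_url, quote=True)}">'


# Hypothetical URL standing in for the original post on the author's own domain.
ORIGINAL_POST = "https://example.org/notes/plan-26-13"

# Every syndicated copy, wherever it is hosted, would embed this same tag.
print(canonical_link(ORIGINAL_POST))
```

The point of the sketch is that the tag depends only on the original URL, so the same string can be reused for every destination.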

2.1 Disclosing AI content in HTML

On a related note, I've been wondering how to clearly separate AI-generated content from human-written content on my various sites. For example thicket.dev is entirely LLM-generated (Claude summarising O(x)Caml ecosystem activity), whereas this blog is human written and the AoAH library codes are somewhere in between. Right now scrapers like search engines have no machine-readable way to tell the difference, and I think it matters quite a lot to distinguish them.

Toby Jaffey pointed me at the AI Content Disclosure proposal, which is now a W3C Community Group. It proposes an ai-disclosure HTML attribute at page- or element-level granularity, and there's also a page-level <meta name="ai-disclosure"> for uniform pages. The values are none, ai-assisted, ai-generated, autonomous, and mixed, with optional ai-model and ai-provider attributes, which seems reasonable... I'm going to integrate this into Ruminant/Thicket on my next foray into that.
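As a rough sketch of what emitting this markup could look like, the helpers below validate a value against the list above and render either the page-level meta tag or the element-level attributes. The function names are hypothetical, the model and provider strings are placeholders, and putting the value in the meta tag's content attribute is my assumption from ordinary HTML practice rather than something checked against the proposal text.

```python
from html import escape
from typing import Optional

# The five values listed in the proposal, as described above.
ALLOWED_VALUES = {"none", "ai-assisted", "ai-generated", "autonomous", "mixed"}


def _check(value: str) -> str:
    """Reject anything outside the proposal's vocabulary."""
    if value not in ALLOWED_VALUES:
        raise ValueError(f"unknown ai-disclosure value: {value!r}")
    return value


def disclosure_meta(value: str) -> str:
    """Page-level tag for uniform pages (content attribute is an assumption)."""
    return f'<meta name="ai-disclosure" content="{_check(value)}">'


def disclosure_attrs(value: str, model: Optional[str] = None,
                     provider: Optional[str] = None) -> str:
    """Element-level attributes, with the optional ai-model and ai-provider."""
    attrs = [("ai-disclosure", _check(value))]
    if model is not None:
        attrs.append(("ai-model", model))
    if provider is not None:
        attrs.append(("ai-provider", provider))
    return " ".join(f'{k}="{escape(v, quote=True)}"' for k, v in attrs)


# A fully generated page versus one generated section inside a human-written page.
print(disclosure_meta("ai-generated"))
print(f'<section {disclosure_attrs("ai-generated", model="example-model")}>')
```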

An unexpected legal snippet I discovered is:

The EU AI Act Article 50 (effective August 2026) requires that AI-generated text content be "marked in a machine-readable format and detectable as artificially generated or manipulated." Major platforms (YouTube, Meta, TikTok) already require AI disclosure in their policies. A standard mechanism would serve both regulatory compliance and voluntary transparency. ai-content-disclosure, David E. Weekly, 2026

So this is going to be compulsory by this summer! Seems more useful than cookie popups too.

3 TESSERA joins the geo-embeddings convention

I put up a TESSERA Zarr v3 convention PR this week, documenting how community feedback reshaped my original Zarr layout. The short version is that I retired my short-lived TESSERA-specific convention in favour of a shared geo-embeddings Zarr convention that also works for other foundation models.

There are three big changes (years as a Zarr dimension, 4096-pixel shards, NCHW layout) and I've put them up live on the TZE explorer now. It's fun to see how quickly this converged once I published the RFC a few weeks ago. These changes mean a single HTTP range request can now fetch a time-series of embeddings for any 10m pixel on Earth. Let's see what people build with it all in the coming months!
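The shard arithmetic behind that lookup is simple enough to sketch. This is an illustration under assumptions, not the convention's actual reader code: it assumes shards are square tiles of 4096 pixels in the two spatial dimensions and that pixels are addressed by a global (row, col) index. The function name is hypothetical.

```python
SHARD_PIXELS = 4096      # shard edge length in pixels, from the changes above
PIXEL_METRES = 10        # TESSERA pixel size, from the text above


def locate_pixel(row: int, col: int, shard: int = SHARD_PIXELS):
    """Map a global pixel index to (shard index, offset inside that shard)."""
    if row < 0 or col < 0:
        raise ValueError("pixel indices must be non-negative")
    shard_row, off_row = divmod(row, shard)
    shard_col, off_col = divmod(col, shard)
    return (shard_row, shard_col), (off_row, off_col)


# Ground footprint of one shard edge in kilometres.
shard_km = SHARD_PIXELS * PIXEL_METRES / 1000

shard_index, offset = locate_pixel(10000, 4095)
print(shard_index, offset, shard_km)
```

With years as a dimension of the same array, the shard index does not change from year to year, which is what lets one spatial lookup cover the whole time-series.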

3.1 Semi-supervised learning for downstream tasks

Sadiq Jaffer pointed me at Lilian Weng's excellent tutorial on semi-supervised learning which covers consistency regularisation, pseudo-labelling, and hybrid methods like MixMatch and FixMatch.

I learnt quite a lot from this as I've been building downstream classifiers from TESSERA embeddings and haven't done much semi-supervised learning before. A task usually has a handful of labelled points from ecologists, and millions of unlabelled pixels, which is a pretty good fit for a semi-supervised regime.
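To show the shape of the simplest of these ideas, here is a toy pseudo-labelling loop in plain Python. It uses a nearest-centroid classifier as a stand-in (not MixMatch or FixMatch, and not what I actually run on TESSERA), 2-D points in place of real embeddings, and made-up class names; the confidence threshold of 0.9 is likewise an arbitrary assumption.

```python
import math


def centroids(points, labels):
    """Mean vector per class."""
    sums, counts = {}, {}
    for p, y in zip(points, labels):
        acc = sums.setdefault(y, [0.0] * len(p))
        for i, v in enumerate(p):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def predict(cents, p):
    """Return (label, confidence); confidence is a softmax over negative distances."""
    dists = {y: math.dist(c, p) for y, c in cents.items()}
    nearest = min(dists.values())
    # Subtracting the smallest distance keeps exp() from underflowing to zero.
    weights = {y: math.exp(-(d - nearest)) for y, d in dists.items()}
    best = max(weights, key=weights.get)
    return best, weights[best] / sum(weights.values())


def pseudo_label(labelled, labels, unlabelled, threshold=0.9, rounds=3):
    """Repeatedly adopt confident predictions on unlabelled points as labels."""
    points, ys, pool = list(labelled), list(labels), list(unlabelled)
    for _ in range(rounds):
        cents = centroids(points, ys)
        undecided = []
        for p in pool:
            y, conf = predict(cents, p)
            if conf >= threshold:
                points.append(p)
                ys.append(y)
            else:
                undecided.append(p)
        if len(undecided) == len(pool):
            break  # nothing new was confident enough, so stop early
        pool = undecided
    return centroids(points, ys), pool


# Two labelled points, three unlabelled ones (all hypothetical).
model, leftover = pseudo_label(
    [(0.0, 0.0), (10.0, 10.0)], ["grass", "forest"],
    [(0.5, 0.2), (9.5, 9.9), (5.0, 5.0)],
)
print(sorted(model), leftover)
```

The ambiguous midpoint stays unlabelled rather than being forced into a class, which is the behaviour the threshold is there to provide.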

4 PROPL 2026: action stations

PROPL 2026 will return for its third outing, co-located with the big PLDI conference in Boulder, Colorado (June 15-19th). After two rounds of excellent talks that generated enthusiasm and our first published proceedings, we're shifting to an "action PROPL" format.

Rather than traditional paper presentations this year, each session will start with a brief talk and then move into working sessions where participants contribute to shared documents for a concrete planetary compute engine architecture. We're planning on covering everything from frontend design and programming models to AI sandboxing. We're still setting up a lightweight submission site, but start thinking about your ideas!

5 The grant application is dead

Tom Loosemore's blog pointed me to Tom Waston writing about "The Grant Application Is Dead. What Comes Next?". His argument is that LLMs have made polished grant applications trivially cheap to produce, breaking the information-scarcity filter that traditional review rounds relied on. I've had exactly the same problem from a reviewer perspective: the signal-to-noise ratio on proposals has cratered, and I've almost given up being a reviewer since I can't tell which proposals had genuine thought behind them and which were vibed into existence in an afternoon. It's different from research papers because grant proposals have very few results to evaluate. Maybe we've been vibe-reviewing them all along...

Something has to change; Tom's proposal is a reputation network of sorts. This is also what vibe coding will probably lead to for open source communities, but Michael Dales points out to me that such networks lead to great asymmetries in communities. Still, I'm planning to think more about this in the coming quarter.

And lest I sound too grumpy: I'm on the Royal Society Newton International Fellowship review committee this year, and I've just finished all the reviews for EuroSys 2026 (both fall and winter rounds, which was a lot of work). The good applications and papers are bloody excellent; not everything is LLMed yet, and original work stands out. The problem is that the review burden, in terms of the hours I'm spending, isn't scaling well, since even one AI paper that I fail to detect buries the original work under an ever-growing pile.

6 Fun links

6.1 A wasm geocoding playground

The Walkthru Geocoding Playground is a browser-based geocoder running DuckDB-WASM over Overture Maps addresses, entirely client-side with no backend. This has made me rethink my OpenStreetMap approach; rather than building OCaml protobuf parsers and DuckDB bindings from scratch, perhaps the smarter move is to target DuckDB-WASM as the primary query layer over Overture or OSM's Parquet files and keep everything in the browser.

The reason this gave me pause is that this architecture is similar to what I've been doing with TESSERA Zarr streaming. That involves HTTP range requests into columnar formats, also processed client-side. I need to think more about how this connects to the geocaml stack we're building; perhaps DuckDB-WASM for quick exploration and native OCaml code for production pipelines on servers.

6.2 Papers of the week

"A global biodiversity use data infrastructure acknowledging indigenous and local knowledge" argues for a biodiversity data infrastructure that properly integrates indigenous and local knowledge alongside remote sensing and eDNA. This connects directly to our recent biodiversity framework and discusses CARE principles and data sovereignty. The paper makes the case that current biodiversity databases systematically undervalue knowledge held by indigenous communities, and proposes governance structures to fix this. My collective knowledge principles include "permission" as a core principle, but this paper pushes me to think harder about what that means when the data holders are communities rather than institutions, and also what spatial permissioning looks like in practice.

After a (to put it mildly) tense few years for US science funding, NASA has greenlit two cool earth science missions: the Earth Dynamics Geodetic Explorer (EDGE) and INCUS (a convective storm mapper). EDGE is potentially game-changing for our remote sensing work. Unlike GEDI, which samples sparse footprints, EDGE does swathe lidar, scanning five 120m strips simultaneously and mapping the planet’s ice and land to within 3cm (!!!). This means wall-to-wall canopy height data rather than the statistical interpolation we currently rely on, which in turn directly improves canopy height estimates, forest carbon calculations and other downstream tasks. Exciting stuff!

The Earth Dynamics Geodetic Explorer (EDGE) will map the elevation of the planet’s ice and land to within 3 centimeters on flat ground. A laser instrument on the satellite will measure height across five 120-meter strips, allowing it to cover almost all of the planet’s surface more quickly than current instruments in orbit, such as ICESat-2 and GEDI. NASA greenlights two earth science missions, to researchers’ relief, Science, 2026

See also Amelia Holcomb's wonderful talk about GEDI analysis in this week's group seminar!

It's also been such a crazy few weeks that I haven't had time to listen to the latest Signals and Threads on testing that dropped this week. Something for next week, as I love the work that Antithesis has been doing.

References

[1]Madhavapeddy (2026). .plan-26-11: Bins, bollards, bots and biodiversity boffins. 10.59350/kg2a8-10w32

[2]Madhavapeddy (2026). Connecting the dots for biodiversity action from the NAS/Royal Society Forum. 10.59350/dy7d3-hdt43

[3]Feng et al (2026). Applications of the TESSERA Geospatial Foundation Model to Diverse Environmental Mapping Tasks. SSRN. 10.2139/ssrn.6142416

[4]Millar et al (2025). An Architecture for Spatial Networking. arXiv. 10.48550/arXiv.2507.22687

[5]Swinfield et al (2026). Learning lessons from over-crediting to ensure additionality in forest carbon credits. Nature Publishing Group. 10.1038/s41467-026-71552-3

[6]Reynolds et al (2025). Will AI speed up literature reviews or derail them entirely?. Nature Publishing Group. 10.1038/d41586-025-02069-w

[7]Madhavapeddy (2026). TESSERA now supports the Zarr geo-embeddings convention proposal. 10.59350/c3hrq-zsx02

[8]Madhavapeddy (2026). Streaming millions of TESSERA tiles over HTTP with Zarr v3. 10.59350/tk0er-ycs46

[9]Madhavapeddy (2025). Programming for the Planet at ICFP/SPLASH 2025. 10.59350/hasmq-vj807

[10]Madhavapeddy (2025). Four Ps for Building Massive Collective Knowledge Systems. 10.59350/418q4-gng78

[11]Omar et al (2025). A FAIR Case for a Live Computational Commons. Association for Computing Machinery. 10.1145/3759536.3763802

[12]Madhavapeddy (2025). The Cambridge "Green Blue" competition to reduce emissions. 10.59350/y1g67-aq825

[13]Madhavapeddy (2025). mlgpx is the first Tangled-hosted package available on opam. 10.59350/7267y-nj702

[14]Madhavapeddy et al (2025). [No title found]. ACM. 10.1145/3759536

[15](2026). NASA greenlights two earth science missions, to researchers’ relief. 10.1126/science.zcuhiig

[16]Berthelot et al (2019). MixMatch: A Holistic Approach to Semi-Supervised Learning. https://arxiv.org/abs/1905.02249

[17]Sohn et al (2020). FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence. https://arxiv.org/abs/2001.07685
