.plan-26-13: Oxidised, standardised, and syndicated

Publishing the OxCaml Labs year-one review, POSSE and AI content disclosure for the web, adopting the geo-embeddings Zarr convention for TESSERA, action PROPL at PLDI, the death of the grant application, and NASA's new swathe lidar mission.

1 OxCaml Labs: year one in review

I've spent the last couple of weeks writing up a comprehensive summary of everything my little OxCaml Labs group has done over the past year (with a lot of help from the team reviewing and contributing). When we started out again after OCaml Labs, we decided that regular blogging was going to be part of our collective culture, and I'm delighted to have so much content out there about our research.

The OxCaml page now covers the three pillars of our work: OCaml stewardship (compiler, CI, odoc, ppxlib, opam), live programming (browser-based OCaml, TESSERA notebooks, etc), and AI-assisted development in OCaml (benchmarks, AoAH, vibecoding).

In secret I'm actually most proud of the progress of my office plants this past year, it's kind of epic

2 Syndicating naturally with POSSE

Related to the question of blogging, I've been thinking about how to make posts from this site flow more naturally to other networks; right now I publish Atom feeds but then manually post as appropriate to the socials. There was a good discussion on HN about POSSE ("Publish on your Own Site, Syndicate Elsewhere"). This is the indieweb principle of owning your content and cross-posting summaries to silos like LinkedIn or more open networks like Bluesky and Mastodon.

The HN thread did surface an uncomfortable tension: a lot of the time, cross-posting looks like link dumping rather than genuine community participation. But on the flip side, I don't have time to carefully figure out the nuances of each social network. I'm also not keen on using LLMs to tailor it per-platform (which makes POSSE more of a "write once, adapt per audience" workflow) for my own writing. So right now I'm keeping the Atom feed as the canonical firehose of content from this site, and I manually post to specific communities (like the OCaml Discuss or Bluesky) when I have something to say to that audience.

One technical thing I learnt is that <link rel="canonical"> should point back to the original post, which this site should be doing but doesn't yet. The question is whether Atom feeds already serve this purpose or whether I need explicit syndication links per destination. This connects to the Tangled work as well — if I'm publishing to ATProto via StandardSite, the canonical URL should really be my own domain rather than the PDS. Something to prototype when I'm back from leave.
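As a rough sketch of what the prototype might look like, here is a small Python example that reads an Atom feed and emits one short post per entry, each linking back to the original URL, plus a helper that renders the canonical link tag. The function names, the 300-character limit and the example.org feed are all made up for illustration; only the standard library is used.

```python
import xml.etree.ElementTree as ET
from html import escape

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom XML namespace

# Hypothetical feed used only for illustration.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example notes</title>
  <entry>
    <title>A first post</title>
    <link href="https://example.org/notes/first"/>
    <id>https://example.org/notes/first</id>
    <updated>2026-01-01T00:00:00Z</updated>
  </entry>
</feed>"""


def canonical_link(url: str) -> str:
    """Render the <link rel="canonical"> tag for a post's own URL."""
    return f'<link rel="canonical" href="{escape(url, quote=True)}">'


def syndication_posts(atom_xml: str, max_len: int = 300) -> list[str]:
    """Turn each Atom entry into a short post that links back to the original.

    max_len is an assumed per-network character limit, not a real API value.
    """
    root = ET.fromstring(atom_xml)
    posts = []
    for entry in root.iter(f"{ATOM}entry"):
        title = (entry.findtext(f"{ATOM}title") or "").strip()
        # A link with no rel attribute is treated as rel="alternate" in Atom.
        url = None
        for link in entry.findall(f"{ATOM}link"):
            if link.get("rel", "alternate") == "alternate":
                url = link.get("href")
                break
        if url is None:
            continue  # nothing on our own site to point back to
        # Leave room for a space and the URL, truncating the title if needed.
        room = max_len - len(url) - 1
        if len(title) > room:
            title = title[: max(room - 1, 0)].rstrip() + "…"
        posts.append(f"{title} {url}")
    return posts


if __name__ == "__main__":
    for post in syndication_posts(SAMPLE_FEED):
        print(post)
    print(canonical_link("https://example.org/notes/first"))
```

The point of the design is that the feed stays the single source of truth: every syndicated copy is derived from it and carries the original URL, so no per-destination state is needed beyond the feed itself.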

2.1 Disclosing AI content in HTML

On a related note, I've been wondering how to clearly separate AI-generated content from human-written content on my various sites. For example thicket.dev is entirely LLM-generated (Claude summarising O(x)Caml ecosystem activity), whereas this blog is human written and the AoAH library codes are somewhere in between. Right now scrapers like search engines have no machine-readable way to tell the difference, and I think it matters quite a lot to distinguish them.

Toby Jaffer pointed me at the AI Content Disclosure proposal, which is now a W3C Community Group. It proposes an ai-disclosure HTML attribute at a page or element-level granularity and there's also a page-level <meta name="ai-disclosure"> for uniform pages. The values are none, ai-assisted, ai-generated, autonomous, and mixed, with optional ai-model and ai-provider attributes, which seems reasonable... I'm going to integrate this into Ruminant/Thicket on my next foray into that.
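To get a feel for what the integration could involve, here is a minimal Python sketch that validates a disclosure value and renders it either as element-level attributes or as a page-level meta tag. The value list and attribute names are the ones described above; the helper names are mine, and using content= to carry the value on the meta tag is an assumption based on how HTML meta tags normally work rather than something I've checked against the proposal text.

```python
from html import escape
from typing import Optional

# Values listed in the proposal, as summarised above.
DISCLOSURE_VALUES = {"none", "ai-assisted", "ai-generated", "autonomous", "mixed"}


def _check(value: str) -> None:
    """Reject anything outside the value set described in the proposal."""
    if value not in DISCLOSURE_VALUES:
        raise ValueError(f"unknown ai-disclosure value: {value!r}")


def disclosure_attrs(value: str,
                     model: Optional[str] = None,
                     provider: Optional[str] = None) -> str:
    """Render element-level attributes, e.g. for a <section> or <article>."""
    _check(value)
    attrs = [("ai-disclosure", value)]
    if model:
        attrs.append(("ai-model", model))
    if provider:
        attrs.append(("ai-provider", provider))
    return " ".join(f'{k}="{escape(v, quote=True)}"' for k, v in attrs)


def disclosure_meta(value: str) -> str:
    """Render a page-level tag for uniformly authored pages.

    Assumption: the value is carried in content=, as with other meta tags.
    """
    _check(value)
    return f'<meta name="ai-disclosure" content="{value}">'


if __name__ == "__main__":
    # "example-model" is a placeholder, not a real model identifier.
    print(f'<section {disclosure_attrs("ai-generated", model="example-model")}>')
    print(disclosure_meta("none"))
```

A site generator could call disclosure_meta once per page for fully human or fully generated sites, and fall back to disclosure_attrs on individual elements where a page mixes both.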

An unexpected legal snippet I discovered is:

The EU AI Act Article 50 (effective August 2026) requires that AI-generated text content be "marked in a machine-readable format and detectable as artificially generated or manipulated." Major platforms (YouTube, Meta, TikTok) already require AI disclosure in their policies. A standard mechanism would serve both regulatory compliance and voluntary transparency. ai-content-disclosure, David E. Weekly, 2026

So this is going to be compulsory by this summer! Seems more useful than cookie popups too.

3 TESSERA joins the geo-embeddings convention

I put up a TESSERA Zarr v3 convention PR this week, documenting how community feedback reshaped my original Zarr layout. The short version is that I retired my short-lived TESSERA-specific convention in favour of a shared geo-embeddings Zarr convention that also works for other foundation models.

There are three big changes (years as a Zarr dimension, 4096-pixel shards, NCHW layout) and I've stuck them up live on the TZE explorer now. It's fun to see how quickly this converged once I published the RFC a few weeks ago. These changes mean a single HTTP range request can now fetch a time-series of embeddings for any 10m pixel on Earth. Let's see what people build with it all in the coming months!
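To illustrate the addressing arithmetic, here is a small Python sketch that maps a global pixel coordinate to the shard that holds it and the offset inside that shard. It assumes shards are 4096 pixels square along the two spatial axes and that the array axes are ordered (year, channel, row, column); the names are hypothetical, and the real convention should be taken from the PR rather than from this sketch.

```python
from typing import NamedTuple

SHARD_PIXELS = 4096  # shard edge length in pixels (assumed square)


class ShardAddress(NamedTuple):
    shard_row: int   # which shard along the row axis
    shard_col: int   # which shard along the column axis
    local_row: int   # pixel row inside that shard
    local_col: int   # pixel column inside that shard


def locate_pixel(row: int, col: int, shard: int = SHARD_PIXELS) -> ShardAddress:
    """Find the shard containing a global pixel and the pixel's local offset."""
    if row < 0 or col < 0:
        raise ValueError("pixel coordinates must be non-negative")
    shard_row, local_row = divmod(row, shard)
    shard_col, local_col = divmod(col, shard)
    return ShardAddress(shard_row, shard_col, local_row, local_col)


def time_series_index(row: int, col: int) -> tuple:
    """Index selecting every year and channel for one pixel.

    Assumes axes ordered (year, channel, row, col); usable as array[index]
    on a NumPy-style array with that layout.
    """
    return (slice(None), slice(None), row, col)


if __name__ == "__main__":
    print(locate_pixel(10_000, 4_095))
    print(time_series_index(10_000, 4_095))
```

Because years are a dimension of the same array rather than separate stores, the time series for a pixel is addressed by one shard lookup plus a local offset, instead of one lookup per year.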

3.1 Semi-supervised learning for downstream tasks

Sadiq Jaffer pointed me at Lilian Weng's excellent tutorial on semi-supervised learning which covers consistency regularisation, pseudo-labelling, and hybrid methods like MixMatch and FixMatch.

I learnt quite a lot from this as I've been building downstream classifiers from TESSERA embeddings and haven't done much semi-supervised learning before. A task usually has a handful of labelled points from ecologists, and millions of unlabelled pixels, which is a pretty good fit for a semi-supervised regime.
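To make the pseudo-labelling idea concrete, here is a toy Python sketch using a nearest-centroid classifier: fit on the labelled points, label only the unlabelled points the model is confident about, refit, and repeat. This is a deliberately simple stand-in and not what our TESSERA classifiers do; the 2-D points play the role of embeddings, and the threshold, class names and confidence measure (a softmax over negative distances) are all assumptions for illustration.

```python
import math


def centroids(points, labels):
    """Mean vector per class."""
    sums, counts = {}, {}
    for p, y in zip(points, labels):
        acc = sums.setdefault(y, [0.0] * len(p))
        for i, v in enumerate(p):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def predict(cents, p):
    """Return (label, confidence) from a softmax over negative distances."""
    classes = sorted(cents)
    dists = [math.dist(p, cents[c]) for c in classes]
    nearest = min(dists)
    # Subtract the smallest distance first so the exponentials cannot all underflow.
    weights = [math.exp(-(d - nearest)) for d in dists]
    best = max(range(len(classes)), key=lambda i: weights[i])
    return classes[best], weights[best] / sum(weights)


def pseudo_label(labelled, labels, unlabelled, threshold=0.9, rounds=5):
    """Self-training: adopt confident predictions as labels, then refit."""
    points, ys = list(labelled), list(labels)
    pool = list(unlabelled)
    for _ in range(rounds):
        cents = centroids(points, ys)
        confident, rest = [], []
        for p in pool:
            y, conf = predict(cents, p)
            (confident if conf >= threshold else rest).append((p, y))
        if not confident:
            break  # nothing new to learn from
        for p, y in confident:
            points.append(p)
            ys.append(y)
        pool = [p for p, _ in rest]
    return centroids(points, ys), pool


if __name__ == "__main__":
    # Two labelled points and five unlabelled ones (toy data).
    labelled = [(0, 0), (10, 10)]
    labels = ["grass", "forest"]
    unlabelled = [(1, 0), (0, 1), (9, 10), (10, 9), (5, 5)]
    cents, remaining = pseudo_label(labelled, labels, unlabelled)
    print(cents)
    print("still unlabelled:", remaining)
```

The confidence threshold is what keeps the ambiguous midpoint out of the training set; methods like FixMatch apply the same thresholding idea but combine it with consistency under data augmentation.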

4 PROPL 2026: action stations

PROPL 2026 will return for its third outing, co-located with the big PLDI conference in Boulder, Colorado (June 15-19th). After two rounds of excellent talks that generated enthusiasm and our first published proceedings, we're shifting to an "action PROPL" format.

Rather than traditional paper presentations this year, each session will start with a brief talk and then move into working sessions where participants contribute to shared documents for a concrete planetary compute engine architecture. We're planning on covering everything from frontend design and programming models to AI sandboxing. We're still setting up a lightweight submission site, but start thinking about your ideas!

5 The grant application is dead

Tom Loosemore's blog pointed me to Tom Waston writing about "The Grant Application Is Dead. What Comes Next?". His argument is that LLMs have made polished grant applications trivially cheap to produce, breaking the information-scarcity filter that traditional review rounds relied on. I've had exactly the same problem from a reviewer perspective: the SNR on proposals has cratered, and I've almost given up being a reviewer since I can't tell which proposals had genuine thought behind them and which were vibed into existence in an afternoon. It's different from research papers because grant proposals have very few results to evaluate. Maybe we've been vibe-reviewing them all along...

Something has to change; Tom's proposal is a reputation network of sorts. This is also what vibe coding will probably lead to for open source communities, but Michael Dales points out to me that such networks lead to great asymmetries in communities. Still, I'm planning to think more about this in the coming quarter.

And lest I sound too grumpy: I'm on the Royal Society Newton International Fellowship review committee this year, and I've just finished all the reviews for EuroSys 2026 (both fall and winter rounds, which was a lot of work). The good applications and papers are bloody excellent: not everything is LLMed yet, and original work stands out. The problem is that the review burden, in terms of the hours I'm spending, isn't scaling well, since even one AI paper that I don't detect helps bury the original work under an ever-growing pile.

6.1 A wasm geocoding playground

The Walkthru Geocoding Playground is a browser-based geocoder running DuckDB-WASM over Overture Maps addresses, entirely client-side with no backend. This has made me rethink my OpenStreetMap approach; rather than building OCaml protobuf parsers and DuckDB bindings from scratch, perhaps the smarter move is to target DuckDB-WASM as the primary query layer over Overture or OSM's Parquet files and keep everything in the browser.

The reason I took pause is that this architecture is similar to what I've been doing with TESSERA Zarr streaming. That involves HTTP range requests into columnar formats, also processed client-side. I need to think more about how this connects to the geocaml stack we're building; perhaps DuckDB-WASM for quick exploration and native OCaml code for production pipelines on servers.

6.2 Papers of the week

"A global biodiversity use data infrastructure acknowledging indigenous and local knowledge" argues for a biodiversity data infrastructure that properly integrates indigenous and local knowledge alongside remote sensing and eDNA. This connects directly to our recent biodiversity framework and discusses CARE principles and data sovereignty. The paper makes the case that current biodiversity databases systematically undervalue knowledge held by indigenous communities, and proposes governance structures to fix this. My collective knowledge principles include "permission" as a core principle, but this paper pushes me to think harder about what that means when the data holders are communities rather than institutions, and also what spatial permissioning looks like in practice.

After a (to put it mildly) tense few years for US science funding, NASA has greenlit two cool earth science missions: the Earth Dynamics Geodetic Explorer (EDGE) and INCUS (a convective storm mapper). EDGE is potentially game-changing for our remote sensing work. Unlike GEDI which samples sparse footprints, EDGE does swathe lidar, scanning five 120m strips simultaneously and mapping the planet’s ice and land to within 3cm (!!!). That means wall-to-wall canopy height data rather than the statistical interpolation we currently rely on, which in turn directly improves canopy height estimates, forest carbon calculations and other downstream tasks. Exciting stuff!

The Earth Dynamics Geodetic Explorer (EDGE) will map the elevation of the planet’s ice and land to within 3 centimeters on flat ground. A laser instrument on the satellite will measure height across five 120-meter strips, allowing it to cover almost all of the planet’s surface more quickly than current instruments in orbit, such as ICESat-2 and GEDI. NASA greenlights two earth science missions, to researchers’ relief, Science, 2026

See also Amelia Holcomb's wonderful talk about GEDI analysis in this week's group seminar!

It's also been such a crazy few weeks that I haven't had time to listen to the latest Signals and Threads on testing that dropped this week. Something for next week, as I love the work that Antithesis has been doing.

References

[1]Madhavapeddy (2026). .plan-26-11: Bins, bollards, bots and biodiversity boffins. 10.59350/kg2a8-10w32
[2]Madhavapeddy (2026). Connecting the dots for biodiversity action from the NAS/Royal Society Forum. 10.59350/dy7d3-hdt43
[3]Feng et al (2026). Applications of the TESSERA Geospatial Foundation Model to Diverse Environmental Mapping Tasks. SSRN. 10.2139/ssrn.6142416
[4]Millar et al (2025). An Architecture for Spatial Networking. arXiv. 10.48550/arXiv.2507.22687
[5]Swinfield et al (2025). Learning lessons from over-crediting to ensure additionality in forest carbon credits. Cambridge Open Engage. 10.33774/coe-2025-29fk2
[6]Reynolds et al (2025). Will AI speed up literature reviews or derail them entirely?. Nature Publishing Group. 10.1038/d41586-025-02069-w
[7]Madhavapeddy (2026). TESSERA now supports the Zarr geo-embeddings convention proposal. 10.59350/c3hrq-zsx02
[8]Madhavapeddy (2026). Streaming millions of TESSERA tiles over HTTP with Zarr v3. 10.59350/tk0er-ycs46
[9]Madhavapeddy (2025). Programming for the Planet at ICFP/SPLASH 2025. 10.59350/hasmq-vj807
[10]Madhavapeddy (2025). Four Ps for Building Massive Collective Knowledge Systems. 10.59350/418q4-gng78
[11]Omar et al (2025). A FAIR Case for a Live Computational Commons. Association for Computing Machinery. 10.1145/3759536.3763802
[12]Madhavapeddy (2025). The Cambridge "Green Blue" competition to reduce emissions. 10.59350/y1g67-aq825
[13]Madhavapeddy (2025). mlgpx is the first Tangled-hosted package available on opam. 10.59350/7267y-nj702
[14]Madhavapeddy et al (2025). [No title found]. ACM. 10.1145/3759536
[15](2026). NASA greenlights two earth science missions, to researchers’ relief. 10.1126/science.zcuhiig
[16]Berthelot et al (2019). MixMatch: A Holistic Approach to Semi-Supervised Learning. https://arxiv.org/abs/1905.02249
[17]Sohn et al (2020). FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence. https://arxiv.org/abs/2001.07685