London Climate Action Week fell in the midst of a searing heatwave, which made a fitting backdrop for the Cloud-Native Geospatial Forum meeting (the first outside the US!). The venue was the Jellicoe, where ARIA is based on the top floor with a panoramic view over the City. CNG is a Radiant Earth initiative that I joined last year when I heard about their drive to make geospatial data available as a public good. This gathering was a rather excellent collection of 50 practitioners who were geeking out on coordinate systems and Zarr access patterns. Here are my notes of the day...
1 Unlocking the value of geospatial data in the commons
Jed Sundwall is the CEO of Radiant Earth, and is the person who previously founded AWS Open Data when on the social responsibility team at Amazon. He opened proceedings by explaining that the CNG consortium aims to bring together data users with wildly different budgets (big corps, individual contributors, etc) but who all have the same geospatial problems. His idea is that "data has a lot of potential energy stored inside it, and the sweet spot is how we maximise the potential value of that data."
CNG exists to turn this data into something useful by promoting modern cloud-native methods to drive down the cost of the public good overall. If you've read my writing recently, you'll know that the biggest professional problem in my life is juggling TESSERA embeddings without running out of disk space and/or crashing the Cambridge University egress bandwidth from eager downloaders! So I'm enormously excited at the idea of having a public data commons to help share the load.

Jed then showed us the actual unit economics of their operation. Source Cooperative is Radiant Earth's data-publishing utility built on cloud object storage. As of a few months ago, it has 6.16 PB stored, 739M objects, over a billion requests a month, at a blended cost of around $20/terabyte/month. Most vendors guard their cost base for competitive reasons, but as a not-for-profit they share it so the community can reason about what it actually costs to share data at scale.
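To get a feel for what those figures imply, here's a back-of-envelope calculation. It assumes decimal units (1 PB = 1000 TB) and that the blended rate applies uniformly to everything stored; both are my assumptions rather than anything Jed stated.

```python
# Back-of-envelope numbers from the figures quoted above.
# Assumptions (mine, not from the talk): decimal units, and the blended
# $/TB/month rate applied uniformly to all stored data.
STORED_PB = 6.16
OBJECTS = 739_000_000
COST_PER_TB_MONTH_USD = 20.0
TB_PER_PB = 1000
BYTES_PER_TB = 10**12

stored_tb = STORED_PB * TB_PER_PB
monthly_cost_usd = stored_tb * COST_PER_TB_MONTH_USD
annual_cost_usd = monthly_cost_usd * 12
mean_object_mb = stored_tb * BYTES_PER_TB / OBJECTS / 10**6

print(f"monthly cost:     ${monthly_cost_usd:,.0f}")
print(f"annual cost:      ${annual_cost_usd:,.0f}")
print(f"mean object size: {mean_object_mb:.1f} MB")
```

Under those assumptions the bill is roughly $123,000 a month (about $1.5M a year), and the mean object is around 8 MB.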
An important technical point is that Source Coop uses Cloudflare R2 as their CDN, so hot data is edge-cached around the world. Making TESSERA easier to access was a big feature request at the first Indian hackathon recently, as the international bandwidth there was pretty bad.

2 From maps to decision systems
Luca Budello, the Geospatial Lead at Innovate UK Business Connect and formerly of the CCI, gave his policy view learnt from running the GeoAI Festival. He noted that decision-makers are interested in "reliable answers to real problems, grounded in trustworthy data", so data provenance and trustworthiness matter enormously.
Luca traced how every wave of geospatial innovation has changed the way we interpret the world: first maps drew the world, then dashboards queried it, then cloud APIs let us program it, then reporting and analytics informed decisions, and now the next step is decision systems that automate them.

The worrying part here is the jump from analytical workflows we can predict (deterministic, linear, static data) to these future decision systems that are probabilistic and dynamic and running continuous loops that can act directly on the world. AI is being imposed on us fast, so the obvious question is what the checks and balances are on these future decision systems. His answer was to treat trust as a federated property, and to engineer the federation protocols so that these systems are explainable, accountable, auditable and trackable.
The analogy Luca used was that "GeoAI needs its open banking moment" to unlock the value of data the way open banking did for finance, and that the UK has a genuine advantage here with decades of authoritative, temporally rich, high-integrity data. The policy scaffolding is arriving (AI growth zones, BridgeAI and a sovereign AI fund), which roughly mirrors the EU's approach but unfortunately with rather less funding. He noted that a recent major AI policy report (I missed which one exactly) mentioned geospatial exactly once (urban planning), which is a strange omission given how foundational land-use planning is for a government.
3 Making Earth observation embeddings actionable
Earth Genome is a mission-driven non-profit out of California, behind ClimateTRACE and a dozen-plus geospatial products, funded by a hybrid of targeted projects and philanthropic donations for R&D. Noelia Jiménez Martínez and Glen Low walked us through their Earth Index work to make foundation-model embeddings usable by non-experts:
In the short time that Earth Index has been available, we’ve been amazed by the impact our users have made. Just to highlight a few: they’ve exposed narcotrafficking airstrips in the Peruvian Amazon; mapped illegal palm oil expansion in Brazil; uncovered hazardous quarries in the Balkans; and even mapped how rose farming is contributing to wetland loss in Uganda. -- Earth Genome, 2026
Their worked example was quantifying Jamaican seagrass with The Nature Conservancy. They used 70 datasets across four regions, blending field surveys, drone and high-resolution imagery with interpolation and modelling. The outputs need geolocalisation, a class (seagrass or not) and a density estimate as input data.
I was of course delighted to see our own Tessera embeddings up on the slide alongside AlphaEarth (Google, 64-dim) and OLMo Earth (Allen AI, vision transformer at 768-dim). Reassuringly, all three embedding models beat the no-embedding baseline comfortably on a benthic-environment case (using the Allen Coral Atlas ground truth). Their Tessera tests were using v1.0, and afterwards I explained how v1.1 has better coastal maps and so should see even better performance.


Glen's closing thoughts were that these systems should be globally comprehensive but also locally useful. This usually comes down to partnering and roles; choose where you add the most (multi-benefit) value; ensure that data and AI have actual users with a human-in-the-loop. Jamaica seagrass is an example of getting to a practical outcome rather than just frontier AI for its own sake.
4 Wrangling multidimensional data
Sol Cotton from Open Climate Fix used an excellent terminal/markdown presentation to guide us through the large multidimensional datasets challenge for cloud-native workflows, and how careful Zarr chunking strategies have transformed access efficiency across multiple dimensions.
This topic aligned well with the Icechunk discussions at PROPL last week. It looks like the geospatial community is converging on chunked/sharded, compressed, cloud-native ndarray storage. The main remaining question I have is how to find optimal chunking strategies for the queries and data appends, and also to layer a query interface over it (but I believe Icechunk has this capability).
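To illustrate why the chunking choice matters so much, here's a minimal sketch that counts how many chunks a read has to touch under two different layouts. The array shape and chunk shapes are hypothetical numbers I picked for illustration, not anything from Sol's talk, and the function is plain arithmetic rather than a call into any Zarr library.

```python
from typing import Sequence, Tuple

def chunks_touched(selection: Sequence[Tuple[int, int]],
                   chunk_shape: Sequence[int]) -> int:
    """Count chunks overlapped by a hyper-rectangular read.

    selection holds one half-open (start, stop) index range per dimension.
    """
    total = 1
    for (start, stop), size in zip(selection, chunk_shape):
        first = start // size          # index of first chunk along this axis
        last = (stop - 1) // size      # index of last chunk along this axis
        total *= last - first + 1
    return total

# Hypothetical daily raster for one year: (time, y, x) = (365, 4096, 4096).
pixel_time_series = [(0, 365), (100, 101), (200, 201)]
single_day_map = [(10, 11), (0, 4096), (0, 4096)]

map_friendly = (1, 4096, 4096)     # one chunk per time step
series_friendly = (365, 64, 64)    # full time axis, small spatial tiles

for name, chunks in [("map-friendly", map_friendly),
                     ("series-friendly", series_friendly)]:
    print(name,
          "time series:", chunks_touched(pixel_time_series, chunks),
          "single map:", chunks_touched(single_day_map, chunks))
```

Each layout is ideal for one query and painful for the other, which is why finding a chunking that balances several access patterns (plus appends along the time axis) is the hard part.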
I asked in the Q&A how coordinate transforms are handled, since in Tessera we use a MegaZarr with one group per UTM zone, which requires some client-side stitching for ROIs that span UTM zones. There's no satisfactory answer for this yet, except perhaps shifting to equal size projections in the future.
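The first step of that client-side stitching is working out which zones an ROI overlaps. Here's a minimal sketch using the standard 6-degree UTM zone formula. The function names are hypothetical (this isn't geotessera's API), and it ignores the Norway/Svalbard zone exceptions and ROIs that cross the antimeridian.

```python
from typing import List

def utm_zone(lon_deg: float) -> int:
    """Standard 6-degree UTM zone number (1-60) for a longitude in degrees."""
    if not -180.0 <= lon_deg <= 180.0:
        raise ValueError("longitude must be within [-180, 180]")
    # lon = 180 would otherwise fall into a non-existent zone 61.
    return min(int((lon_deg + 180.0) // 6) + 1, 60)

def zones_for_bbox(west: float, east: float) -> List[int]:
    """UTM zones overlapped by a west..east longitude range."""
    if west > east:
        raise ValueError("antimeridian-crossing boxes are not handled here")
    return list(range(utm_zone(west), utm_zone(east) + 1))

# An ROI straddling the Greenwich meridian needs two zones stitched together.
print(zones_for_bbox(-0.5, 0.5))
# A small ROI just east of the meridian only needs one.
print(zones_for_bbox(0.1, 0.2))
```

Any ROI that returns more than one zone needs its per-zone reads reprojected onto a common grid before they can be mosaicked, which is the fiddly bit.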

I also gave a talk of my own in this session on Tessera, first explaining global 10m pixel-wise embeddings with open weights, our move to Zarr v3 and a preview of the 1.1 and (forthcoming) 2.0 models. It was a happy coincidence to follow Earth Genome with them having just independently benchmarked us, as it made it much easier for me to motivate some of our recent improvements on coastal regions!
The reception to my talk from the audience was awesome; I spent most of the rest of my attendance chatting with people about it all. Questions ranged from whether we could help with weather (answer: yes coming soon!), ice caps (nope but a similar approach might work), ocean (nope but see Laure Zanna's work) and ecological modelling (see below).
It's quite difficult to point people to our eeg.zulipchat.com chat service when discussing in person, so I'll look into printing some Tessera 'project cards' with QR codes and links that we can hand to people. This seems more useful than personal business cards (which I haven't used in years!)

5 Getting ground-level nature data into the cloud
Echo Labs (represented by the wonderful Molly Blank and Kaja Wasik) were up next, talking about how to turn ecological complexity into useful signal. As background, Echo Labs is an FRO backed by ARIA and Convergent Research, and the team visited the CCI earlier in the year. They've also just launched their shiny new website this week!
Echo want to take fragmented, multimodal ground-level ecological data and transform it into representations of ecosystem condition ('ecosystem vectors'), as a shared foundation for measuring change and evaluating impact of interventions on the ground.

Their proposed primitive for the representation is an ecosystem state vector, which is a compact representation that fuses different ground-level modalities (camera traps, acoustics, sensors) into one object the client can compute over.
Tessera is an obvious source of input here, but also a lot of other modalities of sensor data and ground truth info from the CLR would be useful to them as well. David Coomes has been discussing this with them since their visit to Cambridge!

Their roadmap seems sensibly staged to me, with a first sprint to prove that multimodal ground signals carry useful information in the first place. They then plan mid-term pilot projects that ground that utility in an ecological intervention context.
Longer-term, they want to release a shared resource/benchmark of embedded multimodal sensor data for research, policy and industry. I'm extremely excited to see other people intending to work on benchmarks in this space, as it's really difficult to evaluate techniques right now.

In the Q&A, I brought up a topic that Mike Harfoot and I have been discussing. We're wondering whether synthetic data generation (e.g. from a process model like Madingley) could help to accelerate the training of their ecosystem model, since ground truth data is quite sparse. This isn't on their near-term roadmap, but it is one of the things they're considering.
I had a quick chat with Stefan Istrate (who has been working with Silviu Petrovan on frog vision models) and am delighted to see that he's recently joined Echo as their head of machine learning! They're shaping up to have a very classy team indeed and I look forward to seeing how their ecosystem vectors progress.
6 The invisible settlements of Argentina
The talk I enjoyed the most was Nissim Lebovits (Radiant Earth), who explained the Barrios Visibles project (read paper as well). They used building-footprint data to surface a systematic population undercount in Argentina's informal settlements. And not just a small undercount: he reported they found some 3.4 million people missing from their national record, a significant fraction of the estimated 45 million inhabitants across the country!
RENABAP, Argentina's official registry of barrios populares, lists 1.24 million families across 6,467 settlements. But satellite imagery reveals 1.97 million buildings within those same boundaries—59% more structures than recorded families.
This isn't about the registry being outdated. RENABAP's own quality-control protocol requires that family counts match dwellings visible in satellite imagery. The gap documented here is a departure from that standard. Closing it requires methodological change, not just updated data. -- Barrios Visibles explainer, 2026

The talk was (to me anyway) an incredible demonstration of cloud-native open data doing politically consequential work. He queried the Parquet files hosted on Source Coop directly, and in just a few queries ran a full spatial cross-reference of the combined Google+Microsoft+OpenStreetMap building footprints against the official registries from Argentina.
The point of doing this over the hosted Parquet is that it's not necessary to download everything to run the query (hence the importance of the cloud native approach).
As an example, in La Plata alone, roughly 72,000 building footprints intersect polygons for which the registry lists only 34,000 families. This kind of gap seems very important to account for when budgeting services and infrastructure development in the country.
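The core of the cross-reference is simple to sketch: count building footprints that fall inside each settlement polygon, then compare against the registered family count. The version below is a toy in pure Python with made-up coordinates and a standard ray-casting point-in-polygon test on building centroids; the real analysis ran as queries over the hosted Parquet files, and I'm not claiming this is how Nissim implemented it.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Polygon = List[Point]

def point_in_polygon(pt: Point, poly: Polygon) -> bool:
    """Ray-casting test: does pt fall inside the polygon?"""
    x, y = pt
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        # Only edges that straddle the horizontal line through pt can cross it.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def surplus_percent(buildings: int, families: int) -> float:
    """How many more buildings than registered families, in percent."""
    return 100.0 * (buildings - families) / families

# Toy data (hypothetical): one square settlement with 2 registered families.
settlement: Polygon = [(0, 0), (10, 0), (10, 10), (0, 10)]
registered_families = 2
building_centroids: List[Point] = [(1, 1), (5, 5), (9, 2), (12, 3)]

buildings_inside = sum(point_in_polygon(c, settlement) for c in building_centroids)
print(buildings_inside, surplus_percent(buildings_inside, registered_families))

# The same ratio applied to the figures quoted above.
print(round(surplus_percent(1_970_000, 1_240_000)))   # national figures
print(round(surplus_percent(72_000, 34_000)))         # La Plata
```

The national figures reproduce the 59% in the explainer, and the La Plata numbers work out to roughly 112% more buildings than registered families.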

Another point that he made (shown in the video below) is that debugging/visualising this dataset is pretty easy, since the entire map is zoomable. To validate a given region, the officials just navigate directly there, find the polygons which are marked as settlements, and use normal visual satellite imagery to verify that there are in fact settlements there.
This left me wondering about the role of OpenStreetMap here: it was used as an input, but what's the mechanism to then propose updates to it so that the crowdsourced database remains accurate? I met another attendee Petya Kangalova who works for Humanitarian OpenStreetMap, which is a community of mappers focussed on the disaster response utility of the database.
Petya explained to me that HumOSM has a bunch of specialised tech products for disaster response. Two cool ones are a multiuser coordination layer for planning how to update an area, and OpenAerialMap to explore decently licensed imagery.
7 Compressing the Earth
Jacqueline Campbell of Asterisk Labs talked next. She's a planetary scientist who came to Earth's oceans via looking for life in the Mars dust, and presented "Earth Compress".
Their goal is to have open source, publicly owned, compression infrastructure for a variety of Earth data, built with the National Oceanography Centre. The domain-aware compression stack will make petabyte-scale analysis accessible to everyone rather than only to those with the biggest egress budgets:
The critical bottleneck limiting high-impact environmental research is how difficult it has become to process increasingly huge and complex datasets. To overcome this bottleneck we will build open source, AI-powered software infrastructure and data-as-AI models that are trustworthy and publicly-owned. We will not only build technology, but establish a multi-institutional cooperative, reducing current fragmentation and complexity to massively increase the number of organisations that can access and manipulate Earth-scale environmental data.
Our software infrastructure will simultaneously empower data producers (so they can easily create Earth Embedding models) and data users (so they can easily access and manipulate them). Therefore, we will enable a transformative shift, massively reducing the compute costs and complexity for all. -- Future of Environment Data Collective
This describes the problems we're having in Tessera pretty accurately. Srinivasan Keshav has also been leading an effort on our side to use residual vector quantization to dramatically shrink the size of the Tessera embeddings, so we've started a direct conversation with the Asterisk team to see how we can join forces!
In theory, this will allow for hugely faster 'sketches' of global analyses without much loss in accuracy for many downstream tasks. And because the Asterisk team is also applying the same trick to other embeddings, it'll make fusion of multimodal data sources much easier as well!
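For anyone who hasn't met residual vector quantization before, here's a minimal sketch of the idea: each stage picks the nearest codeword to whatever is left over from the previous stage, so a vector is stored as a short list of small integers. The codebooks below are hand-made 2D toys (real ones are learned from data, and have far more entries and dimensions), and this is not the implementation Keshav's effort uses.

```python
from typing import List, Sequence

Vector = List[float]
Codebook = List[Vector]

def _sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))

def make_codebook(scale: float) -> Codebook:
    """Toy codebook: the zero vector plus four diagonal directions."""
    return [[0.0, 0.0], [scale, scale], [scale, -scale],
            [-scale, scale], [-scale, -scale]]

def rvq_encode(vec: Sequence[float], codebooks: Sequence[Codebook]) -> List[int]:
    """Quantize vec stage by stage, each stage coding the previous residual."""
    residual = list(vec)
    codes = []
    for book in codebooks:
        idx = min(range(len(book)), key=lambda i: _sq_dist(residual, book[i]))
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, book[idx])]
    return codes

def rvq_decode(codes: Sequence[int], codebooks: Sequence[Codebook]) -> Vector:
    """Reconstruct by summing the chosen codeword from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, book in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, book[idx])]
    return out

codebooks = [make_codebook(0.5), make_codebook(0.25), make_codebook(0.125)]
embedding = [0.8, -0.3]

codes = rvq_encode(embedding, codebooks)
approx = rvq_decode(codes, codebooks)
err_full = _sq_dist(embedding, approx)
err_coarse = _sq_dist(embedding, rvq_decode(codes[:1], codebooks[:1]))
print(codes, approx, err_coarse, err_full)
```

Decoding only the first code gives a coarse 'sketch' and each extra stage refines it. Because every toy codebook includes the zero vector, adding a stage can never make the error worse here.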

Their architecture has a compression toolbox (classical compression, AI-based data fields, and AI embedding models like Tessera) that feeds into tailored data archives (file, columnar and vector databases) on the server side, with a transmission protocol that streams dynamic data through a manager out to decompressed data and embeddings on the client side.
I don't think there are many standards for what this custom VBR decoder might be yet, so this seems a good opportunity to establish one, much like the Zarr conventions community is doing.

8 When the developer is an agent
My laptop started running out of juice (too many demos), so my remaining notes are a bit sketchy.
Stefan Amberger, co-founder of Tilebox, made the case that Tilebox is the "operating loop" for geospatial data workflows, and asked what changes when the developer is a semi-autonomous LLM agent.
Their answer is to establish a single workflow loop ("discover -> define -> run -> observe -> improve") that's shared across three kinds of callers: humans on a console, LLM agents over MCP, and conventional software via APIs. The work is orchestrated to wherever the data is, whether that's in the cloud, on-prem or on edge devices.

As with the discussion at last week's PROPL, there's quite a wide consensus that agents will join the coding loop whether we like it or not. So the focus needs to shift to how we keep not only the data source auditable, but also the coding loop more verifiable.
I did a quick poll of the audience to find out which of the geotessera users coded by hand and which used agents. I couldn't find a single person who used my lovely library by hand. Every single one used an agent, ranging from Claude to Codex. There were no local agent users, and no Copilot users, so that's a sign of a rarefied crowd.
9 Lightning talks
The afternoon lightning round was a tour of practical pipelines. Jake Wilkins (Epoch Blue) showed how they go from days to minutes with a just-in-time pipeline for plot-level supply-chain analytics, aimed at helping companies comply with the forthcoming EUDR deforestation-compliance deadline.
I had a chance to chat to Jake afterwards and show him our FOOD provenance paper and the interactive explorer. What's really cool about Jake's work is that they're using global embeddings to calculate the probability of a commodity being produced at the 10m pixel level, whereas our (pre-Tessera) work depends on FAO provenance which is only at a national level.

The Epoch Blue process runs customer-supplied locations and addresses through a geocode-and-verify loop, calculates probabilistic supply sheds down to delineated commodity plots, and then merges this with environmental metrics (deforestation, emissions, biodiversity, water use). Jake also wrote a nice piece on using AlphaEarth embeddings to detect palm-oil mill effluent lagoons. I really want to try this with Tessera as well...

The other lightning talks were great too: Alper Dincer (Climingo) spoke on global drought mapping with H3, GeoParquet and DuckDB; Ross Slater (Leeds) on going cloud-native without the cloud for Antarctic ice dynamics; and Petya Kangalova (HOT), who I mentioned earlier, on cloud-native open imagery for disaster response. Ross has a really interesting use case which could benefit from a Tessera-style Barlow Twins approach, but using different satellite data (S1/S2 don't go that far south), which I need to think about more.
10 Panels and closing thoughts
The day closed with a panel with David Eaves (UCL), Jack Kelly (dynamical.org and Open Climate Fix), Niall Robinson (NVIDIA) and Kaja Wasik (Echo Labs). Frustratingly, the heatwave had thrown the trains into the usual chaos and I had to leg it to King's Cross to get back to Cambridge, so I missed it entirely.
I did have a chinwag with Niall though, as he's been helping us train Tessera v2 on the Isambard-AI cluster (part of the UK's AI Research Resource). Jack's dynamical.org is also publishing weather data via Icechunk, which I'm planning to use in some weather forecasting research we're doing with Tessera atm.
Jed, Luca, Niall and I all talked about how many of the day's talks came back to matters of provenance and trust. The encouraging thing is that I think we now have many of the pieces in place to do something concrete about it, especially as last week's PROPL living document showed how many PL researchers want to dive into this problem alongside systems people. An ATProto-native trust graph like Tangled's evidence-backed vouching (which I wrote about a few weeks ago) could also anchor data provenance to an identity graph that's reusable across different services (see Semble for example), and so support evidence-driven practice.
This was a great first experience of the cloud-native geospatial community in London for me! Thanks to Jed and Radiant Earth for convening it; next time, ideally, in slightly cooler weather, but the coffee in the Jellicoe was top notch so that made up for the burns!

