TESSERA now supports the Zarr geo-embeddings convention proposal

Community feedback reshaped our Zarr store layout — years became a dimension, shards got bigger, and we retired the TESSERA-specific convention in favour of a shared geo-embeddings standard that also covers other models.

The ink's barely dry on my last RFC post about streaming millions of TESSERA tiles over HTTP using Zarr v3 sharding, and it's already out of date! The store layout I'd carefully crafted lasted about 48 hours before a discussion with Jeff Albrecht and Len Strnad convinced me to rethink my approach.

This led me to build a "MegaZarr" unified store, and retire our TESSERA-specific convention to instead integrate with a shared geo-embeddings Zarr convention that's appeared in the last week. Things are moving really quickly in the world of geospatial embeddings; I'll explain some of the gory details about the store here, and here's a video to show it in action or try it for yourself:

1 Arise, the MegaZarrrrr

The original layout had a separate Zarr store per year, an HWB array layout, and 256-pixel shards with very fine tile granularity. After the discussions, instead of a Zarr store per year, we now have a single store with time as a first-class dimension, which I dubbed the MegaZarr. The unified array shape is now (T, B, H, W) — effectively NCHW ordering, which is easier to use in ML frameworks like PyTorch. Each year's data lives in its own temporal slice, and the time coordinate array stores the years explicitly, so it can be indexed by the integer year directly.
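The year-coordinate idea can be sketched in a few lines of NumPy; the array sizes here are shrunk for illustration, and the lookup helper is mine rather than anything in the store itself:

```python
import numpy as np

# Minimal sketch of the unified (T, B, H, W) layout with an explicit year
# coordinate; shapes shrunk for illustration.
years = np.arange(2017, 2026)                        # time coordinate array
store = np.zeros((len(years), 128, 4, 4), np.int8)   # zero-filled upfront

t = int(np.flatnonzero(years == 2021)[0])  # map the integer year to an index
year_slice = store[t]                      # (B, H, W) slice for 2021
print(year_slice.shape)
```

Because the coordinate holds the years themselves, a client never has to guess which temporal offset corresponds to which year.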

You can see this in the video above in the new 'explorer' mode, where you can go to https://tze.geotessera.org, click around individual tiles and 10 m pixels, and see the embeddings all the way from 2017 to 2025 where available. We can't easily prepend years to an existing store without rewriting the whole thing, so we zero-fill the entire year range upfront and selectively replace years as they're generated. We use dual sentinels in the scales array to mark the gaps: +inf for "land but no data yet" and NaN for water areas that don't have valid embeddings. This lets a single isfinite() call identify the valid terrestrial pixels in a tile.
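The dual-sentinel scheme is easy to see on a toy tile; the 2×2 scales array below is hypothetical, but the masking logic is exactly the single-call check described above:

```python
import numpy as np

# Hypothetical 2x2 scales tile: +inf marks land with no data generated yet,
# NaN marks water pixels that will never hold valid embeddings.
scales = np.array([[0.02, np.inf],
                   [np.nan, 0.03]], dtype=np.float32)

valid = np.isfinite(scales)     # valid terrestrial pixels, one call
pending = np.isposinf(scales)   # land awaiting generation
water = np.isnan(scales)        # permanently masked water
print(valid)
```

The nice property is that the two sentinel classes stay distinguishable (isposinf vs isnan) while a single isfinite handles the common "is this pixel usable?" question.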

Switching from HWB to NCHW format also aligns with how ML frameworks expect data: all 128 bands for a pixel sit in a single inner chunk, which is what downstream tasks usually need. The common operation of grabbing a single year is also fine, since the time dimension coming first makes temporal slicing cheap.
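One way to see why the band axis being unchunked matters: mapping a pixel to its inner chunk never splits across bands, so one chunk read returns the whole embedding vector. The helper below is illustrative, not part of any library:

```python
# Which inner chunk holds a pixel, given chunk shape (1, 128, 32, 32)?
# The band axis is unchunked, so a single chunk read returns the whole
# 128-dim embedding vector for that pixel (helper is illustrative).
def chunk_index(t, y, x, chunk=(1, 128, 32, 32)):
    return (t // chunk[0], 0, y // chunk[2], x // chunk[3])

print(chunk_index(3, 100, 4097))  # (3, 0, 3, 128)
```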

I've also rebalanced the shard/chunk sizes: shards grow from the original 256×256 to 4096×4096, with a slightly larger inner chunk of (1, 128, 32, 32). A single pixel still needs only two HTTP range requests, but each shard now covers a larger area (around 40 km across) with far fewer files on the server. The TZE explorer confirms this still works reasonably well on mobile.
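The back-of-envelope arithmetic behind those numbers, assuming the usual 10 m TESSERA pixel size:

```python
# Shard geometry, assuming 10 m pixels (illustrative arithmetic only).
pixel_m, shard_px, chunk_px = 10, 4096, 32

shard_km = shard_px * pixel_m / 1000            # ~41 km per shard side
chunks_per_shard = (shard_px // chunk_px) ** 2  # inner chunks per shard
print(shard_km, chunks_per_shard)
```

So each shard file indexes 128 × 128 = 16,384 inner chunks, and a pixel fetch is one range request into the shard index plus one into the chunk itself.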

Mark Elvers is installing SSDs today by the bushel so we can keep up with all the storage requirements here. We do need cloud hosting for all this Zarr as we're over a petabyte raw now in the University and it's nervous work...

2 New geotessera tooling around Zarr

All this also means that my hard work on geotessera will soon be obsolete! A quickstart to grab embeddings using just xarray/Zarr from Python looks like:

# params: args.store (URL or path), zone, year (int), x, y
import numpy as np
import xarray as xr

ds = xr.open_zarr(args.store, group=f"utm{zone:02d}",
       zarr_format=3, consolidated=True,
       chunks={"time": 1, "band": 128, "y": 4096, "x": 4096})
pixel = ds.sel(time=year, x=x, y=y, method="nearest")
emb_int8 = pixel["embeddings"].values
scale = float(pixel["scales"].values)
embedding = emb_int8.astype(np.float32) * scale
for i in range(min(16, len(embedding))):
    print(f"{i:4d}  {int(emb_int8[i]):5d}  {embedding[i]:10.6f}")

A few things of note here: the coordinate arrays let you index into the Zarr store using the natural year integer and the UTM x/y coordinates; after that you dequantise using the per-pixel scales array, and the named dimensions come back with the result.

There is still some housekeeping for which Geotessera remains useful alongside Zarr. Firstly, it can provide coordinate lookups and a name resolver to find a copy of the embeddings (we are getting offers to mirror and version from several other people, which is great). Secondly, the biggest downside to Zarr is that it has no quick index of sparsity, which means coverage tests can be expensive in terms of HTTP HEAD requests. So I suspect I'll keep the existing GeoParquet tile registry around to provide a quick lookup for clients that want to test large-scale tile existence. However, downloading this index will be optional, so clients can get started immediately without an expensive database check.
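A toy sketch of what a local sparsity index buys a client; the tile keys and helper below are hypothetical stand-ins for the GeoParquet registry, not its real schema:

```python
# Hypothetical in-memory stand-in for the GeoParquet tile registry:
# one local set lookup replaces an HTTP HEAD request per tile probe.
available = {("utm30", 2024, 512, 768), ("utm30", 2025, 512, 768)}

def tile_exists(zone, year, tx, ty):
    return (zone, year, tx, ty) in available

print(tile_exists("utm30", 2025, 512, 768))  # True
print(tile_exists("utm30", 2017, 512, 768))  # False
```

For a large-scale coverage test over thousands of candidate tiles, the difference between thousands of HEAD round-trips and thousands of set lookups is exactly why the registry is worth keeping.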

Using the new wrapper (which I'm tidying up for a release) means you just need to:

from geotessera.store import GeoTesseraZarr

gt = GeoTesseraZarr()  # defaults to public store
emb = gt.sample_at(-2.97, 53.44, year=2025)          # (128,) float32
mosaic, transform = gt.read_region(bbox, year=2025)  # (H, W, 128)

3 From a TESSERA convention to a community one

While I was iterating on the store layout, a Clark University hackathon on geo-embeddings created an embeddings-stac-specification repository to connect STAC with vector embeddings. We had a productive discussion about how TESSERA's QAT-compressed embeddings differ from other models.

There's also another geo-embeddings Zarr convention, which is a model-agnostic standard for describing geospatial embeddings in Zarr stores. It currently has examples for Clay (768-dim chip embeddings on a regular grid) and AEF (64-dim patch embeddings with global quantisation).

To make TESSERA fit into this, I proposed several additions to the convention:

  1. Per-pixel scale quantisation ("method": "per_pixel_scale"). Unlike the other models, TESSERA uses quantisation-aware training with a per-pixel scale array, not a single global scale. The convention therefore now supports a scale_array field pointing to the array name and a nodata field for sentinel values.

  2. UTM zone spatial layout ("spatial_layout": "utm_zones"). In order to minimise distortion due to varying pixel sizes at different points of the earth, TESSERA stores each UTM zone as a separate group rather than reprojecting to a global grid. This is a bit less seamless than the simpler global layout, but prioritises accuracy until we can measure the effect of the distortion. We use Zarr groups for this, so the root store has subgroups named utmXY that the client opens as needed.

  3. Multiple source datasets. TESSERA fuses Sentinel-1 (radar) and Sentinel-2 (optical), so geoemb:source_data now accepts an array of URLs rather than just one.

  4. Build version tracking. We're going to be releasing multiple variations and models of our embeddings as the technology advances, so geoemb:build_version also records which version of the pipeline produced the store. I'm hoping to use this to record end-to-end provenance for every single embedding.
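Pulled together, the proposed attributes might look roughly like the sketch below. Only the field names come from the proposals above; the nesting, URLs, and values are illustrative and subject to the in-flight PRs:

```python
import json

# Illustrative geoemb: attributes; structure and values are assumptions,
# only the field names come from the convention proposals above.
attrs = {
    "geoemb:quantization": {
        "method": "per_pixel_scale",
        "scale_array": "scales",
        "nodata": ["+inf", "NaN"],
    },
    "geoemb:spatial_layout": "utm_zones",
    "geoemb:source_data": [                  # multiple fused sources
        "https://example.org/sentinel-1",    # hypothetical URLs
        "https://example.org/sentinel-2",
    ],
    "geoemb:build_version": "v1.0.0",        # illustrative version string
}
print(json.dumps(attrs, indent=2))
```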

See the TESSERA example JSON for all this in one place. With the geoemb: convention covering everything we need, the zarr-convention-tessera repo can now be retired after a glorious two weeks of life. The Zarr web inspector confirms our live store is compliant, except for those in-flight proposals that still need to be ratified.

4 What's next

Thanks to everyone in the Zarr community for their really useful input. I wouldn't be surprised if we need to tweak my (newbie) layout a bit more yet, but comments are welcome on my PR. Once things settle, I'm back to working on the OxCaml Zarr library to add the sharding support I need.

Also to throw a bit of a clanker into my delicate Zarr work, we're working on a revision of the TESSERA model that supports Matryoshka representations, which means I need to figure out how to support slicing by dimension as well. After all this careful work in v1.0 to make the embeddings contiguous in memory, we will soon need to be able to get just the first few dimensions out of 128 to do quick sample analyses. Code doesn't survive very long any more with all the progress in training!

The live Zarr v3 TESSERA store is browsable now at the TZE explorer, and the embeddings convention is open for feedback at geo-embeddings/embeddings-zarr-convention.

References

[1] Feng et al. (2025). TESSERA: Temporal Embeddings of Surface Spectra for Earth Representation and Analysis. arXiv. 10.48550/arXiv.2506.20380
[2] Madhavapeddy (2026). Streaming millions of TESSERA tiles over HTTP with Zarr v3. 10.59350/tk0er-ycs46
[3] Madhavapeddy (2025). Four Ps for Building Massive Collective Knowledge Systems. 10.59350/418q4-gng78
[4] Madhavapeddy (2025). GeoTessera Python library released for geospatial embeddings. 10.59350/7hy6m-1rq76
[5] Kusupati et al. (2024). Matryoshka Representation Learning. arXiv. 10.48550/arXiv.2205.13147