TESSERA now supports the Zarr geo-embeddings convention proposal

Community feedback reshaped our Zarr store layout — years became a dimension, shards got bigger, and we retired the TESSERA-specific convention in favour of a shared geo-embeddings standard that also covers other models.

The ink's barely dry on my last RFC post about streaming millions of TESSERA tiles over HTTP using Zarr v3 sharding, and it's already out of date! The store layout I'd carefully crafted lasted about 48 hours before a discussion with Jeff Albrecht and Len Strnad convinced me to rethink my approach.

This led me to build a "MegaZarr" unified store, and retire our TESSERA-specific convention to instead integrate with a shared geo-embeddings Zarr convention that's appeared in the last week. Things are moving really quickly in the world of geospatial embeddings; I'll explain some of the gory details about the store here, and here's a video to show it in action or try it for yourself:

1 Arise, the MegaZarrrrr

The original layout had a separate Zarr store per year, an HWB array layout, and 256-pixel shards with very fine tile granularity. After the discussions, instead of a Zarr store per year, we now have a single store with time as a first-class dimension, which I dubbed the MegaZarr. The unified array shape is now (T, B, H, W) — effectively NCHW ordering, which is easier to use in ML frameworks like PyTorch. Each year's data lives in its own temporal slice, and the time coordinate array stores the years explicitly, so it can be indexed by the integer year directly.
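The year-coordinate idea can be sketched in a few lines of NumPy; the array sizes here are shrunk for illustration, and the lookup helper is mine rather than anything in the store itself:

```python
import numpy as np

# Minimal sketch of the unified (T, B, H, W) layout with an explicit year
# coordinate; shapes shrunk for illustration.
years = np.arange(2017, 2026)                        # time coordinate array
store = np.zeros((len(years), 128, 4, 4), np.int8)   # zero-filled upfront

t = int(np.flatnonzero(years == 2021)[0])  # map the integer year to an index
year_slice = store[t]                      # (B, H, W) slice for 2021
print(year_slice.shape)
```

Because the coordinate holds the years themselves, a client never has to guess which temporal offset corresponds to which year.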

You can see this in the video above in the new 'explorer' mode, where you can go to https://tze.geotessera.org, click around individual tiles and 10 m pixels, and see the embeddings all the way from 2017 to 2025 where available. We can't easily prepend years to an existing store without rewriting the whole thing, so we zero-fill the entire year range upfront and selectively replace years as they're generated. We use dual sentinels in the scales array to mark the gaps: +inf for "land but no data yet" and NaN for water areas that don't have valid embeddings. This lets a single isfinite() call identify the valid terrestrial pixels in a tile.
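The dual-sentinel scheme is easy to see on a toy tile; the 2×2 scales array below is hypothetical, but the masking logic is exactly the single-call check described above:

```python
import numpy as np

# Hypothetical 2x2 scales tile: +inf marks land with no data generated yet,
# NaN marks water pixels that will never hold valid embeddings.
scales = np.array([[0.02, np.inf],
                   [np.nan, 0.03]], dtype=np.float32)

valid = np.isfinite(scales)     # valid terrestrial pixels, one call
pending = np.isposinf(scales)   # land awaiting generation
water = np.isnan(scales)        # permanently masked water
print(valid)
```

The nice property is that the two sentinel classes stay distinguishable (isposinf vs isnan) while a single isfinite handles the common "is this pixel usable?" question.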

Switching from HWB to NCHW format also aligns with how ML frameworks expect data: all 128 bands for a pixel sit in a single inner chunk, which is what downstream tasks usually need. The common operation of grabbing a single year is also fine, since the time dimension coming first makes temporal slicing cheap.
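One way to see why the band axis being unchunked matters: mapping a pixel to its inner chunk never splits across bands, so one chunk read returns the whole embedding vector. The helper below is illustrative, not part of any library:

```python
# Which inner chunk holds a pixel, given chunk shape (1, 128, 32, 32)?
# The band axis is unchunked, so a single chunk read returns the whole
# 128-dim embedding vector for that pixel (helper is illustrative).
def chunk_index(t, y, x, chunk=(1, 128, 32, 32)):
    return (t // chunk[0], 0, y // chunk[2], x // chunk[3])

print(chunk_index(3, 100, 4097))  # (3, 0, 3, 128)
```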

I've also rebalanced the shard/chunk sizes: shards grow from the original 256×256 to 4096×4096, with a slightly larger inner chunk of (1, 128, 32, 32). A single pixel still needs only two HTTP range requests, but each shard now covers a larger area (around 40 km across) with far fewer files on the server. The TZE explorer confirms this still works reasonably well on mobile.
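The back-of-envelope arithmetic behind those numbers, assuming the usual 10 m TESSERA pixel size:

```python
# Shard geometry, assuming 10 m pixels (illustrative arithmetic only).
pixel_m, shard_px, chunk_px = 10, 4096, 32

shard_km = shard_px * pixel_m / 1000            # ~41 km per shard side
chunks_per_shard = (shard_px // chunk_px) ** 2  # inner chunks per shard
print(shard_km, chunks_per_shard)
```

So each shard file indexes 128 × 128 = 16,384 inner chunks, and a pixel fetch is one range request into the shard index plus one into the chunk itself.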

Mark Elvers is installing SSDs today by the bushel so we can keep up with all the storage requirements here. We do need cloud hosting for all this Zarr as we're over a petabyte raw now in the University and it's nervous work...

2 New geotessera tooling around Zarr

All this also means that my hard work on geotessera will soon be obsolete! A quickstart to grab embeddings using just xarray/Zarr from Python looks like:

# params: args.store (URL or path), zone, year (int), x, y
import numpy as np
import xarray as xr

ds = xr.open_zarr(args.store, group=f"utm{zone:02d}",
       zarr_format=3, consolidated=True,
       chunks={"time": 1, "band": 128, "y": 4096, "x": 4096})
pixel = ds.sel(time=year, x=x, y=y, method="nearest")
emb_int8 = pixel["embeddings"].values
scale = float(pixel["scales"].values)
embedding = emb_int8.astype(np.float32) * scale
for i in range(min(16, len(embedding))):
    print(f"{i:4d}  {int(emb_int8[i]):5d}  {embedding[i]:10.6f}")

A few things of note here: the coordinate arrays let you index into the Zarr store using the natural year integer and the UTM x/y coordinates; after that you dequantise using the per-pixel scales array, and the named dimensions come back with the result.

There is still some housekeeping for which Geotessera remains useful alongside Zarr. Firstly, it can provide coordinate lookups and a name resolver to find a copy of the embeddings (we are getting offers to mirror and version from several other people, which is great). Secondly, the biggest downside to Zarr is that it has no quick index of sparsity, which means coverage tests can be expensive in terms of HTTP HEAD requests. So I suspect I'll keep the existing GeoParquet tile registry around to provide a quick lookup for clients that want to test large-scale tile existence. However, downloading this index will be optional, so clients can get started immediately without an expensive database check.
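A toy sketch of what a local sparsity index buys a client; the tile keys and helper below are hypothetical stand-ins for the GeoParquet registry, not its real schema:

```python
# Hypothetical in-memory stand-in for the GeoParquet tile registry:
# one local set lookup replaces an HTTP HEAD request per tile probe.
available = {("utm30", 2024, 512, 768), ("utm30", 2025, 512, 768)}

def tile_exists(zone, year, tx, ty):
    return (zone, year, tx, ty) in available

print(tile_exists("utm30", 2025, 512, 768))  # True
print(tile_exists("utm30", 2017, 512, 768))  # False
```

For a large-scale coverage test over thousands of candidate tiles, the difference between thousands of HEAD round-trips and thousands of set lookups is exactly why the registry is worth keeping.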

Using the new wrapper (which I'm tidying up for a release) means you just need to:

from geotessera.store import GeoTesseraZarr

gt = GeoTesseraZarr()  # defaults to public store
emb = gt.sample_at(-2.97, 53.44, year=2025)          # (128,) float32
mosaic, transform = gt.read_region(bbox, year=2025)  # (H, W, 128)

3 From a TESSERA convention to a community one

While I was iterating on the store layout, a Clark University hackathon on geo-embeddings created an embeddings-stac-specification repository to connect STAC with vector embeddings. We had a productive discussion about how TESSERA's QAT-compressed embeddings differ from other models.

There's also another geo-embeddings Zarr convention, which is a model-agnostic standard for describing geospatial embeddings in Zarr stores. It currently has examples for Clay (768-dim chip embeddings on a regular grid) and AEF (64-dim patch embeddings with global quantisation).

To make TESSERA fit into this, I proposed several additions to the convention:

  1. Per-pixel scale quantisation ("method": "per_pixel_scale"). Unlike the other models, TESSERA uses quantisation-aware training with a per-pixel scale array, not a single global scale. The convention therefore now supports a scale_array field pointing to the array name and a nodata field for sentinel values.

  2. UTM zone spatial layout ("spatial_layout": "utm_zones"). In order to minimise distortion due to varying pixel sizes at different points of the earth, TESSERA stores each UTM zone as a separate group rather than reprojecting to a global grid. This is a bit less seamless than the simpler global layout, but prioritises accuracy until we can measure the effect of the distortion. We use Zarr groups for this, so the root store has subgroups named utmXY that the client opens as needed.

  3. Multiple source datasets. TESSERA fuses Sentinel-1 (radar) and Sentinel-2 (optical), so geoemb:source_data now accepts an array of URLs rather than just one.

  4. Build version tracking. We're going to be releasing multiple variations and models of our embeddings as the technology advances, so geoemb:build_version also records which version of the pipeline produced the store. I'm hoping to use this to record end-to-end provenance for every single embedding.
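Pulled together, the proposed attributes might look roughly like the sketch below. Only the field names come from the proposals above; the nesting, URLs, and values are illustrative and subject to the in-flight PRs:

```python
import json

# Illustrative geoemb: attributes; structure and values are assumptions,
# only the field names come from the convention proposals above.
attrs = {
    "geoemb:quantization": {
        "method": "per_pixel_scale",
        "scale_array": "scales",
        "nodata": ["+inf", "NaN"],
    },
    "geoemb:spatial_layout": "utm_zones",
    "geoemb:source_data": [                  # multiple fused sources
        "https://example.org/sentinel-1",    # hypothetical URLs
        "https://example.org/sentinel-2",
    ],
    "geoemb:build_version": "v1.0.0",        # illustrative version string
}
print(json.dumps(attrs, indent=2))
```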

See the TESSERA example JSON for all this in one place. With the geoemb: convention covering everything we need, the zarr-convention-tessera repo can now be retired after a glorious two weeks of life. The Zarr web inspector confirms our live store is compliant, except for those in-flight proposals that still need to be ratified.

4 What's next

Thanks to everyone in the Zarr community for their really useful input. I wouldn't be surprised if we need to tweak my (newbie) layout a bit more yet, but comments are welcome on my PR. Once things settle, I'm back to working on the OxCaml Zarr library to add the sharding support I need.

Also to throw a bit of a clanker into my delicate Zarr work, we're working on a revision of the TESSERA model that supports Matryoshka representations, which means I need to figure out how to support slicing by dimension as well. After all this careful work in v1.0 to make the embeddings contiguous in memory, we will soon need to be able to get just the first few dimensions out of 128 to do quick sample analyses. Code doesn't survive very long any more with all the progress in training!

The live Zarr v3 TESSERA store is browsable now at the TZE explorer, and the embeddings convention is open for feedback at geo-embeddings/embeddings-zarr-convention.

References

[1] Feng et al. (2025). TESSERA: Temporal Embeddings of Surface Spectra for Earth Representation and Analysis. arXiv. 10.48550/arXiv.2506.20380
[2] Madhavapeddy (2026). Streaming millions of TESSERA tiles over HTTP with Zarr v3. 10.59350/tk0er-ycs46
[3] Madhavapeddy (2025). Four Ps for Building Massive Collective Knowledge Systems. 10.59350/418q4-gng78
[4] Madhavapeddy (2025). GeoTessera Python library released for geospatial embeddings. 10.59350/7hy6m-1rq76
[5] Kusupati et al. (2024). Matryoshka Representation Learning. arXiv. 10.48550/arXiv.2205.13147