Streaming millions of TESSERA tiles over HTTP with Zarr v3

Anil Madhavapeddy

doi:10.59350/tk0er-ycs46


#tessera #spatial #zarr #ai #satellite · 14 Mar 2026

How we restructured TESSERA's geospatial embeddings from millions of individual numpy files into sharded Zarr v3 stores for efficient HTTP streaming, enabling everything from single-pixel mobile lookups to regional-scale analysis with just a couple of range requests.

I've been working on making TESSERA map embeddings even easier to retrieve, so that we can build dynamic user interfaces in the browser or on mobile phones.

When we first released GeoTessera last year, every 0.1° tile was a pair of numpy files: one quantized embedding and one scale array. That worked fine for grabbing a few tiles at a time, but our push for global coverage from 2017-2025 is producing around 1.8 million tiles per year, each weighing in at around 150MB! Serving these over HTTP means millions of small directories on disk, and every client that wants a contiguous region needs to discover, fetch and stitch dozens of them, with a minimum download of 150MB.

To fix this, we need to rethink the structure of the storage entirely: it's quite tricky to support both a small download (e.g. from a mobile phone) and a large regional read from a cloud provider. Luckily, there's a cloud-native streaming format in town that's just the ticket, known as Zarr. Since the GeoTessera 0.7 release where we first added basic Zarr support, I've been working on consolidating all our tiles into a single sharded Zarr v3 store per year.

This post explains the TESSERA Zarr conventions proposal and why the chunking size choices matter. I'd also love to get feedback from experienced geospatial gurus, so this post is also an RFC of sorts.

1 Why Zarr v3?

Zarr is a format for large N-dimensional typed arrays designed for cloud object stores. It's great because it allows multidimensional arrays to be accessed via HTTP, meaning that normal S3 or HTTP static servers are sufficient for hosting large datasets.

I built a first prototype a few weeks ago using Zarr v2, and mapped the existing npy tile format we use to it. This collects up batches of 10 m × 10 m pixel embeddings into larger tiles, which can be downloaded as a unit (of around 150MB each). The v3 specification (released last year) brings a couple of important new features to improve this:

Sharding: a single physical file can contain many logical chunks, indexed by an inline index. This means a client can issue one HTTP range request to get the shard index, then a second byte range to get exactly the chunk it needs. Without sharding, every logical chunk would be a separate file, and so reducing our minimum pixel size to save on downloads for small ROIs (e.g. for mobile devices) would be impractical and involve 100s of millions of tiny files.
Codecs formalise the chain of compression, transposition and serialisation applied to each chunk. Sharding is one such codec, and we also use Blosc/Zstd for all arrays, which gives us reasonable compression ratios on the int8 embeddings. We're never going to get amazing compression ratios of the TESSERA embeddings because they are high entropy (we reduce 1000s of dimensions into 128 during the training and inference process), but there's still some win to be had.

The Python zarr library has reasonably solid v3 support now, and in my tests the wider ecosystem such as xarray, dask, rioxarray can all read these v3 stores without issue. So I think we're good to use v3 features now!

2 The store layout

Each year of embeddings gets a single Zarr store. Within it, each UTM zone is a group that contains that particular strip of the planet's embeddings and scales. For clients to visualise what's going on, there's an optional global RGB preview group:

2024.zarr/
    zarr.json                 # root: version, year, conventions
    utm29/                    # one group per UTM zone
        embeddings            # int8    (H, W, 128)
        scales                # float32 (H, W)
        rgb                   # uint8   (H, W, 4)     [optional]
        easting               # float64 (W,)
        northing              # float64 (H,)
        band                  # int32   (128,)
    utm30/
        ...
    global_rgb/               # EPSG:4326 preview pyramid
        0/rgb                 # uint8   (H, W, 4)     level 0
        1/rgb                 #                       level 1
        2/rgb                 #                       level 2
        ...

Using one store per year rather than per zone allows us to use the experimental consolidated metadata feature. A single zarr.consolidate_metadata() call at build time writes the full catalog of zones, their spatial extents, and which arrays exist into the root metadata, so a client can fetch it all in one request. This (I think) eliminates the need for the Parquet registry we currently maintain for TESSERA.
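To make the catalog idea concrete, here is a minimal sketch of how a client might walk the consolidated metadata using only the standard library. It assumes the inline layout that zarr-python currently writes into the root zarr.json (a "consolidated_metadata" object with a "metadata" map keyed by node path); the trimmed document, the "year" attribute key and the catalog() helper are all hypothetical illustrations rather than the real store contents.

```python
import json

# Hypothetical, heavily trimmed root zarr.json; real entries carry full array metadata.
ROOT_ZARR_JSON = """
{
  "zarr_format": 3,
  "node_type": "group",
  "attributes": {"year": 2024},
  "consolidated_metadata": {
    "kind": "inline",
    "must_understand": false,
    "metadata": {
      "utm29": {"zarr_format": 3, "node_type": "group",
                "attributes": {"proj:code": "EPSG:32629"}},
      "utm29/embeddings": {"zarr_format": 3, "node_type": "array"},
      "utm29/scales": {"zarr_format": 3, "node_type": "array"},
      "utm30": {"zarr_format": 3, "node_type": "group",
                "attributes": {"proj:code": "EPSG:32630"}},
      "utm30/embeddings": {"zarr_format": 3, "node_type": "array"}
    }
  }
}
"""


def catalog(root_doc):
    """Return {zone: {"crs": ..., "arrays": [...]}} from consolidated metadata."""
    entries = json.loads(root_doc)["consolidated_metadata"]["metadata"]
    zones = {}
    for path, meta in sorted(entries.items()):
        zone, _, leaf = path.partition("/")
        info = zones.setdefault(zone, {"crs": None, "arrays": []})
        if meta["node_type"] == "group" and not leaf:
            # Zone group: pick up its CRS code from the attributes.
            info["crs"] = meta.get("attributes", {}).get("proj:code")
        elif meta["node_type"] == "array":
            info["arrays"].append(leaf)
    return zones


zones = catalog(ROOT_ZARR_JSON)
```

The point of the sketch is that one small JSON document is enough to answer "which zones and arrays exist?" without listing the object store.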

Like the current npy embeddings, each zone group carries its own CRS to minimise coordinate skew. Each zone has a proj:code attribute (e.g. EPSG:32630) and a spatial:transform giving the affine matrix. Southern hemisphere zones use the canonical northern-hemisphere EPSG code with the 10,000,000m false northing subtracted, so the northing axis is continuous.
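As a rough sketch of how a client could pick the right zone group for a longitude, the standard 6° UTM zone formula is enough. The zero-padded group name and the helper names are assumptions for illustration, and the formula ignores the Norway/Svalbard zone exceptions and expects longitudes in [-180, 180].

```python
def utm_zone(lon):
    """Standard 6-degree UTM zone number (1..60) for a longitude in degrees."""
    return min(int((lon + 180.0) // 6) + 1, 60)


def zone_group(lon):
    """Assumed group name and northern-hemisphere EPSG code for a longitude."""
    zone = utm_zone(lon)
    return f"utm{zone:02d}", f"EPSG:{32600 + zone}"


def continuous_northing(southern_northing_m):
    """Convert a northing from a southern-hemisphere UTM CRS (EPSG:327xx) to the
    northern-hemisphere CRS of the same zone by removing the false northing."""
    return southern_northing_m - 10_000_000.0
```

For example, zone_group(-3.2) gives ("utm30", "EPSG:32630"), and a point 1,000km south of the equator ends up with a northing of -1,000,000m rather than 9,000,000m.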

Coordinate arrays (easting, northing, band) are small 1-D arrays stored alongside the data, so an xarray open_zarr just works with labeled axes. These are labeled as dimension names so that xarray or other clients can pick them up automatically.
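The coordinate arrays can be derived from the affine transform. The sketch below assumes the six coefficients are ordered (a, b, c, d, e, f) with x = a·col + b·row + c and y = d·col + e·row + f, that the grid is unrotated, and that coordinates refer to pixel centres; the grid origin in the usage line is made up.

```python
def axis_coords(transform, height, width):
    """Pixel-centre easting/northing lists from an unrotated affine transform."""
    a, b, c, d, e, f = transform
    if b != 0 or d != 0:
        raise ValueError("rotated grids need 2-D coordinates, not 1-D axes")
    easting = [c + a * (col + 0.5) for col in range(width)]
    northing = [f + e * (row + 0.5) for row in range(height)]
    return easting, northing


# Hypothetical 10 m grid whose top-left corner is at (500000 E, 5800000 N).
easting, northing = axis_coords(
    (10.0, 0.0, 500000.0, 0.0, -10.0, 5800000.0), height=3, width=4
)
```

Because e is negative, northing decreases as the row index increases, which matches the usual north-up raster layout.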

3 Sharding and chunking

The TESSERA embeddings are quite large in aggregate, and that is where most of the design time went. TESSERA clients have three very different access patterns:

A single-pixel lookup or a small region-of-interest means that a user has a lon/lat and wants the 128-d embedding vector at that point. This should be ~2KB over HTTP. This might be a mobile device aiming to do active learning, for example.
A regional subset means that a user wants a spatial rectangle (say, 100 km²) of all 128 bands. This should stream efficiently without reading the whole zone and mosaicing it in memory (a source of memory problems currently). This might be a desktop analysis, or even a satellite scanning a region.
A scan of entire countries to do global analyses, which requires terabytes of downloads to retrieve the full set of embeddings.

We will come back to solve the third 'entire countries' problem later, via a new variant of the model we are training that uses Matryoshka embeddings. The first two pull in opposite directions, yet both are needed: mobile clients want tiny reads, while chunkier desktop analysis tools want large contiguous blocks. Zarr v3 sharding resolves both by letting us pick two sizes:

Shard: 256 × 256 pixels (aligned to tile boundaries)
Chunk: 4 × 4 pixels (inner chunk within each shard)
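For readers who want to see how these two sizes map onto Zarr v3 metadata, here is a sketch of what the array's zarr.json could look like, built as a plain Python dict. The structure follows the v3 sharding codec (the outer chunk grid is the shard, the sharding_indexed codec holds the inner chunk shape), but the specific values are assumptions: the band axis being kept whole in each chunk, the Blosc level and shuffle settings, the fill value, and the example array shape are illustrative rather than the real TESSERA settings.

```python
import json


def embeddings_metadata(height, width, bands=128, shard=256, chunk=4):
    """Zarr v3 array metadata for a sharded int8 embeddings array (illustrative)."""
    return {
        "zarr_format": 3,
        "node_type": "array",
        "shape": [height, width, bands],
        "data_type": "int8",
        # The outer chunk grid is the shard grid: one stored object per block.
        "chunk_grid": {"name": "regular",
                       "configuration": {"chunk_shape": [shard, shard, bands]}},
        "chunk_key_encoding": {"name": "default",
                               "configuration": {"separator": "/"}},
        "fill_value": 0,
        "dimension_names": ["northing", "easting", "band"],
        "codecs": [{
            "name": "sharding_indexed",
            "configuration": {
                # Inner chunks are the unit fetched by a single range request.
                "chunk_shape": [chunk, chunk, bands],
                "codecs": [
                    {"name": "bytes"},
                    {"name": "blosc", "configuration": {
                        "cname": "zstd", "clevel": 5, "shuffle": "noshuffle",
                        "typesize": 1, "blocksize": 0}},
                ],
                "index_codecs": [
                    {"name": "bytes", "configuration": {"endian": "little"}},
                    {"name": "crc32c"},
                ],
                "index_location": "end",
            },
        }],
        "attributes": {},
    }


meta = embeddings_metadata(height=1024, width=2048)
inner_per_shard = (256 // 4) ** 2
print(json.dumps(meta["chunk_grid"]), inner_per_shard)
```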

Each shard is a single file on disk or object in S3, containing a grid of 64×64 inner chunks plus a ~64KB shard index at the end (4,096 entries, each a pair of 64-bit integers giving a chunk's offset and length). To read a single pixel the client has to:

Fetch the shard index via one HTTP range request of ~64KB, which is cacheable.
Compute which 4×4 inner chunk contains the pixel.
Fetch that chunk with another HTTP range request that's ~2KB before compression (4×4 pixels × 128 int8 values), and can reuse the previous HTTP connection via keep-alive.
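The steps above can be sketched with nothing but integer arithmetic and struct. This assumes the Zarr v3 sharding layout with the index stored at the end of the shard as little-endian (offset, nbytes) uint64 pairs in C order, followed by a 4-byte crc32c checksum; the helper names are hypothetical, no network calls are made, and checksum verification is deliberately skipped.

```python
import struct

SHARD = 256                          # pixels per shard edge
CHUNK = 4                            # pixels per inner chunk edge
GRID = SHARD // CHUNK                # 64 inner chunks per shard edge
INDEX_BYTES = GRID * GRID * 16 + 4   # uint64 pairs plus crc32c checksum
EMPTY = 2**64 - 1                    # sentinel used for absent chunks


def locate(row, col):
    """Map a pixel (row, col) in a zone array to shard, inner chunk and in-chunk offsets."""
    shard = (row // SHARD, col // SHARD)
    inner = ((row % SHARD) // CHUNK, (col % SHARD) // CHUNK)
    within = (row % CHUNK, col % CHUNK)
    return shard, inner, within


def index_range_header():
    """Suffix range that fetches only the trailing shard index."""
    return f"bytes=-{INDEX_BYTES}"


def chunk_range_header(index, inner):
    """Decode one index entry; return the Range header for that chunk, or None if absent."""
    i, j = inner
    offset, nbytes = struct.unpack_from("<QQ", index, (i * GRID + j) * 16)
    if offset == EMPTY and nbytes == EMPTY:
        return None
    return f"bytes={offset}-{offset + nbytes - 1}"


# Fake index with every chunk absent except inner chunk (3, 5).
fake_index = bytearray(b"\xff" * INDEX_BYTES)
struct.pack_into("<QQ", fake_index, (3 * GRID + 5) * 16, 1000, 2048)

shard, inner, within = locate(row=268, col=532)
first_request = index_range_header()
second_request = chunk_range_header(fake_index, inner)
```

A client would send first_request against the shard object, cache the returned index, and then send second_request for each chunk it needs from that shard.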

For a single pixel read, there's a bit of extra overhead from the index, but tolerable. For a regional read, the client fetches whole shards and gets contiguous 256×256 blocks, which is efficient for downstream processing.

Another neat thing about Zarr is that we can have multiple data type arrays. TESSERA uses a quantisation trick to compress the embeddings with a 'scale' array, which is a float32 held alongside the 128-dimension int8 values. For this array we also use the same 256/4 sharding, and also signal that there's no data via a NaN scale. This lets us skip the need for the landmask TIFFs we currently maintain.
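A minimal sketch of how a client might combine the two arrays for one pixel, assuming dequantisation is a plain multiplication of the int8 values by that pixel's float32 scale, and that a NaN scale means the pixel has no data. The function name is hypothetical.

```python
import math


def dequantise(quantised, scale):
    """Return the float embedding for one pixel, or None where the scale marks no data."""
    if math.isnan(scale):
        return None
    return [q * scale for q in quantised]


print(dequantise([-128, 0, 127], 0.5))
print(dequantise([1, 2, 3], float("nan")))
```

Checking the scale first means a client can skip fetching the much larger embeddings chunk entirely for pixels over open ocean or other masked areas.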

After this the global RGB preview is plain sailing as it's more like a conventional visual map tile, and uses plain 512×512 chunks with no sharding since the pyramid levels get small quickly and the access pattern is always tile-aligned for map rendering. These previews can also be reprojected by the client for dynamic maps.

4 GeoZarr conventions

The GeoZarr spec is still under active development, and has a conventions mechanism where stores declare which metadata schemas they follow. We use three:

proj: is the CRS information formerly held in our landmask TIFFs. Each zone group carries a proj:code (e.g. "EPSG:32630") and proj:wkt2 for the full WKT2 string.
spatial: is the affine spatial coordinate transforms. Each zone group has spatial:transform (the 6-element affine), spatial:dimensions, spatial:shape, spatial:bbox and spatial:registration. This can be calculated fairly easily from the CRS, but included here so that clients that know the spatial Zarr convention can just query this and use it directly.
multiscales is the pyramid layout for the global preview, compatible with the approach used by topozarr.
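To illustrate how the proj: and spatial: keys fit together on a zone group, here is a sketch that derives the bounding box from the transform and shape. The attribute names come from the list above, but the value formats are assumptions (in particular the [xmin, ymin, xmax, ymax] bbox order, the "pixel" registration value and the dimension names), and the grid in the usage line is made up.

```python
def zone_attributes(epsg, transform, height, width):
    """Assemble illustrative zone-group attributes for an unrotated grid."""
    a, b, c, d, e, f = transform
    if b != 0 or d != 0:
        raise ValueError("this sketch only handles unrotated grids")
    xs = [c, c + a * width]      # left and right edges
    ys = [f, f + e * height]     # top and bottom edges
    return {
        "proj:code": f"EPSG:{epsg}",
        "spatial:dimensions": ["northing", "easting"],
        "spatial:transform": list(transform),
        "spatial:shape": [height, width],
        "spatial:bbox": [min(xs), min(ys), max(xs), max(ys)],
        "spatial:registration": "pixel",
    }


attrs = zone_attributes(
    32630, (10.0, 0.0, 500000.0, 0.0, -10.0, 5800000.0), height=256, width=512
)
```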

These conventions are registered in the root zarr.json attributes as an array, following the ZEP for conventions. This makes the stores more self-describing as any Zarr-aware tool can read the conventions list and know what metadata keys to expect.

In order to join the Zarr specification party, I've created zarr-convention-tessera to crystallise the conventions I've used in TESSERA, such as the UTM zone splitting and quantisation bands. Once we're happy with this format, existing libraries like geotessera and also the upcoming OCaml geotessera can all switch to Zarr streaming instead.

The first TESSERA zarr prototype showing the multiscale pyramid. The vertical lines in the map are debug markers to delimit UTM zones. Conveniently Cambridge is split right down the middle of two!

5 Building the TESSERA Zarr stores

The geotessera-registry CLI now has commands in development for this pipeline. It's unfortunately a very computationally heavy job, since the input npy tiles have to be rearranged and rewritten one by one into the Zarr format, and then the RGB pyramids calculated. This is reasonable to parallelise, but we're a little stuck on our university storage cluster due to relatively slow network interconnects at the moment.

Figuring out this conversion bottleneck is top of my list next week; in particular, if you have any leads on a cloud storage provider that may like to sponsor a petabyte or two of S3 storage, I'm all ears!

One other useful thing is that we generate a STAC catalog that provides a standards-compliant discovery layer: one STAC collection per year. This lets us use tile servers and hopefully eventually STAC-D pipelines to integrate these into planetary computing pipelines.

6 What's next

We won't stop serving the npy files for some time, since we have a number of users already committed to those, and that workflow is fine for regional analysis. However, I'm keen to unlock mobile workflows as there's a lot of demand for this (especially after the TESSERA hackathon in Delhi), so we'll push forward with Zarr. In particular thank you to Deepak Cherian for giving me lots of Zarr advice on our Zulip channel.

While this spec is out for review, here's a sneak peek of a TESSERA Zarr web viewer that reads directly from the Zarr stores into the browser, with no server required. I'm also working on an access library in OxCaml using Mark Elvers' OCaml Zarr library so that we can use these from our native pipeline too. This would also make it much easier to integrate TESSERA into the biodiversity monitoring standards framework that we've been working on.

I also discovered that there had been a very relevant vector embeddings hackathon held a few days ago at Clark University. They came up with a STAC for embeddings proposal that I've left a query on as well, to make sure our work is compatible.

References

[1] Madhavapeddy (2026). Connecting the dots for biodiversity action from the NAS/Royal Society Forum. doi:10.59350/dy7d3-hdt43

[2] Madhavapeddy (2026). .plan-26-10: Streaming TESSERA working, biodiversity action papers, and FPL takes off. doi:10.59350/re0zy-3rt26

[3] Madhavapeddy (2026). 1st TESSERA/CoRE hackathon at the Indian AI Summit. doi:10.59350/1na80-7ak85

[4] Madhavapeddy (2025). GeoTessera 0.7 out with efficient sampling and Zarr support. doi:10.59350/nagwp-tnw89

[5] Madhavapeddy (2025). GeoTessera Python library released for geospatial embeddings. doi:10.59350/7hy6m-1rq76

[6] Laud et al. (2025). STACD: STAC Extension with DAGs for Geospatial Data and Algorithm Management. doi:10.1145/3759536.3763803
