Autoscaling geospatial computation with Python and Yirgacheffe
This is an idea proposed in 2025 as a good starter project, and is available to be worked on. It may be co-supervised with Michael Dales.
Python is a popular tool for geospatial data science, but both it and the GDAL library handle resource management poorly: Python does not deal with parallelism well, and GDAL can be a memory hog when parallelised. Geospatial workloads -- working on global maps at metre-level resolution -- can easily exceed the resources available on a given host when run using conventional schedulers.
To that end, we've been building Yirgacheffe, a geospatial library for Python that attempts both to hide the tedious parts of geospatial work (aligning different data sources, for instance) and to tackle the resource-management issues, so that ecologists don't also have to become computer scientists to scale their work. Yirgacheffe can:
- chunk data in memory automatically, avoiding common issues around memory overcommitment (as the sketch below illustrates)
- run limited forms of parallelism to make use of multiple cores
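To give a flavour, here is a minimal sketch in the style of the examples from Yirgacheffe's documentation -- the file names are invented, and exact method names may differ between library versions:

```python
from yirgacheffe.layers import RasterLayer

# Open two rasters; Yirgacheffe takes care of aligning their
# extents, so we don't need matching pixel windows up front.
elevation = RasterLayer.layer_from_file("elevation.tif")
mask = RasterLayer.layer_from_file("mask.tif")

# Arithmetic on layers is lazy: no pixels are read yet.
masked = elevation * mask

# Saving drives the computation, which proceeds in memory-sized
# chunks rather than by loading entire rasters at once.
result = RasterLayer.empty_raster_layer_like(elevation, filename="result.tif")
masked.save(result)
```

The chunking and (optional) parallelism happen behind this declarative interface, which is what makes autoscaling plausible: the library, rather than each user's script, decides how much data to hold in memory at once.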
Yirgacheffe has been deployed in multiple geospatial pipelines, underpinning work such as Mapping LIFE on Earth, an implementation of the IUCN STAR metric, and a methodology for assessing tropical forest interventions.
The summer project
Whilst Yirgacheffe solves some of the resource-management problems involved in geospatial coding, it does so conservatively and statically. It does not currently assess the state of the host on which it is running: how much memory or how many CPU cores are free? How much memory is each thread using? How should it react if someone else fires up a big job on the same machine?
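The first of those questions can already be answered from stock Python. As a rough illustration of the signals involved (psutil is our choice for this sketch, not a current Yirgacheffe dependency):

```python
import os

import psutil

# The kinds of signals an autoscaling Yirgacheffe could read before
# deciding how many workers to start.

# Cores this process is allowed to use; on a shared or cgroup-limited
# server this can be fewer than os.cpu_count() reports. (Linux only.)
usable_cores = len(os.sched_getaffinity(0))

# "available" estimates memory claimable without pushing the system
# into swap, which is more useful than "free" on Linux.
available_bytes = psutil.virtual_memory().available

# System-wide CPU load, sampled over half a second.
cpu_busy_percent = psutil.cpu_percent(interval=0.5)

print(f"{usable_cores} usable cores, "
      f"{available_bytes / 2**30:.1f} GiB available, "
      f"CPU {cpu_busy_percent:.0f}% busy")
```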
If it gets this wrong and overcommits resources, then the dreaded Linux OOM killer can (at best) take down your job or (at worst) take down the entire system, including other users' work. We therefore want Yirgacheffe to be cleverer about scaling up its resource usage on a large host without compromising overall system stability.
In this project we'd like to:
- Add the ability to better estimate how much memory and how many CPU cores are free at start of day, to set sensible defaults rather than the current highly conservative estimates
- Add the ability to adjust those values in reaction to the current machine state (one hypothetical shape for this is sketched after this list)
- Demonstrate that this works by applying it to one of the existing pipelines and showing better resource utilisation on a big but busy compute server (you get to play with 256-core hosts with a terabyte of RAM!)
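To make the second bullet concrete, one possible shape for the adjustment is a feedback loop that grows the worker pool while memory headroom allows and backs off under pressure. Everything below -- the function, the thresholds, the per-worker memory figure -- is an invented illustration, not existing Yirgacheffe behaviour:

```python
import os
import time

import psutil

MIN_WORKERS = 1


def choose_workers(current: int, per_worker_bytes: int) -> int:
    """Grow towards the usable core count while memory headroom
    allows; back off quickly under memory pressure."""
    mem = psutil.virtual_memory()
    # Cores this process may actually use (respects taskset/cgroups).
    max_workers = len(os.sched_getaffinity(0))

    if mem.available < per_worker_bytes:
        # Another job may have claimed the memory we were counting
        # on: halve the pool rather than risk the OOM killer.
        return max(MIN_WORKERS, current // 2)
    if mem.available > 2 * per_worker_bytes and current < max_workers:
        # Comfortable headroom: probe upwards one worker at a time.
        return current + 1
    return current


if __name__ == "__main__":
    workers = MIN_WORKERS
    per_worker = 512 * 2**20  # invented figure: ~512 MiB per worker
    for _ in range(5):
        workers = choose_workers(workers, per_worker)
        print(f"running with {workers} workers")
        time.sleep(1)
```

Growing additively while backing off multiplicatively mirrors TCP-style congestion control: the loop reacts sharply to contention (before the OOM killer does) while probing spare capacity gently.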
This would be a good summer project for a student interested in both operating systems and scientific computing, who wants to help enable real sustainability and environmental research.
For background reading:
- Michael Dales blogs about Yirgacheffe
- A Future of Coding thread with some discussion
You can also watch a talk from Michael Dales at LOCO24 (slightly tangential, but on the same topic of geospatial processing).
Related News
- PACT Tropical Moist Forest Accreditation Methodology v2.1 / Aug 2024
- Mapping LIFE on Earth / Jan 2023
- Planetary Computing / Jan 2022