Autoscaling geospatial computation with Python and Yirgacheffe / Apr 2025
This is an idea proposed in 2025 as a good starter project, and is available for being worked on. It may be co-supervised with Michael Dales.
Python is a popular tool for geospatial data science, but it handles resource management poorly, as does the GDAL library. Python does not deal with parallelism well and GDAL can be a memory hog when parallelised. Geospatial workloads -- working on global maps at metre-level resolutions -- can easily exceed the resources available on a given host when run using conventional schedulers.
To that end, we've been building Yirgacheffe, a geospatial library for Python that attempts both to hide the tedious parts of geospatial work (aligning different data sources, for instance) and to tackle the resource management issues, so that ecologists don't have to also become computer scientists to scale their work. Yirgacheffe can:
- chunk data in memory automatically, to avoid common issues around memory overcommitment
- perform limited forms of parallelism to use multiple cores.
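The chunking idea can be illustrated without any geospatial library. The sketch below is not Yirgacheffe's API: `iter_chunks`, `chunked_sum`, `read_rows` and the synthetic raster are hypothetical names invented for illustration. It reduces a raster a fixed number of rows at a time, so peak memory is bounded by the chunk size rather than by the full map.

```python
from typing import Callable, Iterator, List

Row = List[float]


def iter_chunks(height: int, rows_per_chunk: int) -> Iterator[range]:
    """Yield row ranges that together cover [0, height) without overlap."""
    if rows_per_chunk <= 0:
        raise ValueError("rows_per_chunk must be positive")
    for start in range(0, height, rows_per_chunk):
        yield range(start, min(start + rows_per_chunk, height))


def chunked_sum(
    read_rows: Callable[[range], List[Row]],
    height: int,
    rows_per_chunk: int,
) -> float:
    """Sum every pixel while holding only one strip of rows at a time."""
    total = 0.0
    for rows in iter_chunks(height, rows_per_chunk):
        block = read_rows(rows)  # only this strip is resident in memory
        total += sum(sum(row) for row in block)
    return total


# Synthetic stand-in for a raster on disk (an assumption for this sketch):
# the pixel at (y, x) has value y * WIDTH + x.
WIDTH, HEIGHT = 4, 10


def read_rows(rows: range) -> List[Row]:
    """Hypothetical reader that materialises only the requested rows."""
    return [[float(y * WIDTH + x) for x in range(WIDTH)] for y in rows]


result = chunked_sum(read_rows, HEIGHT, rows_per_chunk=3)
```

The result does not depend on `rows_per_chunk`; only the peak memory use does, which is the property that lets a library pick the chunk size on the user's behalf.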
Yirgacheffe has been deployed in multiple geospatial pipelines, underpinning work like Mapping LIFE on Earth, as well as an implementation of the IUCN STAR metric, and a methodology for assessing tropical forest interventions. […453 words]
An access library for the world crop, food production and consumption datasets / Apr 2025
This is an idea proposed in 2025 as a good starter project, and is available for being worked on. It may be co-supervised with Alison Eyres and Thomas Ball.
Agricultural habitat degradation is a leading threat to global biodiversity. To make informed decisions, it's crucial to understand the biodiversity impacts of various foods, their origins, and potential mitigation strategies. Insights can drive actions from national policies to individual dietary choices. Key factors include knowing where crops are grown, their yields, and food sourcing by country.
The FAOSTAT trade data offers comprehensive import and export records since 1986, but its raw form is complex, including double counting, hindering the link between production and consumption. […372 words]
A hardware description language using OCaml effects / Mar 2025
This is an idea proposed in 2025 as a Cambridge Computer Science Part III or MPhil project, and is available for being worked on. It may be co-supervised with KC Sivaramakrishnan and Andy Ray.
Programming FPGAs using functional programming languages is a very good fit for the problem domain. OCaml has the HardCaml ecosystem to express hardware designs in OCaml, make generic designs using the power of the language, then simulate designs and convert them to Verilog or VHDL.
HardCaml is very successfully used in production at places like Jane Street, but needs quite a lot of prerequisite knowledge about the full OCaml language. In particular, it makes very heavy use of the module system in order to build up the circuit description as an OCaml data structure.
Instead of building up a circuit as the output of the OCaml program, it would be very cool if we could directly implement the circuit as OCaml code by evaluating it. This is an approach that works very successfully in the Clash Haskell HDL, as described in this thesis. Clash uses a number of advanced Haskell type-level features to encode fixed-length vectors (very convenient for hardware description) and has an interactive REPL that allows for exploration without requiring a separate test bench. […296 words]
Using computational SSDs for vector databases / Feb 2025
This is an idea proposed in 2025 as a Cambridge Computer Science Part III or MPhil project, and is available for being worked on. It may be co-supervised with Sadiq Jaffer.
Large pre-trained models can be used to embed media/documents into concise vector representations with the property that vectors that are "close" to each other are semantically related. ANN (Approximate Nearest Neighbour) search on these embeddings is used heavily already in RAG systems for LLMs or search-by-example for satellite imagery.
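To make "close" concrete, here is a minimal sketch of exact nearest-neighbour search by cosine similarity, which is the brute-force baseline that ANN indexes approximate. The three-dimensional embeddings and the tile names are made up for illustration; real embeddings have hundreds or thousands of dimensions and come from a pre-trained model.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def nearest(query: Vector, index: Dict[str, Vector], k: int) -> List[Tuple[str, float]]:
    """Exact top-k search: score every stored vector, then sort by similarity."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in index.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Made-up embeddings (an assumption for this sketch, not real model output).
index = {
    "forest_tile": [0.9, 0.1, 0.0],
    "woodland_tile": [0.8, 0.2, 0.1],
    "urban_tile": [0.0, 0.1, 0.9],
}

top = nearest([1.0, 0.0, 0.0], index, k=2)
```

This exact scan touches every vector on each query, so its cost grows linearly with the collection size; ANN indexes trade a little recall for avoiding that full scan.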
Right now, most ANN databases almost exclusively use memory-resident indexes to accelerate this searching. This is a showstopper for larger datasets, such as the terabytes of PDFs we have for our big evidence synthesis project, each of which generates dozens of embeddings. For global satellite datasets for remote sensing of nature at 10m scale this is easily petabytes per year (the raw data here would need to come from tape drives). […398 words]
Affordable digitisation of insect collections using photogrammetry / Feb 2025
This is an idea proposed in 2025 as a Cambridge Computer Science Part III or MPhil project, and is currently being worked on by Beatrice Spence and Arissa-Elena Rotunjanu. It is co-supervised with Tiffany Ki and Edgar Turner.
Insects dominate animal biodiversity and are sometimes called "the little things that run the world". They play a disproportionate role in ecosystem functioning, are highly sensitive to environmental change and are often considered to be early indicators of responses in other taxa. There is widespread concern about global insect declines[^1], yet the evidence behind such declines is highly biased towards the Global North and much is drawn from short-term biodiversity datasets[^2] [^3].
The Insect Collection at the University Museum of Zoology, Cambridge holds over 1.2 million specimens. These include specimens collected from the early 19th century to the present day. Most specimens remain undocumented and unavailable for analysis. However, they contain data that are critical to understanding long-term species and community responses to anthropogenic change, and vital to evaluating whether short-term declines are representative of longer-term trends[^4] [^5]. As such, unlocking these insect collections is of paramount importance, and the large-scale nature of these collections necessitates the development of an efficient and effective digitisation process.
The 3D digitisation of specimens using current methods is either highly time-intensive or expensive, rendering it impossible to achieve across the collection in a reasonable time-frame. Yet, 3D models of specimens have huge potential for investigating species morphological responses to anthropogenic changes over time and identification of trade-offs in morphological responses within a 3D morphospace. […540 words]
Parallel traversal effect handlers for OCaml / Sep 2024
This is an idea proposed in 2024 as a Cambridge Computer Science Part II project, and is currently being worked on by Sky Batchelor. It is co-supervised with Patrick Ferris.
Most existing uses of effect handlers perform synchronous execution of handled effects. Xie et al. proposed a traverse handler for parallelisation of independent effectful computations whose effect handlers are outside the parallel part of the program. The paper [^1] gives a sample implementation as a Haskell library with an associated λp calculus that formalises the parallel handlers. […162 words]
Gradually debugging type errors / Sep 2024
This is an idea proposed in 2024 as a Cambridge Computer Science Part II project, and is currently being worked on by Max Carroll. It is co-supervised with Patrick Ferris.
Reasoning about type errors is very difficult, and requires shifting between static and dynamic types. In OCaml, the type checker asserts ill-typedness but provides little in the way of understanding why the type checker inferred such types. These direct error messages are difficult to understand even for experienced programmers working on larger codebases.
This project will explore how to use gradual types to reason more effectively about such ill-typed programs, by introducing more dynamic types to help some users build an intuition about the problem in their code. The intention is to enable a more exploratory approach to constructing well-typed programs. […131 words]
Using wasm to locally explore geospatial layers / Aug 2024
This is an idea proposed in 2024 as a Cambridge Computer Science Part II project, and is currently being worked on by Sam Forbes. It is co-supervised with Michael Dales.
Some of my projects like Mapping LIFE on Earth or Remote Sensing of Nature involve geospatial base maps with gigabytes or even terabytes of data. This data is usually split up into multiple GeoTIFFs, each of which has a slice of information. For example, the LIFE persistence maps have around 30000 maps for individual species, and then an aggregated GeoTIFF for mammals, birds, reptiles and so forth.
This project will explore how to build a WebAssembly-based visualisation tool for geospatial ecology data. This existing data is in the form of GeoTIFF files, which are image files with embedded georeferencing information. The application will be applied to files which include information on the prevalence of species in an area, consisting of a global map at 100 m² scale. An existing tool, QGIS, allows ecologists to visualise this data across the entire world, collated by types of species, but this is difficult to work with because of the scale of the data involved. […341 words]
Towards reproducible URLs with provenance / Aug 2024
This is an idea proposed in 2024 as a Cambridge Computer Science Part II project, and is available for being worked on. It may be co-supervised with Patrick Ferris.
Vurls are an attempt to add versioning to URI resolution. For example, what should happen when we request https://doi.org/10.1109/SASOW.2012.14
and how do we track the chain of events that leads to an answer coming back? The prototype vurl library written in OCaml outputs the following: […323 words]
Real-time mapping of changes in species extinction risks / Aug 2024
This is an idea proposed in 2024 as a Cambridge Computer Science PhD topic, and is currently being worked on by Emilio Luz-Ricca. It is co-supervised with Andrew Balmford.
Loss of habitat represents the most significant threat to wildlife overall, but advances in satellite sensing have enabled the assessment of habitat extent with comprehensive spatial coverage and reasonable temporal resolution. To address rising demand for metrics to quantify biodiversity, we have developed the LIFE metric (see Mapping LIFE on Earth) that models the effect of land-use changes on species extinction risk as a function of Areas of Habitat (AoH).
This PhD work explores how to deal with the anthropogenic threats beyond simple habitat loss, including hunting, agricultural practices, and the introduction of invasive species. These additional threatening processes degrade habitat quality and lower species occupancy, but are extremely difficult to observe directly via remote sensing. This project will therefore involve a combination of modelling, machine learning and remote sensing data analysis to understand the impact of these additional anthropogenic threats on habitat quality on a per-species basis.
Mapping hunting risks for wild meat in protected areas / Aug 2024
This is an idea proposed in 2024 as a postdoctoral project, and is currently being worked on by Charles Emogor. It is co-supervised with Milind Tambe.
There is an important balance needed between the biodiversity damage caused by hunting in protected areas and the well-being of local communities that depend on it. One understudied driver of overly damaging hunting in these areas is snaring (as opposed to gun hunting), which potentially increases carcass wastage and hence causes biodiversity harm without proportionate benefit to the community.
This project examines how to improve the efficacy of anti-poaching ranger patrols while also plugging the knowledge gap around wild meat snaring. Both of these research topics can be tackled in a new light with the emergence of machine learning as a data-driven approach to deriving insights from sparse data, and particularly from some of the newer base maps being developed in our Mapping LIFE on Earth project.
Implementing a higher-order choreographic language / Aug 2024
This is an idea proposed in 2024 as a Cambridge Computer Science Part II project, and has been completed by Rokas Urbonas. It was co-supervised with Dmitrij Szamozvancev.
This project aims to implement a functional choreographic language inspired by the Pirouette calculus. This language was meant to make the notoriously difficult process of implementing distributed algorithms easier, while offering a practical execution model for multi-participant programs. Additionally, it aimed to match the expressiveness and performance of similar existing solutions.
The project completed very successfully, and resulted in ChorCaml, an embedded DSL for choreographic programming in OCaml. The language facilitates the implementation of distributed algorithms, while offering a clear syntax and safety via the type system. ChorCaml also improves upon existing alternatives in certain common use cases, both in terms of program conciseness and performance. The practicality of the DSL was verified by successfully implementing well-known distributed algorithms such as Diffie-Hellman key exchange and concurrent Karatsuba fast integer multiplication. […163 words]
Foundation models for complex geospatial tasks / Aug 2024
This is an idea proposed in 2024 as a Cambridge Computer Science PhD topic, and is currently being worked on by Onkar Gulati. It is co-supervised with Sadiq Jaffer and David A Coomes.
Self-supervised learning (SSL) represents a shift in machine learning that enables versatile pretrained models to leverage the complex relationships present in dense (oftentimes multispectral and multimodal) remote sensing data. This in turn can accelerate how we address sophisticated downstream geospatial tasks for which current methodologies prove insufficient, ranging from land cover classification to urban building segmentation to crop yield measurement and wildfire forecasting.
This PhD project explores the question of how current SSL methodologies may be altered to tackle remote sensing tasks, and also how to make them amenable to incremental time-series generation as new data regularly comes in from sensing instruments.
An imperative, pure and effective specification language / Aug 2024
This is an idea proposed in 2024 as a Cambridge Computer Science Part II project, and is currently being worked on by Max Smith. It is co-supervised with Patrick Ferris.
Formal specification languages are conventionally rather functional looking, and not hugely amenable to iterative development. In contrast, real world specifications for geospatial algorithms tend to be developed with "holes" in the logic which are then filled in by a domain expert as they explore the datasets through small pieces of exploratory code and visualisations.
This project seeks to investigate the design of a specification language that looks and feels like Python, but that supports typed holes and the robust semantic foundations of a typed functional language under the hood. The language would have a Python syntax, with the familiar imperative core, but translate it into Hazel code behind the scenes. […217 words]
Low-latency wayland compositor in OCaml / May 2024
This is an idea proposed in 2024 as a Cambridge Computer Science Part II project, and is currently being worked on by Tom Thorogood. It is co-supervised with Ryan Gibb.
When building situated displays and hybrid streaming systems, we need fine-grained composition over what to show on the displays. Wayland is a communications protocol for next-generation display servers used in Unix-like systems.[^0]
It has been adopted as the default display server by Linux distributions including Fedora with KDE, and Ubuntu and Debian with GNOME. It aims to replace the venerable X display server with a modern alternative. X leaves logic such as window management to application software, which has allowed the proliferation of different approaches. Wayland, however, centralizes all this logic in the 'compositor', which assumes both display server and window manager roles.[^1] […267 words]
Displaying the 15 most recent items out of 62 in total.