Using computational SSDs for vector databases / Jan 2025
This is an idea proposed as a Cambridge Computer Science Part III or MPhil project, and is available for being worked on. It will be supervised by Anil Madhavapeddy and Sadiq Jaffer.
Large pre-trained models can be used to embed media/documents into concise vector representations with the property that vectors that are "close" to each other are semantically related. ANN (Approximate Nearest Neighbour) search on these embeddings is used heavily already in RAG systems for LLMs or search-by-example for satellite imagery.
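As a rough illustration of what "close" means here, the following minimal sketch performs an exact (brute-force) nearest-neighbour search by cosine similarity. An ANN index approximates this result without scanning every vector. The tile names and three-dimensional vectors are made up for illustration; real embeddings typically have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_neighbours(query, embeddings, k=2):
    """Return the names of the k embeddings most similar to the query.

    This scans every vector, so the cost grows linearly with the dataset;
    ANN indexes trade a little accuracy to avoid that full scan.
    """
    scored = [(cosine_similarity(query, vec), name)
              for name, vec in embeddings.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]

# Hypothetical toy embeddings, not taken from any real model.
embeddings = {
    "forest_tile": [0.9, 0.1, 0.0],
    "woodland_tile": [0.8, 0.2, 0.1],
    "urban_tile": [0.0, 0.1, 0.9],
}

print(nearest_neighbours([1.0, 0.0, 0.0], embeddings, k=2))
```

The two forest-like tiles rank above the urban tile because their vectors point in nearly the same direction as the query.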
Right now, most ANN databases rely on memory-resident indexes to accelerate this search. This is a showstopper for larger datasets, such as the terabytes of PDFs we have for our big evidence synthesis project, each of which generates dozens of embeddings. For global satellite datasets for remote sensing of nature at 10m scale this is easily petabytes per year (the raw data here would need to come from tape drives). […398 words]
Using wasm to locally explore geospatial layers / Jan 2024
This is an idea proposed as a Cambridge Computer Science Part II project, and is currently being worked on by Sam Forbes. It is supervised by Michael Dales and Anil Madhavapeddy.
Some of my projects like Mapping LIFE on Earth or Remote Sensing of Nature involve geospatial base maps with gigabytes or even terabytes of data. This data is usually split up into multiple GeoTIFFs, each of which has a slice of information. For example, the LIFE persistence maps have around 30000 maps for individual species, and then an aggregated GeoTIFF for mammals, birds, reptiles and so forth.
This project will explore how to build a WebAssembly-based visualisation tool for geospatial ecology data. The existing data is in the form of GeoTIFF files, which are image files with embedded georeferencing information. The application will be applied to files which include information on the prevalence of species in an area, consisting of a global map at 100 m² scale. An existing tool, QGIS, allows ecologists to visualise this data across the entire world, collated by types of species, but this is difficult to work with because of the scale of the data involved. […341 words]
Towards reproducible URLs with provenance / Jan 2024
This is an idea proposed as a Cambridge Computer Science Part II project, and is available for being worked on. It will be supervised by Patrick Ferris and Anil Madhavapeddy.
Vurls are an attempt to add versioning to URI resolution. For example, what should happen when we request https://doi.org/10.1109/SASOW.2012.14 and how do we track the chain of events that leads to an answer coming back? The prototype vurl library written in OCaml outputs the following: […323 words]
Spatial and multi-modal extraction from conservation literature / Jan 2024
This is an idea proposed as a Cambridge Computer Science Part III or MPhil project, and is available for being worked on. It will be supervised by Anil Madhavapeddy, Sadiq Jaffer, Alec Christie and Bill Sutherland.
The Conservation Evidence Copilots database contains information on numerous conservation actions and their supporting evidence. We also have access to a large corpus of academic literature detailing species presence and threats which we have assembled in Cambridge in collaboration with the various journal publishers.
This MPhil project aims to combine these published literature resources with geographic information to propose conservation interventions. The goal is to identify actions that are likely to be effective based on prior evidence and have the potential to produce significant gains in biodiversity. This approach should then enhance the targeting and impact of future conservation efforts and make them more evidence driven. […298 words]
Real-time mapping of changes in species extinction risks / Jan 2024
This is an idea proposed as a Cambridge Computer Science PhD topic, and is currently being worked on by Emilio Luz-Ricca. It is supervised by Andrew Balmford and Anil Madhavapeddy.
Loss of habitat represents the most significant threat to wildlife overall, but advances in satellite sensing have enabled the assessment of habitat extent with comprehensive spatial coverage and reasonable temporal resolution. To address rising demand for metrics to quantify biodiversity, we have developed the LIFE metric (see Mapping LIFE on Earth) that models the effect of land-use changes on species extinction risk as a function of Areas of Habitat (AoH).
This PhD work explores how to deal with the anthropogenic threats beyond simple habitat loss, including hunting, agricultural practices, and the introduction of invasive species. These additional threatening processes degrade habitat quality and lower species occupancy, but are extremely difficult to observe directly via remote sensing. This project will therefore involve a combination of modelling, machine learning and remote sensing data analysis to understand the impact of these additional anthropogenic threats on habitat quality on a per-species basis.
Privacy preserving emissions disclosure techniques / Jan 2024
This is an idea proposed as a Cambridge Computer Science PhD topic, and is currently being worked on by Jessica Man. It is supervised by Martin Kleppmann and Anil Madhavapeddy.
Customers of online services may want to take carbon emissions into account when deciding which service to use, but are currently hindered by a lack of reliable emissions data that is comparable across services. Calculating accurate carbon emissions across a cloud computing pipeline involves a number of stakeholders, none of whom are incentivised to accurately report their emissions for competitive reasons.
This PhD explores mechanisms to support verifiable and privacy-preserving emissions reporting across a chain of energy suppliers, cloud data centres, virtual machine hosting services providers and cloud services providers, which are ultimately passed through to APIs used by customers. We hypothesise that adding verifiable and composable emissions transparency to cloud computing architectures enables providers to compete on the basis of sustainability, resulting in demand-side pressure on cloud services to shift to renewable energy sources.
We published a workshop paper on this topic in Emission Impossible: privacy-preserving carbon emissions claims.
Parallel traversal effect handlers for OCaml / Jan 2024
This is an idea proposed as a Cambridge Computer Science Part II project, and is currently being worked on by Sky Batchelor. It is supervised by Patrick Ferris and Anil Madhavapeddy.
Most existing uses of effect handlers perform synchronous execution of handled effects. Xie et al proposed a traverse handler for parallelisation of independent effectful computations whose effect handlers are outside the parallel part of the program. The paper [^1] gives a sample implementation as a Haskell library with an associated λp calculus that formalises the parallel handlers. […162 words]
Mapping hunting risks for wild meat in protected areas / Jan 2024
This is an idea proposed as a postdoctoral project, and is currently being worked on by Charles Emogor. It is supervised by Milind Tambe and Anil Madhavapeddy.
There is an important balance needed between the biodiversity damage caused by hunting in protected areas and the well-being of local communities that depend on it. One understudied driver of overly damaging hunting in these areas is snaring (as opposed to gun hunting), which potentially increases carcass wastage and hence causes biodiversity harm without proportionate benefit to the community.
This project examines how to improve the efficacy of anti-poaching ranger patrols while also plugging the knowledge gap around wild meat snaring. Both of these research topics can be tackled in a new light with the emergence of machine learning as a data-driven approach to deriving insights from sparse data, and particularly from some of the newer base maps being developed in our Mapping LIFE on Earth project.
Low-power sensing infrastructure for biodiversity / Jan 2024
This is an idea proposed as a Cambridge Computer Science PhD topic, and is currently being worked on by Josh Millar. It is supervised by Hamed Haddadi and Anil Madhavapeddy.
In-situ sensing devices need to be deployed in remote environments for long periods of time, and minimising their power consumption is vital for maximising both their operational lifetime and coverage.
We are exploring the construction of a versatile multi-sensor device (initially based around the ESP32 chipset) and designing an exceptionally low power consumption model by using an on-device reinforcement learning scheduler that can learn to cooperate with other nearby devices.
Our prototype device setup for learning schedules for biodiversity monitoring does pretty well against a number of fixed schedules; the scheduler captures more than 80% of events at less than 50% of the number of activations of the best-performing fixed schedule. You can read more about this in Terracorder: Sense Long and Prosper.
Low-latency wayland compositor in OCaml / Jan 2024
This is an idea proposed as a Cambridge Computer Science Part II project, and is currently being worked on by Tom Thorogood. It is supervised by Ryan Gibb and Anil Madhavapeddy.
When building situated displays and hybrid streaming systems, we need fine-grained composition over what to show on the displays. Wayland is a communications protocol for next-generation display servers used in Unix-like systems.[^0]
It has been adopted as the default display server by Linux distributions including Fedora with KDE, and Ubuntu and Debian with GNOME. It aims to replace the venerable X display server with a modern alternative. X leaves logic such as window management to application software, which has allowed the proliferation of different approaches. Wayland, however, centralizes all this logic in the 'compositor', which assumes both display server and window manager roles.[^1] […267 words]
Legal perspectives on integrity issues in forest carbon / Jan 2024
This is an idea proposed as a postdoctoral project, and has been completed by Sophie Chapman. It was supervised by Anil Madhavapeddy and Eleanor Toye Scott.
Carbon finance offers a vital way to fund urgently needed forest conservation, but there are integrity issues on the supply side.[1] Besides the known issues with carbon quantification,[2] carbon credits are often poorly designed and implemented from a legal perspective. Specifically, in the absence of a clear legal framework for forest carbon credits, contracts tend to conceptualise credits in similar terms to the products of extractive industries, such as mineral mining. This is a factually inaccurate model for carbon credits, since the carbon is not extracted but on the contrary is stored in the trees which remain part of the landscape. This inappropriate model then leads to misunderstandings and misallocations of the rights of the various stakeholders in carbon finance projects and militates against just benefit-sharing arrangements.
This project is exploring a novel legal framework for forest carbon credits which separates carbon tenure (i.e. title and associated property rights to the land and trees which store the carbon) from the carbon rights (i.e. title and associated rights to monetise, sell, count and retire the credits which symbolically represent the carbon stored in the trees), while also specifying the relationship between the carbon tenure and the carbon rights.
[1] See the note on Nature Sustainability commentary on carbon and biodiversity credits.
[2] See the Trusted Carbon Credits project and related papers. […227 words]
Implementing a higher-order choreographic language / Jan 2024
This is an idea proposed as a Cambridge Computer Science Part II project, and has been completed by Rokas Urbonas. It was supervised by Dmitrij Szamozvancev and Anil Madhavapeddy.
This project aims to implement a functional choreographic language inspired by the Pirouette calculus. This language was meant to make the notoriously difficult process of implementing distributed algorithms easier, while offering a practical execution model for multi-participant programs. Additionally, it aimed to match the expressiveness and performance of similar existing solutions.
The project completed very successfully, and resulted in ChorCaml, an embedded DSL for choreographic programming in OCaml. The language facilitates the implementation of distributed algorithms, while offering a clear syntax and safety via the type system. ChorCaml also improves upon existing alternatives in certain common use cases, both in terms of program conciseness and performance. The practicality of the DSL was verified by successfully implementing well-known distributed algorithms such as Diffie-Hellman key exchange and concurrent Karatsuba fast integer multiplication. […163 words]
Gradually debugging type errors / Jan 2024
This is an idea proposed as a Cambridge Computer Science Part II project, and is currently being worked on by Max Carroll. It is supervised by Patrick Ferris and Anil Madhavapeddy.
Reasoning about type errors is very difficult, and requires shifting between static and dynamic types. In OCaml, the type checker asserts ill-typedness but provides little in the way of understanding why the type checker inferred such types. These direct error messages are difficult to understand even for experienced programmers working on larger codebases.
This project will explore how to use gradual types to reason more effectively about such ill-typed programs, by introducing more dynamic types to help some users build an intuition about the problem in their code. The intention is to enable a more exploratory approach to constructing well-typed programs. […131 words]
Generating chunk-free embeddings for LLMs / Jan 2024
This is an idea proposed as a Cambridge Computer Science Part III or MPhil project, and is currently being worked on by Mark Jacobsen. It is supervised by Sadiq Jaffer and Anil Madhavapeddy.
This project aims to explore the development of a chunk-free approach for generating embeddings in Retrieval-Augmented Generation (RAG) models. Traditional RAG workflows often involve manual or predefined chunking of documents, and we seek to bypass this requirement.
Instead, our approach involves generating multiple embeddings for unchunked text using a synthetic dataset created by (e.g.) a 7b parameter LLM. This dataset would feature structured, point-by-point summaries of each paragraph. An off-the-shelf embedding model could then be modified by removing its mean pooling layer and incorporating cross-attention layers. These layers, inspired by T5's encoder-decoder architecture, would enable a frozen set of embeddings to interact with summary-based embeddings via cross-attention, creating a more nuanced chunk-free representation.
Additionally, the research aims to explore adaptive chunking driven by a trained model, allowing context-aware embedding generation end-to-end. This method promises a more integrated and efficient approach, eliminating the need for separate summarization and embedding processes.
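To make the architectural idea concrete, here is a minimal sketch of the two operations mentioned above: mean pooling, which collapses per-token vectors into one embedding, and a single-head cross-attention step. It is not the project's model. It omits the learned projection matrices a real layer would have, and it assumes (the text does not say) that the summary-based embeddings act as queries over the frozen token embeddings. All names and values are illustrative.

```python
import math

def mean_pool(token_vectors):
    """Collapse a list of per-token vectors into one averaged vector."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(v[i] for v in token_vectors) / n for i in range(dim)]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys_values):
    """Scaled dot-product attention with no learned projections.

    Each query produces one output vector: a weighted average of
    keys_values, which serve as both keys and values in this sketch.
    """
    dim = len(keys_values[0])
    scale = math.sqrt(dim)
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, kv)) / scale
                  for kv in keys_values]
        weights = softmax(scores)
        outputs.append([sum(w * kv[i] for w, kv in zip(weights, keys_values))
                        for i in range(dim)])
    return outputs

# Hypothetical frozen per-token embeddings for an unchunked document.
frozen_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Hypothetical summary-point embeddings, used as queries (an assumption).
summary_points = [[2.0, 0.0], [0.0, 2.0]]

single_vector = mean_pool(frozen_tokens)                      # one embedding
multi_vectors = cross_attention(summary_points, frozen_tokens)  # one per summary point
print(len(multi_vectors))
```

With mean pooling the document yields a single vector; without it, each summary point yields its own vector, which is the sense in which the representation becomes multi-embedding rather than chunked.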
Foundation models for complex geospatial tasks / Jan 2024
This is an idea proposed as a Cambridge Computer Science PhD topic, and is currently being worked on by Onkar Gulati. It is supervised by Sadiq Jaffer, Anil Madhavapeddy and David A Coomes.
Self-supervised learning (SSL) represents a shift in machine learning that enables versatile pretrained models to leverage the complex relationships present in dense (oftentimes multispectral and multimodal) remote sensing data. This in turn can accelerate how we address sophisticated downstream geospatial tasks for which current methodologies prove insufficient, ranging from land cover classification to urban building segmentation to crop yield measurement and wildfire forecasting.
This PhD project explores the question of how current SSL methodologies may be altered to tackle remote sensing tasks, and also how to make them amenable to incremental time-series generation as new data regularly comes in from sensing instruments.
Displaying the 15 most recent news items out of 58 in total (see all the items).