Using computational SSDs for vector databases / Jan 2025
This is an idea proposed as a Cambridge Computer Science Part III or MPhil project, and is available to be worked on. It will be supervised by Anil Madhavapeddy and Sadiq Jaffer.
Large pre-trained models can be used to embed media and documents into concise vector representations, with the property that vectors that are "close" to each other are semantically related. ANN (Approximate Nearest Neighbour) search over these embeddings is already used heavily in RAG systems for LLMs and in search-by-example for satellite imagery.
Right now, ANN databases almost exclusively use memory-resident indexes to accelerate this searching. This is a showstopper for larger datasets, such as the terabytes of PDFs we have for our big evidence synthesis project, each of which generates dozens of embeddings. For global satellite datasets for remote sensing of nature at 10m scale, this is easily petabytes per year (the raw data here would need to come from tape drives). […398 words]
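To make the memory bottleneck concrete, here is a minimal sketch of the disk-resident baseline such a project would aim to beat: embeddings live in a memory-mapped file rather than RAM, and a brute-force top-k scan streams over them in chunks. A computational SSD could push exactly this kind of distance scan down to the drive itself. The file name, dimensionality and chunk size are illustrative assumptions.

```python
import numpy as np

DIM, K = 768, 10  # embedding width and neighbours to return (illustrative)
# Memory-mapped store: the OS pages vectors in from disk on demand.
vectors = np.memmap("embeddings.f32", dtype=np.float32, mode="r").reshape(-1, DIM)

def topk(query: np.ndarray, chunk: int = 100_000):
    """Stream chunks off disk, keeping the K nearest by squared L2 distance."""
    best_d = np.full(K, np.inf)
    best_i = np.full(K, -1, dtype=np.int64)
    for start in range(0, len(vectors), chunk):
        block = np.asarray(vectors[start:start + chunk])  # one bulk read
        d = ((block - query) ** 2).sum(axis=1)
        cand_d = np.concatenate([best_d, d])
        cand_i = np.concatenate([best_i, np.arange(start, start + len(block))])
        order = np.argsort(cand_d)[:K]
        best_d, best_i = cand_d[order], cand_i[order]
    return best_i, best_d
```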
Spatial and multi-modal extraction from conservation literature / Jan 2024
This is an idea proposed as a Cambridge Computer Science Part III or MPhil project, and is available to be worked on. It will be supervised by Anil Madhavapeddy, Sadiq Jaffer, Alec Christie and Bill Sutherland.
The Conservation Evidence Copilots database contains information on numerous conservation actions and their supporting evidence. We also have access to a large corpus of academic literature detailing species presence and threats, which we have assembled in Cambridge in collaboration with various journal publishers.
This MPhil project aims to combine these published literature resources with geographic information to propose conservation interventions. The goal is to identify actions that are likely to be effective based on prior evidence and that have the potential to produce significant gains in biodiversity. This approach should enhance the targeting and impact of future conservation efforts and make them more evidence-driven. […298 words]
Generating chunk-free embeddings for LLMs / Jan 2024
This is an idea proposed as a Cambridge Computer Science Part III or MPhil project, and is currently being worked on by Mark Jacobsen. It is supervised by Sadiq Jaffer and Anil Madhavapeddy.
This project explores the development of a chunk-free approach to generating embeddings for Retrieval-Augmented Generation (RAG). Traditional RAG workflows often involve manual or predefined chunking of documents, and we seek to bypass this requirement.
Instead, our approach involves generating multiple embeddings for unchunked text using a synthetic dataset created by (for example) a 7B-parameter LLM. This dataset would feature structured, point-by-point summaries of each paragraph. An off-the-shelf embedding model could then be modified by removing its mean-pooling layer and incorporating cross-attention layers. These layers, inspired by T5's encoder-decoder architecture, would enable a frozen set of embeddings to interact with summary-based embeddings via cross-attention, creating a more nuanced, chunk-free representation.
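A hedged PyTorch sketch of that architectural change: drop the pooling step, and let a learned cross-attention layer fuse the frozen model's token embeddings with embeddings of the synthetic summaries. The dimensions, head count and single-layer depth are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ChunkFreeFuser(nn.Module):
    def __init__(self, d_model: int = 768, n_heads: int = 12):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, frozen_tokens, summary_embs):
        # frozen_tokens: (batch, doc_len, d) from the frozen embedding model.
        # summary_embs:  (batch, n_points, d) from point-by-point summaries.
        attended, _ = self.cross_attn(query=frozen_tokens,
                                      key=summary_embs, value=summary_embs)
        # Residual connection: the frozen representations pass through
        # unchanged, and only the interaction is learned. No mean pooling.
        return self.norm(frozen_tokens + attended)
```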
Additionally, the research aims to explore adaptive chunking driven by a trained model, allowing context-aware embedding generation end-to-end. This method promises a more integrated and efficient approach, eliminating the need for separate summarization and embedding processes.
Diffusion models for terrestrial predictions about land use change / Jan 2024
This is an idea proposed as a Cambridge Computer Science Part III or MPhil project, and is available to be worked on. It will be supervised by Anil Madhavapeddy and Sadiq Jaffer.
This project investigates how to build remote-sensing-driven models of landscape evolution, which we can use to better predict deforestation, flooding and fire risks. Diffusion models are now widespread for image generation and are increasingly being applied to video.[1] In addition, the GenCast project from Google DeepMind used a diffusion model ensemble for weather forecasting, achieving high accuracy compared to traditional methods.[2]
The goal of this project is to train a video diffusion model on time series of optical and radar satellite tiles and evaluate its performance in predicting changes in land use / land cover (such as deforestation or flooding).[3] A stretch goal is to build a user interface over this to predict and visualise the effects of a given change in land cover over time.
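As a starting point, the training objective would simply be the standard denoising-diffusion loss applied to tile sequences. A minimal sketch, assuming tiles are batched into (batch, time, channels, height, width) tensors and that some video denoiser network is supplied; neither detail is prescribed by the cited papers.

```python
import torch

def diffusion_loss(denoiser, x0, num_steps=1000):
    """One DDPM-style training step: noise a tile sequence, predict the noise."""
    b = x0.shape[0]
    t = torch.randint(0, num_steps, (b,), device=x0.device)
    # Linear beta schedule; cumulative alpha-bar sets the noise level per step.
    betas = torch.linspace(1e-4, 0.02, num_steps, device=x0.device)
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)[t].view(b, 1, 1, 1, 1)
    noise = torch.randn_like(x0)
    x_t = alpha_bar.sqrt() * x0 + (1.0 - alpha_bar).sqrt() * noise
    return torch.nn.functional.mse_loss(denoiser(x_t, t), noise)
```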
1. "Video Diffusion Models: A Survey" (May 2024), https://video-diffusion.github.io
2. "GenCast: Diffusion-based ensemble forecasting for medium range weather", arXiv:2312.15796
3. "DiffusionSat: A Generative Foundation Model for Satellite Imagery" (Dec 2023)
Deep learning for decomposing sound into vector audio / Jan 2024
This is an idea proposed as a Cambridge Computer Science Part III or MPhil project, and is available to be worked on. It will be supervised by Trevor Agus and Anil Madhavapeddy.
All that we hear is mediated through cues transmitted to the brain from the cochlea, which acts like a bank of auditory filters centred on a wide range of frequencies. Much of our knowledge of hearing comes from psychoacoustical experiments involving simple sounds, like sine waves, whose synthesis parameters are closely related to the cues available beyond the cochlea. For recorded sounds, however, many types of cue are available, but our use of them is limited by the extent to which they can be manipulated in a controlled fashion. [^1] [^2] […267 words]
Composable diffing for heterogeneous file formats / Jan 2024
This is an idea proposed as a Cambridge Computer Science Part III or MPhil project, and is available to be worked on. It will be supervised by Patrick Ferris and Anil Madhavapeddy.
When working with large-scale geospatial data, we also have to deal with a variety of file formats, such as CSV, JSON, GeoJSON and GeoTIFF. Each of these formats has its own structure and semantics, and it is often necessary to compare and merge data across them. The conventional solution for source code would be to use a tool such as Git, but this approach is not always feasible for data, as it requires a text-based format structured in a way that can be compared line by line.
This project explores the design of a composable diffing specification that can compare and merge data across heterogeneous file formats. The project will involve designing a domain-specific language for specifying diffing rules, and implementing a prototype tool based on it. Crucially, the tool should be composable: it should be possible to combine the diffing rules for individual formats to handle new combinations of them. […309 words]
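To illustrate what "composable" might mean here, a small sketch of combinator-style diffing in Python; the names (atom, fields) and the path-based diff representation are hypothetical, not a proposed design.

```python
from typing import Any, Callable

# A diff is a list of (path, old, new) changes.
Diff = list[tuple[str, Any, Any]]

def atom(path: str = "") -> Callable[[Any, Any], Diff]:
    """Leaf differ: report a change when two scalar values differ."""
    def diff(old, new):
        return [] if old == new else [(path, old, new)]
    return diff

def fields(**differs) -> Callable[[dict, dict], Diff]:
    """Compose per-field differs into a differ for dict-like records."""
    def diff(old, new):
        out: Diff = []
        for name, d in differs.items():
            out += [(f"{name}.{p}" if p else name, o, n)
                    for p, o, n in d(old.get(name), new.get(name))]
        return out
    return diff

# Usage: the same combinators can describe a GeoJSON feature or a CSV row.
feature = fields(type=atom(), geometry=fields(type=atom(), coordinates=atom()))
print(feature({"type": "Feature", "geometry": {"type": "Point", "coordinates": [0, 0]}},
              {"type": "Feature", "geometry": {"type": "Point", "coordinates": [0, 1]}}))
# -> [('geometry.coordinates', [0, 0], [0, 1])]
```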
Accurate summarisation of threats for conservation evidence literature / Jan 2024
This is an idea proposed as a Cambridge Computer Science Part III or MPhil project, and is currently being worked on by Kittson Hamill. It is supervised by Anil Madhavapeddy and Sadiq Jaffer.
At the Conservation Evidence Copilots project, we are interested in constructing a taxonomy of threats to wildlife from the literature. This involves scanning the body of conservation literature and gathering and synthesising evidence for conservation interventions from a threats perspective. Once the text has been retrieved, it needs to be summarised in a way that is accurate, concise and relevant, and then verified with human experts. This is particularly important for conservation evidence, where the key findings need to be communicated clearly to inform policy and practice.
This project therefore investigates how to generate threat summaries from the CE literature using LLMs and RAG pipelines, and how to verify their accuracy. Our goal is to develop a pipeline that can reliably go from extracting relevant information from text to producing a summary that is verifiably (by a human) correct. […168 words]
Species distribution modelling using CNNs / Jan 2023
This is an idea proposed as a Cambridge Computer Science Part III or MPhil project, and has been completed by Emily Morris. It was supervised by Anil Madhavapeddy and David A Coomes.
The goal of this project is to compare the performance of MaxEnt techniques with that of a CNN model on the task of species distribution modelling.
The CNN model will use remote sensing data as part of its input features: a combination of LULC data (e.g. Dynamic World) and satellite imagery (Planet/Landsat 8/Sentinel 2). We will also use more classical environmental variables from WorldClim, along with soil data.
For evaluation, we will focus on proteas as the species distribution modelling task, for which we have two observation datasets: the Protea Atlas and iNaturalist. The CNN work is largely based on that of Gillespie et al., who present a model that takes in an RGB image and an embedding of environmental variables and predicts which species are present in the image. This method performs multi-species presence modelling, and the use of other species is central to it: including them provides training examples that act as pseudo-absences for some species, circumventing the lack of negative data.
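A hedged sketch of the model shape just described: a small CNN encodes the RGB tile, the environmental variables are embedded separately, and the fused representation predicts one presence logit per species. Layer sizes are illustrative and not taken from Gillespie et al.

```python
import torch
import torch.nn as nn

class SpeciesCNN(nn.Module):
    def __init__(self, n_species: int, n_env: int):
        super().__init__()
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.env_encoder = nn.Sequential(nn.Linear(n_env, 64), nn.ReLU())
        self.head = nn.Linear(64 + 64, n_species)  # one logit per species

    def forward(self, rgb, env):
        z = torch.cat([self.image_encoder(rgb), self.env_encoder(env)], dim=-1)
        return self.head(z)  # multi-label presence logits

# Multi-label loss: species observed at a site are positives, and the
# co-occurring species lists supply pseudo-absences for the remainder.
loss_fn = nn.BCEWithLogitsLoss()
```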
This project was conducted successfully, and presented at the CCAI Workshop at NeurIPS as 'Towards Scalable Deep Species Distribution Modelling using Global Remote Sensing'.
Reverse emulating agent-based models for policy simulation / Jan 2023
This is an idea proposed as a Cambridge Computer Science Part III or MPhil project, and has been completed by Pedro Sousa. It was supervised by Anil Madhavapeddy and Sadiq Jaffer.
Governments increasingly rely on simulation tools to inform policy design. Agent-based models (ABMs) simulate complex systems to study the emergent phenomena arising from individual behaviours and interactions in agent populations. However, these ABMs force iterative, time-consuming and unmethodical tuning of key policy "levers" (input parameters) to steer the model towards envisioned outcomes. To unlock a more natural workflow, this project investigates reverse emulation, a novel approach that streamlines policy design by using probabilistic machine learning to predict the parameter values that yield desired policy outcomes. […192 words]
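A minimal sketch of the reverse-emulation loop, with a toy function standing in for a real ABM and a deterministic regressor standing in for the probabilistic model the project proposes: sample policy parameters, simulate forward, then fit a model from outcomes back to parameters.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def abm(theta: np.ndarray) -> np.ndarray:
    """Placeholder simulator: maps policy parameters to aggregate outcomes."""
    return np.stack([theta[:, 0] * theta[:, 1], np.sin(theta).sum(axis=1)], axis=1)

rng = np.random.default_rng(0)
thetas = rng.uniform(0, 1, size=(5000, 2))   # sampled policy levers
outcomes = abm(thetas)                       # forward simulation runs

inverse = RandomForestRegressor().fit(outcomes, thetas)  # outcomes -> levers
target = np.array([[0.25, 1.2]])             # desired policy outcome
print(inverse.predict(target))               # candidate parameter values
```

A probabilistic treatment would instead return a distribution over parameter values, since many lever settings can produce the same outcome.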
Assessing high-performance lightweight compression formats for geospatial computation / Jan 2023
This is an idea proposed as a Cambridge Computer Science Part III or MPhil project, and has been completed by Omar Tanner. It was supervised by Anil Madhavapeddy and Sadiq Jaffer.
Geospatial data processing can benefit from applying lightweight compression techniques to data in GeoTIFF format, addressing the challenge of modern CPU throughput surpassing RAM bandwidth. This project will explore how to mitigate the impact of poor cache locality and the resulting memory bottlenecks by leveraging CPU superscalar capabilities and SIMD instructions. By implementing SIMD-optimised compression, data can remain compressed in RAM and closer to the CPU caches, facilitating faster access and alleviating memory constraints. […113 words]
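As a flavour of the kind of lightweight scheme involved, here is a sketch of delta encoding for a raster scanline in NumPy, whose vectorised operations stand in for the explicit SIMD intrinsics a real implementation would use; a full scheme would additionally bit-pack the small residuals.

```python
import numpy as np

def delta_encode(band: np.ndarray) -> np.ndarray:
    """Store the first value plus per-pixel deltas; small deltas pack tightly."""
    deltas = np.diff(band.astype(np.int32))
    return np.concatenate(([band[0]], deltas))

def delta_decode(encoded: np.ndarray) -> np.ndarray:
    # A prefix sum reconstructs the original values; this is exactly the
    # operation SIMD-optimised decoders vectorise.
    return np.cumsum(encoded)

row = np.array([1000, 1001, 1003, 1002, 1005], dtype=np.int32)
assert np.array_equal(delta_decode(delta_encode(row)), row)
```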
Using effect handlers for efficient parallel scheduling / Jan 2022
This is an idea proposed as a Cambridge Computer Science Part III or MPhil project, and has been completed by Bartosz Modelski. It was supervised by Anil Madhavapeddy.
Modern hardware is so parallel, and workloads so concurrent, that there is no single, perfect scheduling strategy across a complex application software stack. Therefore, there are significant performance advantages to be gained from customizing and composing schedulers.
Multicore parallelism is here to stay, and in contrast with clock frequency increases, schedulers have to be carefully crafted to take full advantage of horizontal scaling in the underlying architecture. That's because designs need to evolve: synchronization primitives such as locks or atomics do not scale endlessly to many cores, and a naive work-stealing scheduler that may have been good enough on a 16-thread Intel Xeon in 2012 will fail to utilize all 128 threads of a contemporary AMD Threadripper in 2022. Modern high-core-count architectures also feature non-uniform memory, so memory latency patterns vary with the topology, and scheduling decisions will benefit from taking the memory hierarchy into account. Moreover, the non-uniformity also appears in consumer products such as the Apple M1 or Intel Core i7-1280P, which feature two sets of cores: one optimized for performance and another for efficiency. […483 words]
Spatial Name System / Jan 2022
This is an idea proposed as a Cambridge Computer Science Part III or MPhil project, and has been completed by Ryan Gibb. It was supervised by Anil Madhavapeddy and Jon Crowcroft.
The development of emerging classes of hardware, such as Internet of Things devices and Augmented Reality headsets, has outpaced the development of Internet infrastructure. We identify problems with latency, security and privacy in the global, hierarchical, distributed Domain Name System. To remedy this, we propose the Spatial Name System, an alternative network architecture that relies on the innate physicality of these devices. A device's location serves as a pre-existing unique identifier, allowing us to identify devices locally based on their physical presence. A naming system tailored to the physical world of ubiquitous computing can enable reliable, low-latency, secure and private communication. […196 words]
Scalable agent-based models for optimized policy design / Jan 2022
This is an idea proposed as a Cambridge Computer Science Part III or MPhil project, and has been completed by Sharan Agrawal. It was supervised by Anil Madhavapeddy and Srinivasan Keshav.
As the world faces the twin crises of climate change and biodiversity loss, the need for integrated policy approaches addressing both is paramount. To help address this, this project investigates a new agent-based model dubbed the VDSK-B. Drawing on Dasgupta's review of the economics of biodiversity, it builds on the Dystopian Schumpeter meets Keynes (DSK) climate economics model to link together the climate, economy and biosphere. This is the first proposed ABM that integrates all three key elements.
The project also investigates how to scale such ABMs to be applicable to global policy design and to planetary-sized models. A new ABM framework called SalVO expresses agent updates as recursive applications of pure agent functions. This formalism differs from existing computational ABM models but is shown to be expressive enough to emulate a Turing-complete language. SalVO is built on a JAX backend and designed to be scalable, vectorized and optimizable. Employing hardware acceleration, tests showed it was more performant and scaled better on a single machine than any existing ABM framework, such as FLAME GPU. […252 words]
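A minimal sketch of that formalism in JAX: a pure per-agent update function, vectorised across the population with vmap and applied recursively over timesteps with lax.scan. The toy dynamics are illustrative, not SalVO's actual API.

```python
import jax
import jax.numpy as jnp

def agent_step(state, population_mean):
    """Pure update for one agent: drift towards the population mean."""
    return state + 0.1 * (population_mean - state)

def world_step(states, _):
    mean = jnp.mean(states)
    # vmap vectorises the pure agent function over the whole population.
    new_states = jax.vmap(agent_step, in_axes=(0, None))(states, mean)
    return new_states, new_states

init = jnp.linspace(0.0, 1.0, 10_000)  # one scalar state per agent
# lax.scan applies the update recursively over 100 timesteps, keeping the
# whole computation JIT-compilable and hardware-accelerated.
final, history = jax.lax.scan(world_step, init, None, length=100)
```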
Void Processes: Minimising privilege by default / Jan 2021
This is an idea proposed as a Cambridge Computer Science Part III or MPhil project, and has been completed by Jake Hillion. It was supervised by Anil Madhavapeddy.
Void processes aim to make it easier for all developers to produce effectively privilege-separated applications. The project has two primary goals: to show the merits of starting from zero privilege, and to provide the utilities that make this feasible for the average developer.
Building void processes involves first reliably removing all privilege from a process, then systematically adding back in what is required, and no more. This project utilises Linux namespaces to revoke privilege from an application, showing how this can be done and why it's easier in some domains than others. It then shows how to inject sufficient privilege for applications to perform useful work, developing new APIs that are friendly to privilege separation. These elements compose into a shim called the "void orchestrator", a framework for restricting Linux processes. […158 words]
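The core namespace mechanism is illustrated by this hedged Python sketch, which calls unshare(2) through ctypes to move the current process into fresh user and network namespaces; it shows only the "start from zero" step, not the void orchestrator's actual interface, and is Linux-only.

```python
import ctypes
import os

CLONE_NEWUSER = 0x10000000  # new user namespace: works unprivileged
CLONE_NEWNET = 0x40000000   # new network namespace: no usable interfaces

libc = ctypes.CDLL("libc.so.6", use_errno=True)
if libc.unshare(CLONE_NEWUSER | CLONE_NEWNET) != 0:
    err = ctypes.get_errno()
    raise OSError(err, os.strerror(err))

# From here the process has no network and no mapped uid; any privilege it
# needs (a uid mapping, a veth device, bind mounts) must be added back
# explicitly and deliberately.
os.execvp("ip", ["ip", "link"])  # shows only the empty namespace's loopback
```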