A hardware description language using OCaml effects / Mar 2025
This is an idea proposed as a Cambridge Computer Science Part III or MPhil project, and is available for being worked on. It may be co-supervised with KC Sivaramakrishnan and Andy Ray.
Programming FPGAs using functional languages is a very good fit for the problem domain. OCaml has the HardCaml ecosystem for expressing hardware designs, building generic designs using the power of the language, and then simulating designs and converting them to Verilog or VHDL.
HardCaml is very successfully used in production at places like Jane Street, but needs quite a lot of prerequisite knowledge about the full OCaml language. In particular, it makes very heavy use of the module system in order to build up the circuit description as an OCaml data structure.
Instead of building up a circuit as the output of the OCaml program, it would be very cool if we could directly implement the circuit as OCaml code by evaluating it. This is an approach that works very successfully in the Clash Haskell HDL, as described in this thesis. Clash uses a number of advanced Haskell type-level features to encode fixed-length vectors (very convenient for hardware description) and has an interactive REPL that allows for exploration without requiring a separate test bench.
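To make the direct-evaluation idea concrete, here is a minimal sketch in Python (not OCaml), using a generator coroutine as a stand-in for an effect handler: the circuit body reads in direct style, and suspending at `yield` plays the role of performing a "register" effect that latches state at each clock edge. All names and the single-register design are illustrative, not a proposed API.

```python
def accumulator():
    """A toy circuit: adds each cycle's input to a register and outputs the sum."""
    acc = 0      # register state, threaded by the coroutine itself
    out = None
    while True:
        x = yield out   # suspend until this cycle's input arrives (the "effect")
        acc = acc + x   # combinational logic
        out = acc       # register latches at the "clock edge"

def simulate(circuit, inputs):
    """Drive the circuit for one clock cycle per input, collecting outputs."""
    gen = circuit()
    next(gen)  # prime the coroutine up to its first yield
    return [gen.send(x) for x in inputs]

print(simulate(accumulator, [1, 2, 3]))  # [1, 3, 6]
```

In OCaml 5 the same shape would use `perform` on a register effect with a handler threading the state, which is exactly the kind of encoding this project would explore.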
[…296 words]

Using computational SSDs for vector databases / Feb 2025
This is an idea proposed as a Cambridge Computer Science Part III or MPhil project, and is available for being worked on. It may be co-supervised with Sadiq Jaffer.
Large pre-trained models can be used to embed media/documents into concise vector representations with the property that vectors that are "close" to each other are semantically related. ANN (Approximate Nearest Neighbour) search on these embeddings is used heavily already in RAG systems for LLMs or search-by-example for satellite imagery.
Right now, ANN databases almost exclusively use memory-resident indexes to accelerate this search. This is a showstopper for larger datasets, such as the terabytes of PDFs we have for our big evidence synthesis project, each of which generates dozens of embeddings. For global satellite datasets for remote sensing of nature at 10m scale, this is easily petabytes per year (the raw data here would need to come from tape drives).
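For reference, the search these indexes accelerate is nearest-neighbour lookup over embedding vectors. A minimal exact baseline looks like the sketch below (synthetic data, illustrative dimensions); ANN indexes such as HNSW, IVF or DiskANN approximate this so that the full corpus need not be scanned, or even held, in memory:

```python
# Exact nearest-neighbour baseline over unit-normalised embeddings.
import numpy as np

rng = np.random.default_rng(0)
docs = rng.normal(size=(10_000, 384)).astype(np.float32)
docs /= np.linalg.norm(docs, axis=1, keepdims=True)   # unit vectors

def top_k(query, corpus, k=5):
    """Cosine similarity reduces to a dot product on unit vectors."""
    q = query / np.linalg.norm(query)
    scores = corpus @ q
    idx = np.argpartition(-scores, k)[:k]             # k best, unordered
    return idx[np.argsort(-scores[idx])]              # k best, sorted

query = rng.normal(size=384).astype(np.float32)
hits = top_k(query, docs)
```

The project's question is what happens when `docs` no longer fits in RAM at all, and how much of this search a computational SSD could run near the data.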
[…398 words]

Affordable digitisation of insect collections using photogrammetry / Feb 2025
This is an idea proposed as a Cambridge Computer Science Part III or MPhil project, and is currently being worked on by Beatrice Spence, Arissa-Elena Rotunjanu and Anna Yiu. It is co-supervised with Tiffany Ki and Edgar Turner.
Insects dominate animal biodiversity and are sometimes called "the little things that run the world". They play a disproportionate role in ecosystem functioning, are highly sensitive to environmental change, and are often considered early indicators of responses in other taxa. There is widespread concern about global insect declines,[^1] yet the evidence behind such declines is heavily biased towards the Global North, and much of it is drawn from short-term biodiversity datasets.[^2] [^3]
The Insect Collection at the University Museum of Zoology, Cambridge holds over 1.2 million specimens. These include specimens collected from the early 19th century to the present day. Most specimens remain undocumented and unavailable for analysis. However, they contain data that are critical to understanding long-term species and community responses to anthropogenic change, and vital to evaluating whether short-term declines are representative of longer-term trends[^4] [^5]. As such, unlocking these insect collections is of paramount importance, and the large-scale nature of these collections necessitates the development of an efficient and effective digitisation process.
The 3D digitisation of specimens using current methods is either highly time-intensive or expensive, rendering it impossible to achieve across the collection in a reasonable time-frame. Yet, 3D models of specimens have huge potential for investigating species morphological responses to anthropogenic changes over time and identification of trade-offs in morphological responses within a 3D morphospace.
[…540 words]

Enhancing Navigation Algorithms with Semantic Embeddings / Aug 2024
This is an idea proposed as a Cambridge Computer Science Part III or MPhil project, and is currently being worked on by Gabriel Mahler.
Pathfinding algorithms used in modern navigation systems draw on a wealth of geospatial data, such as that from OpenStreetMap. Nevertheless, they often operate under one-size-fits-all assumptions, with the simple objective of minimizing the anticipated travel time.
By leveraging vectorized geospatial descriptions, this project aims to build a framework for finding walking routes that seek to achieve much more customizable objectives. Given a set of specific requirements and preferences ("avoid dark streets at night"), we aim to leverage the semantic representation of a given area to select relevant geospatial data.
Once points of interest are selected, we then generate a specific walking route that seeks to fulfill the initial requirements by maximizing their vectorized similarity to a semantic representation of the route. The potential of the framework, and its versatility relative to existing path-finding algorithms, can be evaluated through experiments that reflect real-world scenarios such as accessibility requirements and differing goals ("are we going shopping, or just for a walk in nature?").
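As a rough illustration of the scoring step (all data, names and dimensions made up for the sketch), a candidate route can be represented by aggregating the embeddings of its segments and scored by cosine similarity against an embedding of the stated preference:

```python
# Score candidate routes against a preference embedding, e.g. "well-lit streets".
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def route_score(segment_embeddings, preference_embedding):
    """Represent a route by the mean of its segments' embeddings, then compare."""
    route_vec = np.mean(segment_embeddings, axis=0)
    return cosine(route_vec, preference_embedding)

rng = np.random.default_rng(1)
pref = rng.normal(size=64)                        # stand-in preference embedding
routes = [rng.normal(size=(10, 64)) for _ in range(3)]  # 3 routes of 10 segments
best = max(range(3), key=lambda i: route_score(routes[i], pref))
```

In the actual project, the segment embeddings would come from vectorized geospatial descriptions rather than random data, and the scoring would steer route generation rather than rank fixed candidates.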
Related reading
Diffusion models for terrestrial predictions about land use change / May 2024
This is an idea proposed as a Cambridge Computer Science Part III or MPhil project, and has expired. It may be co-supervised with Sadiq Jaffer.
This project investigates how to build remote-sensing-data-driven models for the evolution of landscapes, which we can use to better predict deforestation, flooding and fire risks. Diffusion models are now widespread for image generation and are increasingly being applied to video.[1] In addition, the GenCast project from Google DeepMind used a diffusion model ensemble for weather forecasting, achieving a high degree of accuracy compared to traditional methods.[2]
The goal of this project is to train a video diffusion model on time series of optical and radar satellite tiles and evaluate its performance in predicting changes in land use / land cover (such as deforestation or flooding).[3] A stretch goal is to build a user interface over this to predict and visualise the effects of a given change in land cover over time.
1. "Video Diffusion Models: A Survey" (May 2024), https://video-diffusion.github.io
2. "GenCast: Diffusion-based ensemble forecasting for medium range weather", arXiv:2312.15796
3. "DiffusionSat: A Generative Foundation Model for Satellite Imagery" (Dec 2023)
Spatial and multi-modal extraction from conservation literature / Jan 2024
This is an idea proposed as a Cambridge Computer Science Part III or MPhil project, and has expired. It may be co-supervised with Sadiq Jaffer, Alec Christie and Bill Sutherland.
The Conservation Evidence Copilots database contains information on numerous conservation actions and their supporting evidence. We also have access to a large corpus of academic literature detailing species presence and threats which we have assembled in Cambridge in collaboration with the various journal publishers.
This MPhil project aims to combine these published literature resources with geographic information to propose conservation interventions. The goal is to identify actions that are likely to be effective based on prior evidence and have the potential to produce significant gains in biodiversity. This approach should then enhance the targeting and impact of future conservation efforts and make them more evidence driven.
[…298 words]

Generating chunk-free embeddings for LLMs / Jan 2024
This is an idea proposed as a Cambridge Computer Science Part III or MPhil project, and is currently being worked on by Mark Jacobsen. It is co-supervised with Sadiq Jaffer.
This project aims to explore the development of a chunk-free approach for generating embeddings in Retrieval-Augmented Generation (RAG) models. Traditional RAG workflows often involve manual or predefined chunking of documents, and we seek to bypass this requirement.
Instead, our approach involves generating multiple embeddings for unchunked text using a synthetic dataset created by (e.g.) a 7b parameter LLM. This dataset would feature structured, point-by-point summaries of each paragraph. An off-the-shelf embedding model could then be modified by removing its mean pooling layer and incorporating cross-attention layers. These layers, inspired by T5's encoder-decoder architecture, would enable a frozen set of embeddings to interact with summary-based embeddings via cross-attention, creating a more nuanced chunk-free representation.
Additionally, the research aims to explore adaptive chunking driven by a trained model, allowing context-aware embedding generation end-to-end. This method promises a more integrated and efficient approach, eliminating the need for separate summarization and embedding processes.
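As a hedged sketch of the attention mechanism involved (NumPy, random weights, illustrative dimensions; not the project's actual model), single-head cross-attention lets one set of embeddings attend to another, which is the role the added layers would play in place of mean pooling:

```python
# Single-head scaled dot-product cross-attention between two embedding sets.
import numpy as np

def cross_attention(queries, keys_values, d_k=32, seed=0):
    """queries: (n, d) e.g. frozen token embeddings;
    keys_values: (m, d) e.g. summary-based embeddings."""
    rng = np.random.default_rng(seed)
    d = queries.shape[1]
    # Random projection weights stand in for trained parameters.
    Wq, Wk, Wv = (rng.normal(scale=d ** -0.5, size=(d, d_k)) for _ in range(3))
    Q, K, V = queries @ Wq, keys_values @ Wk, keys_values @ Wv
    scores = Q @ K.T / np.sqrt(d_k)                      # (n, m) attention logits
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)        # row-wise softmax
    return weights @ V                                   # (n, d_k) contextualised output

tokens = np.random.default_rng(1).normal(size=(20, 64))   # "frozen" embeddings
summaries = np.random.default_rng(2).normal(size=(4, 64)) # summary embeddings
out = cross_attention(tokens, summaries)                  # shape (20, 32)
```

In the proposed architecture these layers would sit inside a modified off-the-shelf embedding model and be trained end-to-end, rather than using random projections.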
Deep learning for decomposing sound into vector audio / Jan 2024
This is an idea proposed as a Cambridge Computer Science Part III or MPhil project, and has expired. It may be co-supervised with Trevor Agus.
All that we hear is mediated through cues transmitted to the brain from the cochlea, which acts like a bank of auditory filters centred on a wide range of centre frequencies. Much of our knowledge of hearing comes from psychoacoustical experiments involving simple sounds, like sine waves, whose synthesis parameters are closely related to the cues available beyond the cochlea. For recorded sounds, however, many types of cue are available, and our use of them is limited by the extent to which they can be manipulated in a controlled fashion.[^1] [^2]
[…267 words]

Composable diffing for heterogeneous file formats / Jan 2024
This is an idea proposed as a Cambridge Computer Science Part III or MPhil project, and has expired. It may be co-supervised with Patrick Ferris.
When dealing with large-scale geospatial data, we also have to deal with a variety of file formats, such as CSV, JSON, GeoJSON and GeoTIFF. Each of these formats has its own structure and semantics, and it is often necessary to compare and merge data across them. The conventional solution for source code would be to use a tool such as Git. However, this approach is not always feasible, as it requires the data to be in a text-based format, structured in a way that can be compared line by line.
This project explores the design of a composable diffing specification that can compare and merge data across heterogeneous file formats. The project will involve designing a domain-specific language for specifying diffing rules, and implementing a prototype tool based on it. Crucially, the tool should be composable: it should be possible to combine diffing rules for individual formats into comparisons that span different file formats.
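To illustrate the composability idea (the combinators here are invented for the sketch, not a proposed design), per-shape diffing rules can be written separately and combined by dispatching on the data's type, so that one differ handles nested structures such as parsed JSON:

```python
# Toy composable differ: small rules combined into a recursive structural diff.
def diff_scalars(a, b, path=""):
    """Base rule: report a single edit when two leaf values differ."""
    return [] if a == b else [(path, a, b)]

def diff_dicts(diff, a, b, path=""):
    """Combinator: lift an element differ to dictionaries (e.g. JSON objects)."""
    edits = []
    for key in sorted(set(a) | set(b)):
        edits += diff(a.get(key), b.get(key), f"{path}/{key}")
    return edits

def diff_json(a, b, path=""):
    """Compose the rules: recurse into objects, fall back to scalar diffing."""
    if isinstance(a, dict) and isinstance(b, dict):
        return diff_dicts(diff_json, a, b, path)
    return diff_scalars(a, b, path)

edits = diff_json({"type": "Feature", "id": 1}, {"type": "Feature", "id": 2})
# [("/id", 1, 2)]
```

A real tool would add rules for rows (CSV), geometries (GeoJSON) and rasters (GeoTIFF), with the DSL describing how such rules compose.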
[…309 words]

Accurate summarisation of threats for conservation evidence literature / Jan 2024
This is an idea proposed as a Cambridge Computer Science Part III or MPhil project, and is currently being worked on by Kittson Hamill. It is co-supervised with Sadiq Jaffer.
At the Conservation Evidence Copilots project, we are interested in constructing a taxonomy of threats to wildlife from the literature. This involves scanning the body of conservation literature and gathering/synthesising evidence for conservation interventions from a threats perspective. Once the text has been retrieved, it needs to be summarised in a way that is accurate, concise and relevant and verified with human experts. This is particularly important for conservation evidence, where the key findings need to be communicated clearly to inform policy and practice.
This project therefore investigates how to generate threats, and to verify their accuracy as generated by LLMs and RAG pipelines from the CE literature. Our goal is to develop a pipeline that can reliably go from extracting relevant information from text to a summary that is verifiably (by a human) correct.
As of June 2025, the project has been successfully completed and submitted for Kittson's MPhil. A test version of the avian threats dataset is online for browsing, and we're spending the summer working on widening the evaluation with the wider CE team.
Related Reading
- The Ragas framework for RAG evaluation
- "CheckEmbed: Effective Verification of LLM Solutions to Open-Ended Tasks", arXiv:2406.02524v2, June 2024
- "Calibrating Sequence Likelihood Improves Conditional Language Generation", arXiv:2210.00045, September 2022
Species distribution modelling using CNNs / Feb 2023
This is an idea proposed as a Cambridge Computer Science Part III or MPhil project, and has been completed by Emily Morris. It was co-supervised with David Coomes.
The goal of this project is to compare the performance of MaxEnt techniques to that of a CNN model for the task of species distribution modelling.
The CNN model will use remote sensing data as part of the input features. The remote sensing data we plan on using is a combination of LULC data (e.g. Dynamic World) and satellite imagery (Planet/Landsat 8/Sentinel 2). We will also use more classical environmental variables from WorldClim and soil data.
To evaluate it, we will focus on proteas for the species distribution modeling task. We have two observation data sets: the Protea Atlas and iNaturalist. The work for the CNN is largely based on the work done by Gillespie et al, who present a model that takes in an RGB image and an embedding for environment variables and predicts which species are present in the image. This method performs multispecies presence modeling and the use of other species is somewhat central to the method. Including other species gives training examples which are pseudo-absences for some species, circumventing the issue of the lack of negative data.
This project was conducted successfully, and presented at the CCAI Workshop at NeurIPS as 'Towards Scalable Deep Species Distribution Modelling using Global Remote Sensing'.
Reverse emulating agent-based models for policy simulation / Jan 2023
This is an idea proposed as a Cambridge Computer Science Part III or MPhil project, and has been completed by Pedro Sousa. It was co-supervised with Sadiq Jaffer.
Governments increasingly rely on simulation tools to inform policy design. Agent-based models (ABMs) simulate complex systems to study the emergent phenomena of individual behaviours and interactions in agent populations. However, these ABMs force an iterative, time-consuming, unmethodical parameter tuning of key policy "levers" (or input parameters) to steer the model towards the envisioned outcomes. To unlock a more natural workflow, this project investigates reverse emulation, a novel approach that streamlines policy design using probabilistic machine learning to predict parameter values that yield the desired policy outcomes.
[…192 words]

Assessing high-performance lightweight compression formats for geospatial computation / Jan 2023
This is an idea proposed as a Cambridge Computer Science Part III or MPhil project, and has been completed by Omar Tanner. It was co-supervised with Sadiq Jaffer.
Geospatial data processing can benefit from applying lightweight compression techniques to data in GeoTIFF format, addressing the challenge of modern CPU bandwidth surpassing RAM bandwidths. This project will explore how to mitigate the impact of poor cache locality and the resulting memory bottlenecks by leveraging CPU superscalar capabilities and SIMD instructions. By implementing SIMD-optimised compression, data can remain compressed in RAM and closer to the CPU caches, facilitating faster access and alleviating memory constraints.
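As a toy illustration of the underlying idea (NumPy rather than SIMD intrinsics, and synthetic data), delta-encoding a monotone band of 32-bit values into 8-bit deltas keeps it four times smaller in RAM, with a vectorised cumulative sum reconstructing it on access. Real lightweight codecs such as Stream VByte or FastPFor apply the same principle with SIMD at far higher rates:

```python
# Delta-encode a monotonically increasing uint32 array into uint8 deltas.
import numpy as np

rng = np.random.default_rng(0)
values = np.cumsum(rng.integers(0, 200, size=100_000, dtype=np.uint32),
                   dtype=np.uint32)            # synthetic sorted band

def delta_encode(v):
    deltas = np.diff(v, prepend=v[:1])         # first delta is 0 by construction
    assert deltas.max() < 256                  # narrow deltas hold in this example
    return v[0], deltas.astype(np.uint8)       # 4x smaller than the original

def delta_decode(first, deltas):
    out = deltas.astype(np.uint32)
    out[0] = first
    return np.cumsum(out, dtype=np.uint32)     # vectorised reconstruction

first, deltas = delta_encode(values)
restored = delta_decode(first, deltas)         # bit-identical round trip
```

The project would evaluate such codecs on real GeoTIFF bands, where the interesting trade-off is decode bandwidth against compression ratio.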
[…113 words]

Using effect handlers for efficient parallel scheduling / Jan 2022
This is an idea proposed as a Cambridge Computer Science Part III or MPhil project, and has been completed by Bartosz Modelski.
Modern hardware is so parallel and workloads are so concurrent, that there is no single, perfect scheduling strategy across a complex application software stack. Therefore, there are significant performance advantages to be gained from customizing and composing schedulers.
Multicore parallelism is here to stay, and in contrast with clock frequency increases, schedulers have to be carefully crafted in order to take full advantage of horizontal scaling of the underlying architecture. That's because designs need to evolve: synchronization primitives such as locks or atomics do not scale endlessly to many cores, and a naive work-stealing scheduler that may have been good enough on a 16-thread Intel Xeon in 2012 will fail to utilize all 128 threads of a contemporary AMD Threadripper in 2022. Modern high-core-count architectures also feature non-uniform memory, so memory latency patterns vary with the topology, and scheduling decisions will benefit from taking the memory hierarchy into account. Moreover, the non-uniformity also appears in consumer products such as the Apple M1 or Intel Core i7-1280P, which pair two sets of cores: one optimized for performance and another for efficiency.
[…483 words]

Spatial Name System / Jan 2022
This is an idea proposed as a Cambridge Computer Science Part III or MPhil project, and has been completed by Ryan Gibb. It was co-supervised with Jon Crowcroft.
The development of emerging classes of hardware such as Internet of Things devices and Augmented Reality headsets has outpaced the development of Internet infrastructure. We identify problems with latency, security and privacy in the global hierarchical distributed Domain Name System. To remedy this, we propose the Spatial Name System, an alternative network architecture that relies on the innate physicality of these devices. Using a device's pre-existing unique identifier, namely its location, allows us to identify devices locally based on their physical presence. A naming system tailored to the physical world of ubiquitous computing can enable reliable, low-latency, secure and private communication.
[…196 words]