ZFS replication strategies with encryption / Jun 2025
This is an idea proposed as a good starter project, and is currently being worked on by Becky Terefe-Zenebe. It is co-supervised with Mark Elvers.
We are using ZFS in much of our Planetary Computing infrastructure due to its ease of remote replication. Therefore, its performance characteristics when used as a local filesystem are particularly interesting. Some questions that we need to answer about our uses of ZFS are:
- We intend to have encrypted remote backups in several locations, but only a few of those hosts should have keys and the rest should use raw ZFS send streams.
- Does encryption add a significant overhead when used locally?
- Is replication faster if the source and target are both encrypted vs a raw send?
- We would typically have a snapshot schedule, such as hourly snapshots with a retention of 48 hours, daily snapshots with a retention of 14 days, and weekly snapshots with a retention of 8 weeks. As these snapshots build up over time, is there a performance degradation?
- Should we minimise the number of snapshots held locally, as this would allow faster purging of deleted files?
- How does ZFS send/receive compare to a peer-to-peer backup solution like Borg Backup, given that it gives a free choice of source and target backup file system and supports encryption?
- ZFS should have the advantage of knowing which blocks have changed between two backups, but potentially, this adds an overhead to day-to-day use.
- On the other hand, ZFS replicas can be brought online much more quickly, whereas Borg backup files need to be reconstructed into a usable filesystem.
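To make the snapshot schedule above concrete, here is a minimal Python sketch of the retention rule (hourly for 48 hours, daily for 14 days, weekly for 8 weeks). The function name `snapshots_to_keep` and the choice to keep the newest snapshot in each hour, day, or ISO week are our assumptions for illustration. The sketch only reasons about timestamps and does not call ZFS.

```python
from datetime import datetime, timedelta


def snapshots_to_keep(snapshots, now):
    """Return the set of snapshot timestamps that the schedule retains.

    Assumption: within each tier we keep the newest snapshot per bucket
    (per hour, per calendar day, per ISO week) that is not older than
    the tier's maximum age.
    """
    tiers = [
        # (function mapping a timestamp to its bucket key, maximum age)
        (lambda t: t.strftime("%Y-%m-%d %H"), timedelta(hours=48)),  # hourly
        (lambda t: t.strftime("%Y-%m-%d"), timedelta(days=14)),      # daily
        (lambda t: tuple(t.isocalendar()[:2]), timedelta(weeks=8)),  # weekly
    ]
    keep = set()
    for bucket_of, max_age in tiers:
        newest_in_bucket = {}
        for snap in snapshots:
            if now - snap > max_age:
                continue  # too old for this tier
            key = bucket_of(snap)
            if key not in newest_in_bucket or snap > newest_in_bucket[key]:
                newest_in_bucket[key] = snap
        keep.update(newest_in_bucket.values())
    return keep
```

Anything outside the returned set would be a candidate for destruction, which gives one way to estimate how many snapshots accumulate locally under a given schedule.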
Validating predictions with ranger insights to enhance anti-poaching patrol strategies in protected areas / Jun 2025
This is an idea proposed as a good starter project, and is currently being worked on by Hannah McLoone. It is co-supervised with Charles Emogor and Rob Fletcher.
Biodiversity is declining at an unprecedented rate, underscoring the critical role of protected areas (PAs) in conserving threatened species and ecosystems. Yet, many of these are increasingly dismissed as "paper parks" due to poor management. Park rangers play a vital role in PA effectiveness by detecting and potentially deterring illegal activities. However, limited funding for PA management has led to low patrol frequency and detection rates, reducing the overall deterrent effect of ranger efforts. This resource scarcity often results in non-systematic patrol strategies, which are sub-optimal given that illegal hunters tend to be selective in where and when they operate.
The situation is poised to become more challenging as countries expand PA coverage under the Kunming-Montreal Global Biodiversity Framework—aiming to increase global PA area from 123 million km2 to 153 million km2 by 2030. Without a substantial boost in enforcement capacity, both existing and newly designated PAs will remain vulnerable. Continued overexploitation of wildlife threatens not only species survival but also ecosystem integrity and the well-being of local communities who rely on wildlife for food and income.
This project aims to combine data from rangers in multiple African protected areas and from hunters around a single protected area in Nigeria. The goals are to improve the deterrence effect of ranger patrols by optimising ranger efforts, and to provide information on the economic impacts of improved ranger patrols on community livelihoods and well-being. We plan to deploy our models to rangers in the field via SMART, which is used in more than 1000 PAs globally to facilitate monitoring and data collection during patrols.
The two main aims are to:
- develop an accessibility layer using long-term ranger-collected data
- validate the results of this layer, as well as those from other models developed, using ranger insights.
Mapping urban and rural British hedgehogs / Jun 2025
This is an idea proposed as a good starter project, and is currently being worked on by Gabriel Mahler. It is co-supervised with Silviu Petrovan.
The National Hedgehog Monitoring Programme aims to provide robust population estimates for the beloved hedgehog.
Despite being the nation’s favourite mammal, there's a lot more to learn about hedgehog populations across the country. We do know that, although urban populations are faring better than their rural counterparts, overall hedgehogs are declining across Britain, so much so that they’re now categorised as Vulnerable to extinction. -- NHMP
The People's Trust for Endangered Species has been coordinating the programme. For the purposes of this project, we have access to:
- GPS data from over 100 tagged hedgehogs collected by Lauren Moore during her PhD to build predictive movement models.
- OpenStreetMap data about where hedgehogs probably shouldn't be (e.g. middle of a road) to help with species distribution modelling
- PTES also run the Hedgehog Street program which has the mapped locations of hedgehog highways across the UK to assess how effective they are.
- A new high-res map of the UK's hedgerows and stonewalls from Google DeepMind and Drew Purves.
Our initial efforts in the summer of 2025 will be to put together a high-res map of UK hedgehog habitats, specifically brambles and likely urban habitats. Once that works, the plan is to apply some spatially explicit modelling, still focussing on the UK. This will involve an exciting collaboration with PTES, who I'm looking forward to meeting!
Low power audio transcription with Whisper / Jun 2025
This is an idea proposed as a good starter project, and is currently being worked on by Dan Kvit. It is co-supervised with Josh Millar.
The rise of batteryless energy-harvesting platforms could enable ultra-low-power, long-term, maintenance-free deployments of sensors.
This project explores the deployment of the OpenAI Whisper audio transcription model onto embedded devices, starting with the Raspberry Pi and moving on to smaller devices.
Habitat mapping of the Cairngorms Connect restoration area / Jun 2025
This is an idea proposed as a good starter project, and is currently being worked on by Isabel Mansley. It is co-supervised with David Coomes and Aland Chan.
Cairngorms Connect is the largest landscape restoration project in the UK. Four landowners (RSPB, Wildlands Ltd, FLS, and NatureScot) embarked on a 200-year vision to restore over 600 km2 of land in the Cairngorms National Park with an emphasis on natural processes.
In July 2023, the Centre for Landscape Regeneration commissioned a flight over a 400 km2 stretch of the area, collecting both high resolution RGBI imagery (0.1m ground resolution) and LiDAR data. Various research projects were built on this dataset, including studies into carbon cycling, shrub ecology, tree regeneration, and deadwood detection.
Existing habitat maps of the area are based on Sentinel 2 satellite data at a ground resolution of 10m. While this dataset provides a good basis for some research objectives, a habitat map that leverages the high resolution of the aerial imagery could capture fine-scale variations in habitat structure more accurately. This project involves applying new developments in geospatial machine learning (specifically Tessera, developed locally in Cambridge) to achieve this.
Evaluating a human-in-the-loop AI framework to improve inclusion criteria for evidence synthesis / Jun 2025
This is an idea proposed as a good starter project, and is currently being worked on by Radhika Agrawal. It is co-supervised with Alec Christie and Sadiq Jaffer.
Whenever we do evidence synthesis (especially for conservation outcomes) to distil the world's scientific literature into actionable insights, we have to decide on what published studies we will include or exclude, and why they are categorised as such. This can be a challenging process, and sometimes inclusion criteria may not be very reproducible or clearly defined, leading to confusion between reviewers and more time-consuming reviews.
In AI-assisted review methods, we are increasingly finding that LLMs may interpret inclusion criteria differently to human reviewers, potentially because human experts may implicitly assume certain things that are not obvious to those working outside the review team (or interpret things differently to fellow reviewers). We trialled an informal process earlier this year to iterate over the inclusion/exclusion criteria for an evidence synthesis using synthetic studies that represent "edge cases", whereby it is difficult to agree on whether they should be in or out. Through back-and-forth with an LLM, human reviewers were able to refine and improve their inclusion criteria.
This project will build on this work to develop a prototype, open-source tool that enables users to refine their inclusion criteria with the help of an LLM chatbot. This will be extremely useful for anyone conducting any type of evidence synthesis and so has great potential to be an impactful project beyond "just" the field of conservation.
Evaluating LLMs for providing evidence-based information on conservation actions / Jun 2025
This is an idea proposed as a good starter project, and is currently being worked on by Alex Wang. It is co-supervised with Alec Christie and Sadiq Jaffer.
We are building a Conservation Co-Pilot to improve worldwide conservation action through evidence-driven insights. Biodiversity loss is one of the biggest threats to our planet and to tackle it, we must improve the effectiveness of conservation action, which currently falls short of its full potential. This is because conservationists typically find it hard to access locally relevant evidence on what works to conserve biodiversity as research knowledge is not translated quickly or accessibly enough into policy and practice. We therefore need to accelerate the transfer of relevant, reliable evidence to decision-makers using more intuitive and interactive interfaces.
This project will use the comprehensive Conservation Evidence database (holding 8600 studies that have quantitatively tested 3600 actions) to evaluate the ability of a Mixture of Agents (MOA) approach, and/or individual LLMs, to provide rigorous evidence-based answers to priority questions from real conservationists.
This will extend our previous work that found that LLMs coupled with a hybrid retrieval strategy can answer multiple choice conservation questions as well as human experts. This will enable us to develop a "Conservation Co-Pilot" that can handle complex and nuanced questions from different users.
Using graph theory to define data-driven ecoregion and bioregion maps / Apr 2025
This is an idea proposed as a good starter project, and is available for being worked on. It may be co-supervised with Daniele Baisero and Michael Dales.
Maps of biologically driven regionalization (e.g. ecoregions and bioregions) are useful in conservation science and policy as they help identify areas with similar ecological characteristics, allowing for more targeted, efficient, and ecosystem-specific management strategies. These regions provide a framework for prioritizing conservation efforts, monitoring biodiversity, and aligning policies across political boundaries based on ecological realities rather than arbitrary lines. However, these products have historically been "hand drawn" by experts and are mostly based on plant distribution data only.
[…270 words]
Runtimes à la carte: crossloading native and bytecode OCaml / Apr 2025
This is an idea proposed as a good starter project, and is currently being worked on by Jeremy Chen. It is co-supervised with David Allsopp.
In 1998, Fabrice Le Fessant released Efuns ("Emacs for Functions"), an implementation of an Emacs-like editor written entirely in OCaml, which included a library for loading bytecode within native code programs[^1].
This was nearly a decade before OCaml 3.11 would introduce Alain Frisch's native Dynlink support to OCaml. Natdynlink means that this original work has been largely forgotten, but there remain two interesting applications for being able to "cross-load" code compiled for the OCaml bytecode runtime in an OCaml native code application and vice versa:
- Native code OCaml applications could use OCaml as a scripting language without needing to include an assembler toolchain or solutions such as ocaml-jit.
- The existing bytecode REPL could use OCaml natdynlink plugins (.cmxs files) directly, allowing more dynamic programming and exploration of high-performance libraries with the ease of the bytecode interpreter, but retaining the runtime performance of the libraries themselves.
Effects based scheduling for the OCaml compiler pipeline / Apr 2025
This is an idea proposed as a good starter project, and is currently being worked on by Lucas Ma. It is co-supervised with David Allsopp.
In order to compile the OCaml program foo.ml containing:
Stdlib.print_endline "Hello, world"
the OCaml compilers only require the compiled stdlib.cmi interface to exist in order to determine the type of Stdlib.print_endline. This separate compilation technique allows modules of code to be compiled before the code they depend on has necessarily been compiled. When OCaml was first written, this technique was critical to reduce recompilation times. As CPU core counts increased through the late nineties and early 2000s, separate compilation also provided a parallelisation benefit: modules which did not depend on each other could be compiled at the same time, benefitting compilation as well as recompilation.
For OCaml, as in many programming languages, the compilation of large code bases is handled by a separate build system (for example, dune, make or ocamlbuild) with the compiler driver (ocamlc or ocamlopt) being invoked by that build system as required. In this project, we'll investigate how to get the OCaml compiler itself to be responsible for exploiting available parallelism.
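As a rough illustration of the parallelism that a build system exploits today, the Python sketch below groups modules into "waves" in which no module depends on another, so each wave could be compiled concurrently. The module names and the `compilation_waves` helper are hypothetical. The sketch shows plain dependency-ordered scheduling, not the effects-based approach this project investigates.

```python
from graphlib import TopologicalSorter


def compilation_waves(deps):
    """Group modules into waves that can each be compiled in parallel.

    `deps` maps a module name to the set of modules it depends on.
    Every module in a wave has all of its dependencies in earlier waves.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything compilable right now
        waves.append(ready)
        sorter.done(*ready)  # mark the wave as compiled, unlocking dependants
    return waves


# Hypothetical project layout, for illustration only.
example_deps = {
    "main": {"parser", "printer"},
    "parser": {"ast"},
    "printer": {"ast"},
    "ast": set(),
}
example_waves = compilation_waves(example_deps)
```

Here `parser` and `printer` land in the same wave because neither depends on the other, which is the situation described above where independent modules can be compiled at the same time.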
Bidirectional Hazel to OCaml programming / Apr 2025
This is an idea proposed as a good starter project, and is currently being worked on by Max Carroll. It is co-supervised with Patrick Ferris and Cyrus Omar.
Hazel is a pure subset of OCaml with a live functional programming environment that is able to typecheck, manipulate, and even run incomplete programs. As a pure language with no effects, Hazel is a great choice for domains such as configuration languages where some control flow is needed, but not the full power of a general purpose programming language. On the other hand, Hazel currently only has an interpreter and so is fairly slow to evaluate compared to a full programming language such as OCaml.
[…277 words]
Battery-free wildlife monitoring with Riotee / Apr 2025
This is an idea proposed as a good starter project, and is currently being worked on by Dominico Parish. It is co-supervised with Josh Millar.
Monitoring wildlife in the field today relies heavily on battery-powered devices, like GPS collars or acoustic recorders. However, such devices are often deployed in remote environments, where battery replacement and data retrieval can be labour-intensive and time-consuming. Moving away from battery-powered field devices could radically reduce the environmental footprint and labour cost of wildlife monitoring. The rise of batteryless energy-harvesting platforms could enable ultra-low-power, long-term, maintenance-free deployments. However, existing battery-less devices are severely constrained, often unable to perform meaningful on-device computation such as ML inference or high-frequency audio capture.
This project explores the development of next-generation, battery-less wildlife monitoring platforms using Riotee, an open-source platform purpose-built for intermittent computing. Riotee integrates energy harvesting with a powerful Cortex-M4 MCU and full SDK for managing state-saving, redundancy, and graceful resume from power failures.
[…273 words]
Autoscaling geospatial computation with Python and Yirgacheffe / Apr 2025
This is an idea proposed as a good starter project, and is available for being worked on. It may be co-supervised with Michael Dales.
Python is a popular tool for geospatial data-science, but it, along with the GDAL library, handles resource management poorly. Python does not deal with parallelism well and GDAL can be a memory hog when parallelised. Geospatial workloads -- working on global maps at metre-level resolutions -- can easily exceed the resources available on a given host when run using conventional schedulers.
To that end, we've been building Yirgacheffe, a geospatial library for Python that attempts both to hide the tedious parts of geospatial work (aligning different data sources, for instance) and to tackle the resource management issues, so that ecologists don't also have to become computer scientists to scale their work. Yirgacheffe can:
- chunk data in memory automatically, to avoid common issues around memory overcommitment
- perform limited forms of parallelism to use multiple cores.
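The chunking idea in the list above can be sketched in plain Python. This is not Yirgacheffe's API: the helper names `iter_row_chunks` and `chunked_sum`, and the `read_rows` callback standing in for a raster reader, are assumptions made purely for illustration.

```python
def iter_row_chunks(height, rows_per_chunk):
    """Yield (start_row, row_count) windows that together cover the raster."""
    for start in range(0, height, rows_per_chunk):
        yield start, min(rows_per_chunk, height - start)


def chunked_sum(read_rows, height, rows_per_chunk):
    """Sum every pixel while holding only one band of rows in memory."""
    total = 0.0
    for start, count in iter_row_chunks(height, rows_per_chunk):
        block = read_rows(start, count)  # a list of rows, each a list of floats
        total += sum(sum(row) for row in block)
    return total


# A tiny in-memory stand-in for a raster file (hypothetical data).
raster = [[float(x + y) for x in range(4)] for y in range(10)]


def read_rows(start, count):
    """Pretend to read `count` rows from disk, starting at row `start`."""
    return raster[start:start + count]


total = chunked_sum(read_rows, height=len(raster), rows_per_chunk=3)
```

Peak memory is bounded by `rows_per_chunk` rather than by the size of the whole map, which is the property that matters for global, metre-resolution layers.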
Yirgacheffe has been deployed in multiple geospatial pipelines, underpinning work like Mapping LIFE on Earth, as well as an implementation of the IUCN STAR metric, and a methodology for assessing tropical forest interventions.
[…453 words]
An access library for the world crop, food production and consumption datasets / Apr 2025
This is an idea proposed as a good starter project, and is available for being worked on. It may be co-supervised with Alison Eyres and Thomas Ball.
Agricultural habitat degradation is a leading threat to global biodiversity. To make informed decisions, it's crucial to understand the biodiversity impacts of various foods, their origins, and potential mitigation strategies. Insights can drive actions from national policies to individual dietary choices. Key factors include knowing where crops are grown, their yields, and food sourcing by country.
The FAOSTAT trade data offers comprehensive import and export records since 1986, but its raw form is complex and includes double counting, which makes it hard to link production to consumption.
[…372 words]
3D printing the planet (or bits of it) / Apr 2025
This is an idea proposed as a good starter project, and is currently being worked on by Finley Stirk. It is co-supervised with Michael Dales.
Thanks to a combination of satellite information, remote sensors and data-science, we are now able to reason about places all over the globe from the comfort of our desks and offices. But sometimes, you just want to be able to see or touch an area to understand it properly: the flat 2D-projection on a screen doesn't necessarily reveal the subtle geography of a landscape, and data locked into a computer feels less immediate than even a physical model of the same area.
In recent work, Michael Dales has experimented with making 3D-printed models of surface terrain to make some areas of study more relatable. By combining high resolution Digital Elevation Models (DEMs) and CAD software, we were able to scale and print a section of a Swedish forest used to observe moose migrations.
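As a rough sketch of the DEM-to-model step, the Python below turns a small grid of elevations into surface triangles and writes them as ASCII STL text. It is not the workflow used for the printed model: the function names, `cell_size`, and `z_scale` are illustrative assumptions, and the output is only the top surface, so a printable solid would still need a base and side walls.

```python
def heightmap_to_triangles(heights, cell_size=1.0, z_scale=1.0):
    """Turn a grid of elevations into surface triangles, two per grid cell."""
    rows, cols = len(heights), len(heights[0])

    def vertex(r, c):
        # x follows columns, y follows rows, z is the scaled elevation.
        return (c * cell_size, r * cell_size, heights[r][c] * z_scale)

    triangles = []
    for r in range(rows - 1):
        for c in range(cols - 1):
            a, b = vertex(r, c), vertex(r, c + 1)
            d, e = vertex(r + 1, c), vertex(r + 1, c + 1)
            # Counter-clockwise winding when viewed from above.
            triangles.append((a, b, e))
            triangles.append((a, e, d))
    return triangles


def to_ascii_stl(triangles, name="terrain"):
    """Serialise triangles as ASCII STL text.

    Normals are written as zero vectors, on the assumption that the
    consuming tool recomputes them from the vertex order.
    """
    lines = [f"solid {name}"]
    for tri in triangles:
        lines.append("  facet normal 0 0 0")
        lines.append("    outer loop")
        for x, y, z in tri:
            lines.append(f"      vertex {x} {y} {z}")
        lines.append("    endloop")
        lines.append("  endfacet")
    lines.append(f"endsolid {name}")
    return "\n".join(lines)
```

Exaggerating `z_scale` is a common way to make gentle terrain readable at print scale, at the cost of no longer being true to proportion.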