Using computational SSDs for vector databases
This is an idea proposed in 2025 as a Cambridge Computer Science Part III or MPhil project, and is available to be worked on. It will be supervised by Anil Madhavapeddy and Sadiq Jaffer.
Large pre-trained models can be used to embed media/documents into concise vector representations with the property that vectors that are "close" to each other are semantically related. ANN (Approximate Nearest Neighbour) search on these embeddings is used heavily already in RAG systems for LLMs or search-by-example for satellite imagery.
Right now, most ANN databases rely almost exclusively on memory-resident indexes to accelerate this searching. This is a showstopper for larger datasets, such as the terabytes of PDFs we have for our big evidence synthesis project, each of which generates dozens of embeddings. For global satellite datasets for remote sensing of nature at 10m scale, this is easily petabytes per year (the raw data here would need to come from tape drives).
The project idea is that computational storage devices can add compute (via FPGAs) to the SSD controller and let us compute on the data before it reaches main memory. Binary quantisation of embedding vectors is now practical [1], so simple comparisons of these should be quite amenable to acceleration with the SSD-attached FPGA. Since we're willing to trade off searching more vectors, each SSD only needs a lightweight index (potentially a flat IVF) shard. In a big storage array, every SSD could then return the small number of original (un-quantised) embeddings that were closest to the query points, and the CPU would do a fast final reranking step [2].
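As a rough illustration (not the project's actual pipeline), the two-stage idea can be sketched in plain Python: binary-quantise vectors down to sign bits, do a cheap Hamming-distance scan over the packed codes (the part the SSD-attached FPGA would accelerate), then exactly rerank a small candidate set on the CPU. The dimensions, dataset and top-k parameters here are made up for the example:

```python
# Sketch of binary-quantise -> Hamming scan -> exact rerank.
# All data here is synthetic; in the proposed design the packed codes
# would be scanned on the SSD/FPGA and only candidates returned.
import random

DIM = 64  # embedding dimensionality (illustrative)

def quantise(vec):
    """Pack the sign bits of a float vector into a single int."""
    code = 0
    for i, x in enumerate(vec):
        if x > 0:
            code |= 1 << i
    return code

def hamming(a, b):
    """Hamming distance between two packed bit codes."""
    return bin(a ^ b).count("1")

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

random.seed(0)
db = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(1000)]
codes = [quantise(v) for v in db]
query = [random.gauss(0, 1) for _ in range(DIM)]
qcode = quantise(query)

# Stage 1 (on the SSD in the proposed design): cheap Hamming scan over
# the 64-bit codes, keeping only a small candidate set.
candidates = sorted(range(len(db)), key=lambda i: hamming(qcode, codes[i]))[:32]

# Stage 2 (on the CPU): exact reranking of the candidates' full vectors.
best = max(candidates, key=lambda i: dot(query, db[i]))
```

The appeal is the asymmetry: stage 1 touches every vector but only needs XOR and popcount over 8 bytes per vector, while stage 2 does full-precision arithmetic on a handful of survivors.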
Our hypothesis is that we could scale vector database size just by adding more SSDs, gaining both storage capacity and aggregate disk throughput. There are risks to overcome, though: if the FPGAs on the SSD controllers don't have enough compute to keep up with the full SSD bandwidth, or we can't discard a large enough fraction of vectors via the on-disk index, then we're memory bound without much gain. A key part of the solution is balancing memory vs SSD bandwidth carefully via some autotuning (e.g. a 4TB SSD shard with 9GB/s of maximum bandwidth would need to discard 99.9% of the on-disk indexed vectors to get sub-second response times).
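The back-of-envelope in that parenthesis can be written as a tiny (hypothetical) autotuning helper: given a shard size, an SSD's sequential bandwidth and a latency target, compute the fraction of stored vectors the on-disk index must prune. The figures below are the ones from the text:

```python
# Hypothetical helper: how much of the on-disk data must the index
# discard so the surviving bytes can be read within the latency budget?
def required_discard_fraction(shard_bytes, bandwidth_bytes_per_s, target_latency_s):
    readable = bandwidth_bytes_per_s * target_latency_s
    if readable >= shard_bytes:
        return 0.0  # the whole shard fits within the time budget
    return 1.0 - readable / shard_bytes

TB = 1e12
GB = 1e9

# 4TB shard, 9GB/s bandwidth, 1-second response target:
frac = required_discard_fraction(4 * TB, 9 * GB, 1.0)
print(f"{frac:.3%}")  # ~99.8%, the same ballpark as the ~99.9% quoted above
```

The same arithmetic run in the other direction (given an achievable pruning rate, what latency follows) is what the autotuner would use to balance shard size against index selectivity.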
But if the experiment does succeed, we could get real-time sub-second response times on massive datasets, which would be a game changer for interactive exploration of huge datasets. A student more interested in the programming-interface side may also wish to look over my OCaml FPGA notes.
Related News
- Programming FPGAs using OCaml / Feb 2025
- Conservation Evidence Copilots / Jan 2024
- Remote Sensing of Nature / Jan 2023
- Planetary Computing / Jan 2022