Using computational SSDs for vector databases
This is an idea proposed in 2025 as a Cambridge Computer Science Part III or MPhil project, and is available to be worked on. It will be supervised by Anil Madhavapeddy and Sadiq Jaffer.
Large pre-trained models can be used to embed media/documents into concise vector representations with the property that vectors that are "close" to each other are semantically related. ANN (Approximate Nearest Neighbour) search on these embeddings is used heavily already in RAG systems for LLMs or search-by-example for satellite imagery.
Right now, most ANN databases rely almost exclusively on memory-resident indexes to accelerate this searching. This is a showstopper for larger datasets, such as the terabytes of PDFs we have for our big evidence synthesis project, each of which generates dozens of embeddings. For global satellite datasets for remote sensing of nature at 10m scale, this is easily petabytes per year (the raw data here would need to come from tape drives).
The project idea is that computational storage devices can add compute (via FPGAs) to the SSD controller and let us compute on the data before it reaches main memory. Binary quantisation of embedding vectors is now practical [1], so simple comparisons of these should be quite amenable to acceleration with the SSD-attached FPGA. Since we're willing to trade off searching more vectors, each SSD only needs a lightweight index (potentially a flat IVF) shard. In a big storage array, every SSD could then return the small number of original (un-quantised) embeddings that were closest to the query points, and the CPU would do a fast final reranking step [2].
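As a rough illustration (not the project's actual pipeline), the two-stage idea can be sketched in plain Python: binary-quantise vectors down to sign bits, do a cheap Hamming-distance scan over the packed codes (the part the SSD-attached FPGA would accelerate), then exactly rerank a small candidate set on the CPU. The dimensions, dataset and top-k parameters here are made up for the example:

```python
# Sketch of binary-quantise -> Hamming scan -> exact rerank.
# All data here is synthetic; in the proposed design the packed codes
# would be scanned on the SSD/FPGA and only candidates returned.
import random

DIM = 64  # embedding dimensionality (illustrative)

def quantise(vec):
    """Pack the sign bits of a float vector into a single int."""
    code = 0
    for i, x in enumerate(vec):
        if x > 0:
            code |= 1 << i
    return code

def hamming(a, b):
    """Hamming distance between two packed bit codes."""
    return bin(a ^ b).count("1")

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

random.seed(0)
db = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(1000)]
codes = [quantise(v) for v in db]
query = [random.gauss(0, 1) for _ in range(DIM)]
qcode = quantise(query)

# Stage 1 (on the SSD in the proposed design): cheap Hamming scan over
# the 64-bit codes, keeping only a small candidate set.
candidates = sorted(range(len(db)), key=lambda i: hamming(qcode, codes[i]))[:32]

# Stage 2 (on the CPU): exact reranking of the candidates' full vectors.
best = max(candidates, key=lambda i: dot(query, db[i]))
```

The appeal is the asymmetry: stage 1 touches every vector but only needs XOR and popcount over 8 bytes per vector, while stage 2 does full-precision arithmetic on a handful of survivors.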
Our hypothesis is that we could scale vector database size just by adding more SSDs, gaining both storage capacity and aggregate disk throughput. There are risks to overcome, though: if the FPGAs on the SSD controllers don't have enough compute to keep up with the full SSD bandwidth, or we can't discard a large enough fraction of vectors via the on-disk index, then we're memory bound without much gain. A key part of the solution is balancing memory vs SSD bandwidth carefully via some autotuning (e.g. a 4TB SSD shard with 9GB/s of maximum bandwidth would need to discard 99.9% of the on-disk indexed vectors to get sub-second response times).
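The back-of-envelope in that parenthesis can be written as a tiny (hypothetical) autotuning helper: given a shard size, an SSD's sequential bandwidth and a latency target, compute the fraction of stored vectors the on-disk index must prune. The figures below are the ones from the text:

```python
# Hypothetical helper: how much of the on-disk data must the index
# discard so the surviving bytes can be read within the latency budget?
def required_discard_fraction(shard_bytes, bandwidth_bytes_per_s, target_latency_s):
    readable = bandwidth_bytes_per_s * target_latency_s
    if readable >= shard_bytes:
        return 0.0  # the whole shard fits within the time budget
    return 1.0 - readable / shard_bytes

TB = 1e12
GB = 1e9

# 4TB shard, 9GB/s bandwidth, 1-second response target:
frac = required_discard_fraction(4 * TB, 9 * GB, 1.0)
print(f"{frac:.3%}")  # ~99.8%, the same ballpark as the ~99.9% quoted above
```

The same arithmetic run in the other direction (given an achievable pruning rate, what latency follows) is what the autotuner would use to balance shard size against index selectivity.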
But if the experiment does succeed, we could get real-time sub-second response times on massive datasets, which would be a game changer for interactive exploration of huge datasets. A student more interested in the programming-interface side may also wish to look over my OCaml FPGA notes.
Related News
- Programming FPGAs using OCaml / Feb 2025
- Conservation Evidence Copilots / Jan 2024
- Remote Sensing of Nature / Jan 2023
- Planetary Computing / Jan 2022