Using computational SSDs for vector databases
This is an idea proposed in 2025 as a Cambridge Computer Science Part III or MPhil project, and is available for being worked on. It may be co-supervised with
Large pre-trained models can be used to embed media/documents into concise vector representations, with the property that vectors that are "close" to each other are semantically related. ANN (Approximate Nearest Neighbour) search over these embeddings is already used heavily in RAG systems for LLMs and in search-by-example over satellite imagery.
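To make the core operation concrete, here is a minimal Python sketch of nearest-neighbour search over embeddings. Everything in it is illustrative: `embed()` is a deterministic random stand-in for a real pre-trained model, and a real system would replace the brute-force scan with an ANN index.

```python
# Minimal sketch, not the project's design: embed() is a random stand-in
# for a real pre-trained model, and the brute-force scan below would be
# replaced by an ANN index (e.g. HNSW or IVF) in practice.
import zlib
import numpy as np

def embed(doc: str, dim: int = 384) -> np.ndarray:
    """Placeholder embedding: deterministic random unit vector per document."""
    rng = np.random.default_rng(zlib.crc32(doc.encode()))
    v = rng.standard_normal(dim).astype(np.float32)
    return v / np.linalg.norm(v)

corpus = ["satellite image of a forest", "scanned invoice PDF", "river delta scan"]
index = np.stack([embed(d) for d in corpus])   # (n_docs, dim) matrix

query = embed("aerial photo of woodland")      # "close" vectors are related
scores = index @ query                         # cosine similarity on unit vectors
print(corpus[int(np.argmax(scores))])          # nearest semantic neighbour
```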
Right now, most ANN databases rely almost exclusively on memory-resident indexes to accelerate this search. This is a showstopper for larger datasets, such as the terabytes of PDFs we have for our
The project idea is that computational storage devices can add compute (via FPGAs) to the SSD controller and let us compute on the data before it reaches main memory. Binary quantisation of embedding vectors is now practical (https://arxiv.org/abs/2405.12497, https://arxiv.org/abs/2106.00882), and its distance computations are cheap enough to be a plausible fit for on-controller logic, as sketched below.
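As a hedged illustration, here is the simplest sign-based variant of binary quantisation (the papers above describe more refined schemes): each float dimension becomes one bit, so a 1024-dimensional float32 vector shrinks 32x, and distance reduces to XOR plus popcount.

```python
# Hedged sketch: simple sign-based binary quantisation, not the exact
# scheme from the cited papers. XOR + popcount is the kind of kernel
# that could plausibly run in FPGA logic on the SSD controller.
import numpy as np

def binarise(v: np.ndarray) -> np.ndarray:
    """Quantise a float vector to packed bits: 1 where positive, else 0."""
    return np.packbits(v > 0)

def hamming(a: np.ndarray, b: np.ndarray) -> int:
    """Hamming distance on packed bit vectors: XOR, then count set bits."""
    return int(np.unpackbits(np.bitwise_xor(a, b)).sum())

rng = np.random.default_rng(0)
x = rng.standard_normal(1024)                        # 4 KiB as float32 ...
bx = binarise(x)                                     # ... 128 bytes as bits (32x smaller)
by = binarise(x + 0.1 * rng.standard_normal(1024))   # a nearby vector
print(hamming(bx, by))                               # small distance => similar vectors
```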
Our hypothesis is that we could scale vector database size just by adding more SSDs, gaining both storage capacity and aggregate disk throughput. There are risks to overcome, though: if the FPGAs on the SSD controllers don't have enough compute to keep up with the full SSD bandwidth, or we can't discard a large enough fraction of vectors via the on-disk index, then we're memory-bound without much gain. A key part of the solution is carefully balancing memory vs SSD bandwidth via some autotuning. For example, with 4TB per SSD shard and 9GB/s of maximum read bandwidth, a naive full scan would take over seven minutes, so we'd need to discard ~99.9% of the on-disk indexed vectors to get sub-second response times (a quick sanity check of this arithmetic is sketched below).
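A back-of-envelope version of that autotuning target, using the assumed figures from the example above; it lands just under the ~99.9% quoted, before any per-query overheads are counted.

```python
# Back-of-envelope autotuning target, using the assumed figures above.
shard_bytes    = 4e12   # 4 TB of on-disk vectors per SSD shard
bandwidth_bps  = 9e9    # ~9 GB/s sustained read bandwidth per shard
latency_budget = 1.0    # sub-second interactive target, in seconds

full_scan = shard_bytes / bandwidth_bps      # ~444 s for a naive scan
readable  = bandwidth_bps * latency_budget   # bytes touchable per query
discard   = 1 - readable / shard_bytes       # pruning the index must achieve
print(f"full scan: {full_scan:.0f}s; must discard {discard:.2%} of vectors")
# prints: full scan: 444s; must discard 99.78% of vectors
```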
But if the experiment does succeed, we could get real-time, sub-second response times on massive datasets, which would be a game changer for interactive exploration of huge datasets. A student more interested in the programming interface side may also wish to look over my