This project proposes building a hybrid graph-vector database in OCaml.
Retrieval-Augmented Generation (RAG) grounds LLM responses in external data. The naive approach to implementing RAG is to embed relevant documents with a pre-trained embedding model, store them in a vector database, and then return the top k closest documents to an LLM's embedded query during a generation step.
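As an illustration of the naive retrieval step (not code from the proposal), here is a minimal OCaml sketch of brute-force top-k search by cosine similarity. The names `cosine` and `top_k` are hypothetical, embeddings are assumed to be equal-length float arrays, and a real vector database would normally use an approximate index rather than a full scan.

```ocaml
(* Dot product of two float arrays, assumed to have equal length. *)
let dot a b =
  let s = ref 0.0 in
  Array.iteri (fun i x -> s := !s +. (x *. b.(i))) a;
  !s

(* Cosine similarity; defined as 0 when either vector has zero norm. *)
let cosine a b =
  let na = sqrt (dot a a) and nb = sqrt (dot b b) in
  if na = 0.0 || nb = 0.0 then 0.0 else dot a b /. (na *. nb)

(* Return the ids of the k documents closest to [query], best first.
   [docs] is a list of (id, embedding) pairs scanned exhaustively. *)
let top_k k query docs =
  docs
  |> List.map (fun (id, v) -> (id, cosine query v))
  |> List.sort (fun (_, s1) (_, s2) -> compare s2 s1)
  |> List.filteri (fun i _ -> i < k)
  |> List.map fst
```

For example, with documents embedded as `[|1.0; 0.0|]`, `[|0.0; 1.0|]` and `[|0.9; 0.1|]`, a query of `[|1.0; 0.0|]` with `k = 2` ranks the first and third documents ahead of the second.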
This approach is often insufficient for high-quality LLM responses, as dense retrieval (via embedding closeness) has been shown to underperform more traditional keyword-based sparse retrieval (e.g. BM25) on several BEIR datasets. In practice, merging sparse and dense retrieval results improves recall and downstream RAG accuracy over using either in isolation.
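The text does not say how the two result lists are merged. One common choice is reciprocal rank fusion (RRF), sketched below in OCaml as an assumption rather than as the proposal's method. The function name `rrf` is hypothetical, and the constant `c = 60` is a conventional default, not a value from the source.

```ocaml
(* Reciprocal rank fusion: every ranking contributes 1 / (c + rank) to the
   score of each id it contains, with ranks starting at 1. Ids are returned
   by descending fused score; ties are broken by id for determinism. *)
let rrf ?(c = 60.0) rankings =
  let scores = Hashtbl.create 16 in
  List.iter
    (fun ranking ->
      List.iteri
        (fun i id ->
          let prev = Option.value (Hashtbl.find_opt scores id) ~default:0.0 in
          Hashtbl.replace scores id (prev +. (1.0 /. (c +. float_of_int (i + 1)))))
        ranking)
    rankings;
  Hashtbl.fold (fun id s acc -> (id, s) :: acc) scores []
  |> List.sort (fun (id1, s1) (id2, s2) ->
         match compare s2 s1 with 0 -> compare id1 id2 | n -> n)
  |> List.map fst
```

Called as `rrf [sparse_ranking; dense_ranking]`, this favours documents that appear near the top of both lists, while still keeping documents that only one retriever found.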
Recent hybrid sparse/dense retrieval systems such as GraphRAG have further demonstrated the value of graph structure in the sparse retrieval component. By carefully setting up a knowledge graph to expose semantically meaningful edges between entities, models can be augmented with much more powerful search and retrieval capabilities over embedded external data. However, most such systems are text-only, built as offline batch processes, and slow to update. Their indices are typically reconstructed in large jobs rather than incrementally maintained, and they rarely target lightweight, embeddable deployments. Likewise, general-purpose graph databases with vector add-ons tend to have runtime overhead and operational complexity that is undesirable for a small, embedded engine.
Over the course of the project, I will produce a single-machine graph-vector database in OCaml. The graph store will behave as a conventional graph database, allowing CRUD operations on nodes, edges, and their associated types and metadata. The vector addition will allow the user to link a number of labelled vectors to a node or edge, and perform semantic search queries over the nodes and edges using the vectors. The graph store will be built on top of LMDB. --olifog, Part II project proposal, Nov 2025
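To make the proposed data model concrete, here is a purely illustrative in-memory OCaml sketch of nodes carrying labelled vectors, typed edges, and a nearest-node query. Every type and function name is hypothetical and is not the API of the finished project; storage is a plain `Hashtbl` rather than LMDB, vectors are attached to nodes only, and the search is a linear scan.

```ocaml
(* Hypothetical in-memory model; the proposal targets LMDB for storage. *)
type node = {
  id : int;
  node_type : string;
  vectors : (string * float array) list; (* (label, embedding) pairs *)
}

type edge = { src : int; dst : int; edge_type : string }
type graph = { nodes : (int, node) Hashtbl.t; mutable edges : edge list }

let create () = { nodes = Hashtbl.create 16; edges = [] }

let add_node g id node_type =
  Hashtbl.replace g.nodes id { id; node_type; vectors = [] }

let add_edge g src dst edge_type =
  g.edges <- { src; dst; edge_type } :: g.edges

(* Attach a labelled vector to an existing node; raises Not_found otherwise. *)
let attach_vector g id label v =
  let n = Hashtbl.find g.nodes id in
  Hashtbl.replace g.nodes id { n with vectors = (label, v) :: n.vectors }

(* Ids reachable from [id] over one outgoing edge. *)
let neighbours g id =
  List.filter_map (fun e -> if e.src = id then Some e.dst else None) g.edges

(* Cosine similarity of equal-length arrays; 0 if either norm is zero. *)
let cosine a b =
  let dot x y =
    let s = ref 0.0 in
    Array.iteri (fun i xi -> s := !s +. (xi *. y.(i))) x;
    !s
  in
  let na = sqrt (dot a a) and nb = sqrt (dot b b) in
  if na = 0.0 || nb = 0.0 then 0.0 else dot a b /. (na *. nb)

(* Id of the node whose vector under [label] is closest to [query]. *)
let nearest g label query =
  Hashtbl.fold
    (fun id n best ->
      match List.assoc_opt label n.vectors with
      | None -> best
      | Some v -> (
          let s = cosine query v in
          match best with
          | Some (_, bs) when bs >= s -> best
          | _ -> Some (id, s)))
    g.nodes None
  |> Option.map fst
```

A hybrid query in this sketch would call `nearest` to find an entry node and then `neighbours` to expand along graph edges from it.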
Oliver completed this project using OxCaml, with the full source code available at https://github.com/olifog/gvecdb-ocaml. He even got most of arXiv embedded and visualised using his project!


