This is an idea proposed in 2024 as a Cambridge Computer Science Part III or MPhil project, and is currently being worked on by Mark Jacobsen. It is supervised by Sadiq Jaffer and Anil Madhavapeddy as part of my Conservation Evidence Copilots project.
This project explores a chunk-free approach to generating embeddings for Retrieval-Augmented Generation (RAG). Traditional RAG workflows rely on manual or predefined chunking of documents, and we seek to remove this requirement.
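For context, the sketch below shows the conventional chunk-then-embed pipeline that the project seeks to bypass. The chunk size, the embedding model name, and the input file are illustrative assumptions rather than anything specified in the proposal.

```python
# Conventional RAG preprocessing: cut the document into fixed-size windows,
# then pool each window into a single vector. Retrieval quality depends
# heavily on where the cuts happen to fall.
from sentence_transformers import SentenceTransformer

def fixed_size_chunks(text: str, chunk_words: int = 200) -> list[str]:
    """Split a document into fixed-size word windows, ignoring sentence
    and paragraph boundaries (the 'predefined chunking' step)."""
    words = text.split()
    return [" ".join(words[i:i + chunk_words])
            for i in range(0, len(words), chunk_words)]

model = SentenceTransformer("all-MiniLM-L6-v2")   # illustrative choice of model
document = open("paper.txt").read()               # hypothetical input file
chunks = fixed_size_chunks(document)
# One pooled vector per chunk is stored in the retrieval index.
chunk_embeddings = model.encode(chunks)
```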
Instead, our approach generates multiple embeddings for unchunked text, using a synthetic dataset created by (for example) a 7B-parameter LLM that produces structured, point-by-point summaries of each paragraph. An off-the-shelf embedding model could then be modified by removing its mean-pooling layer and adding cross-attention layers. Inspired by T5's encoder-decoder architecture, these layers would let a frozen set of token embeddings interact with summary-based embeddings via cross-attention, yielding a more nuanced, chunk-free representation.
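A hedged sketch of what that modified embedding model might look like is below: a pretrained encoder with its mean pooling removed, kept frozen as a token-level "document memory", plus trainable cross-attention layers (in the spirit of T5's encoder-decoder attention) through which summary embeddings attend to the unchunked text. The class name, base model, layer counts, and the final averaging step are all illustrative assumptions, not settled design decisions.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class CrossAttentionPooler(nn.Module):
    """Summary embeddings attend over frozen token embeddings of the full,
    unchunked document, producing one retrieval vector per summary."""

    def __init__(self, base_model: str = "sentence-transformers/all-MiniLM-L6-v2",
                 num_layers: int = 2, num_heads: int = 4):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(base_model)
        self.encoder = AutoModel.from_pretrained(base_model)
        # Freeze the base encoder; only the cross-attention stack is trained.
        for p in self.encoder.parameters():
            p.requires_grad = False
        dim = self.encoder.config.hidden_size
        self.cross_attn = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in range(num_layers)
        )
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(num_layers))

    def encode_tokens(self, texts: list[str]) -> tuple[torch.Tensor, torch.Tensor]:
        """Token-level hidden states with no mean pooling applied."""
        batch = self.tokenizer(texts, padding=True, truncation=True,
                               return_tensors="pt")
        with torch.no_grad():
            hidden = self.encoder(**batch).last_hidden_state
        return hidden, batch["attention_mask"]

    def forward(self, documents: list[str], summaries: list[str]) -> torch.Tensor:
        doc_tokens, doc_mask = self.encode_tokens(documents)   # (B, T_doc, D)
        query, _ = self.encode_tokens(summaries)               # (B, T_sum, D)
        key_padding = doc_mask == 0
        for attn, norm in zip(self.cross_attn, self.norms):
            attended, _ = attn(query, doc_tokens, doc_tokens,
                               key_padding_mask=key_padding)
            query = norm(query + attended)
        # Collapse the attended summary tokens into one vector per
        # (document, summary) pair; this final pooling choice is just one option.
        return query.mean(dim=1)
```

Because each paragraph yields several point-by-point summaries, a single document ends up represented by several such vectors rather than by one vector per arbitrary chunk.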
Additionally, the research aims to explore adaptive chunking driven by a trained model, allowing context-aware embeddings to be generated end-to-end. This promises a more integrated and efficient approach, eliminating the need for separate summarization and embedding steps.
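One speculative way to realise the adaptive-chunking direction is a small trainable head that scores each token position as a potential segment boundary, so segmentation can be learned jointly with the embedding objective rather than fixed in advance. The sketch below uses a hard threshold at inference time purely for illustration; the names and the thresholding scheme are assumptions, not part of the proposal.

```python
import torch
import torch.nn as nn

class BoundaryScorer(nn.Module):
    """Scores each token position as a potential chunk boundary."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, token_embeddings: torch.Tensor) -> torch.Tensor:
        """token_embeddings: (B, T, D) -> boundary probabilities (B, T)."""
        return torch.sigmoid(self.score(token_embeddings)).squeeze(-1)

def segment(token_embeddings: torch.Tensor, scorer: BoundaryScorer,
            threshold: float = 0.5) -> list[list[tuple[int, int]]]:
    """Turn boundary probabilities into (start, end) token spans per document.
    During training, the soft scores themselves would feed the embedding
    objective so the whole pipeline stays differentiable end-to-end."""
    probs = scorer(token_embeddings)
    spans = []
    for doc_probs in probs:
        cuts = (doc_probs > threshold).nonzero(as_tuple=True)[0].tolist()
        starts = [0] + [c + 1 for c in cuts]
        ends = cuts + [doc_probs.shape[0] - 1]
        spans.append(list(zip(starts, ends)))
    return spans
```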