Anil Madhavapeddy, Professor of Planetary Computing

Generating chunk-free embeddings for LLMs

This is an idea proposed in 2024 as a Cambridge Computer Science Part III or MPhil project, and is currently being worked on by Mark Jacobsen. It is supervised by Sadiq Jaffer and Anil Madhavapeddy.

This project explores a chunk-free approach to generating embeddings for Retrieval-Augmented Generation (RAG) pipelines. Traditional RAG workflows rely on manual or predefined chunking of documents before embedding, and we seek to bypass this requirement.

Instead, our approach involves generating multiple embeddings for unchunked text, using a synthetic dataset created by (for example) a 7B-parameter LLM. This dataset would contain structured, point-by-point summaries of each paragraph. An off-the-shelf embedding model could then be modified by removing its mean-pooling layer and incorporating cross-attention layers. These layers, inspired by T5's encoder-decoder architecture, would let a frozen set of document embeddings interact with the summary-based embeddings via cross-attention, yielding a more nuanced chunk-free representation.
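As a rough illustration of how these pieces could fit together, the sketch below (PyTorch, using the Hugging Face `transformers` library) wires a frozen off-the-shelf encoder, with its mean-pooling step skipped, into a small stack of cross-attention layers in which the summary embeddings act as queries over the document's token-level states. The backbone name (`sentence-transformers/all-MiniLM-L6-v2`), class names, and hyperparameters are illustrative assumptions, not details fixed by the project.

```python
import torch
import torch.nn as nn
from transformers import AutoModel


class SummaryCrossAttention(nn.Module):
    """Summary embeddings (queries) attend over the frozen token states of
    the unchunked document (keys/values), in the spirit of T5-style
    encoder-decoder cross-attention."""

    def __init__(self, dim: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                nn.Linear(4 * dim, dim))

    def forward(self, summary_emb, doc_tokens, doc_pad_mask=None):
        attended, _ = self.attn(summary_emb, doc_tokens, doc_tokens,
                                key_padding_mask=doc_pad_mask)
        x = self.norm1(summary_emb + attended)
        return self.norm2(x + self.ff(x))


class ChunkFreeEmbedder(nn.Module):
    """Frozen off-the-shelf encoder (mean pooling removed) plus trainable
    cross-attention layers producing one embedding per summary point."""

    def __init__(self, backbone: str = "sentence-transformers/all-MiniLM-L6-v2",
                 n_layers: int = 2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(backbone)
        for p in self.encoder.parameters():   # keep the base encoder frozen
            p.requires_grad = False
        dim = self.encoder.config.hidden_size
        self.cross = nn.ModuleList(
            [SummaryCrossAttention(dim) for _ in range(n_layers)])

    def forward(self, doc_inputs, summary_inputs):
        # Token-level states of the whole, unchunked document: the usual
        # mean-pooling step of the sentence-embedding recipe is omitted.
        doc_tokens = self.encoder(**doc_inputs).last_hidden_state      # (1, Ld, D)
        summaries = self.encoder(**summary_inputs).last_hidden_state   # (S, Ls, D)
        n_points = summaries.size(0)
        doc_tokens = doc_tokens.expand(n_points, -1, -1)
        pad_mask = (doc_inputs["attention_mask"] == 0).expand(n_points, -1)
        for layer in self.cross:
            summaries = layer(summaries, doc_tokens, pad_mask)
        return summaries[:, 0]   # one embedding per synthetic summary point
```

In this sketch a single unchunked document would yield one retrieval vector per synthetic summary point, so multiple embeddings are produced without any manual splitting of the source text.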

Additionally, the research aims to explore adaptive chunking driven by a trained model, so that context-aware embeddings can be generated end-to-end. This promises a more integrated and efficient approach, eliminating the need for separate summarization and embedding processes.
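One way to picture the adaptive-chunking idea, purely as an assumption about how it might be realized, is a small trainable boundary predictor over the token states: each token gets a binary boundary decision, and a straight-through estimator keeps that hard decision trainable end-to-end. The sketch below is speculative and all names are hypothetical.

```python
import torch
import torch.nn as nn


class BoundaryPredictor(nn.Module):
    """Scores each token as a potential chunk boundary; a straight-through
    estimator keeps the hard split decision trainable end-to-end."""

    def __init__(self, dim: int):
        super().__init__()
        self.scorer = nn.Sequential(nn.Linear(dim, dim // 2), nn.GELU(),
                                    nn.Linear(dim // 2, 1))

    def forward(self, token_states):                                  # (B, L, D)
        probs = torch.sigmoid(self.scorer(token_states)).squeeze(-1)  # (B, L)
        hard = (probs > 0.5).float()
        # Forward pass sees hard 0/1 boundaries; the backward pass uses the
        # gradient of the soft probabilities (straight-through trick).
        return hard + probs - probs.detach()


def embeddings_per_chunk(token_states, boundaries):
    """Inference-time helper: average token states between predicted
    boundaries to get one embedding per learned chunk (batch size 1)."""
    states, marks = token_states[0], boundaries[0]
    cuts = (marks > 0.5).nonzero(as_tuple=True)[0].tolist()
    chunks, start = [], 0
    for cut in cuts + [states.size(0)]:
        if cut > start:
            chunks.append(states[start:cut].mean(dim=0))
            start = cut
    return torch.stack(chunks) if chunks else states.mean(dim=0, keepdim=True)
```

The straight-through trick is only one of several ways to keep a discrete chunking decision differentiable; the actual mechanism is an open design question for the project.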

1st Jan 2024 · ideas ai idea-hard idea-ongoing llms
