TeleRAG: Efficient Retrieval-Augmented Generation Inference with Lookahead Retrieval

Chien-Yu Lin, Keisuke Kamahori, Yiyu Liu, Xiaoxiang Shi, Madhav Kashyap, Yile Gu, Rulin Shao, Zihao Ye, Kan Zhu, Stephanie Wang, Arvind Krishnamurthy, Rohan Kadekodi, Luis Ceze, Baris Kasikci

2025-03-03

Summary

This paper introduces TeleRAG, a system that makes retrieval-augmented generation (RAG) faster while using very little GPU memory. Its core idea is lookahead retrieval: predicting which pieces of the external datastore a query will need and moving them from CPU to GPU while the language model is still generating.

What's the problem?

RAG systems make large language models (LLMs) more accurate by letting them look up information in big external datastores. The problem is that these datastores are often too large to fit in GPU memory, so retrieval involves slow data transfers from the CPU, adding significant delay. For latency-sensitive applications, waiting on retrieval before generation can even start makes existing RAG systems too slow.

What's the solution?

The researchers built TeleRAG, whose lookahead retrieval mechanism anticipates which parts of the datastore will be needed and transfers them from CPU to GPU in parallel with LLM generation, instead of waiting for the query to finish first. By exploiting the modularity of RAG pipelines, the structure of the inverted file index (IVF) search algorithm, and similarities between queries, TeleRAG overlaps data movement with computation so the GPU spends less time idle.

Why it matters?

This matters because it makes advanced RAG applications both faster and cheaper to deploy. TeleRAG reduces end-to-end RAG inference latency by up to 1.72x on average compared to state-of-the-art systems while keeping GPU memory requirements minimal, which makes latency-sensitive RAG deployments practical even on hardware with limited GPU memory.

Abstract

Retrieval-augmented generation (RAG) extends large language models (LLMs) with external data sources to enhance factual correctness and domain coverage. Modern RAG pipelines rely on large datastores, leading to system challenges in latency-sensitive deployments, especially when limited GPU memory is available. To address these challenges, we propose TeleRAG, an efficient inference system that reduces RAG latency with minimal GPU memory requirements. The core innovation of TeleRAG is lookahead retrieval, a prefetching mechanism that anticipates required data and transfers it from CPU to GPU in parallel with LLM generation. By leveraging the modularity of RAG pipelines, the inverted file index (IVF) search algorithm and similarities between queries, TeleRAG optimally overlaps data movement and computation. Experimental results show that TeleRAG reduces end-to-end RAG inference latency by up to 1.72x on average compared to state-of-the-art systems, enabling faster, more memory-efficient deployments of advanced RAG applications.
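As a rough illustration of the lookahead-retrieval idea described above, here is a minimal Python sketch. Everything in it is hypothetical: a dictionary stands in for IVF clusters in CPU memory, a second dictionary for GPU memory, and `sleep` calls mimic transfer and generation time. The real system works with actual IVF indices, query-similarity-based prediction, and CPU-to-GPU transfers; this only shows how prefetching can overlap with generation.

```python
import threading
import time

# Hypothetical in-memory "datastore": IVF cluster id -> vectors held in CPU memory.
CPU_CLUSTERS = {i: [f"vec{i}_{j}" for j in range(3)] for i in range(8)}
GPU_CACHE = {}  # clusters already transferred to (simulated) GPU memory

def prefetch(cluster_ids):
    """Copy the predicted IVF clusters from CPU to GPU memory."""
    for cid in cluster_ids:
        time.sleep(0.01)  # simulate per-cluster transfer cost
        GPU_CACHE[cid] = CPU_CLUSTERS[cid]

def generate_query():
    """Stand-in for LLM generation that eventually produces the retrieval query."""
    time.sleep(0.05)  # generation runs concurrently with the prefetch thread
    return [2, 5]     # clusters the finished query actually probes

# Lookahead retrieval: start moving predicted clusters to the GPU while the
# LLM is still generating, then retrieve from GPU memory once the query is ready.
predicted = [2, 5, 7]  # clusters a (hypothetical) lookahead step predicted
worker = threading.Thread(target=prefetch, args=(predicted,))
worker.start()
needed = generate_query()
worker.join()
hits = [v for cid in needed for v in GPU_CACHE[cid]]
```

When the prediction is good, the clusters the query needs are already on the GPU by the time generation finishes, so the transfer cost is hidden behind generation rather than added to it.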