On the Theoretical Limitations of Embedding-Based Retrieval
Orion Weller, Michael Boratko, Iftekhar Naim, Jinhyuk Lee
2025-09-03
Summary
This paper investigates a fundamental limitation of how vector embeddings work, one that persists even when the embeddings come from strong models and are used for seemingly simple tasks.
What's the problem?
Vector embeddings are now used for a wide range of tasks, from retrieving relevant documents to supporting reasoning, instruction following, and code search, but there is a hidden constraint. The paper shows that the number of distinct sets of relevant documents an embedding model can return is capped by the dimension of the embedding itself. This is not something more data or bigger models can fix; the limit is built into the single-vector design. The authors argue that even very simple queries can hit this ceiling, in which case the model cannot return all of the correct documents, as the toy example below illustrates.
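To make the claim concrete, here is a minimal sketch (not from the paper) of the geometric intuition: with dot-product retrieval and 1-dimensional embeddings for three documents, only two of the three possible top-2 result sets are ever achievable, no matter which query vector is used. The document positions and the brute-force sweep over queries are illustrative assumptions.

```python
# Toy check: with 1-D embeddings for three documents, one of the three possible
# top-2 answer sets can never be produced under dot-product scoring.
import numpy as np

docs = np.array([1.0, 2.0, 3.0])             # 1-D document embeddings
achievable = set()
for q in np.linspace(-5, 5, 2001):            # sweep scalar query embeddings
    if q == 0:
        continue                               # skip the degenerate all-ties query
    top2 = tuple(sorted(np.argsort(-q * docs)[:2]))
    achievable.add(top2)

print(achievable)   # expected: {(0, 1), (1, 2)} -- the pair {0, 2} never appears
```

Any positive query ranks the two largest documents first, and any negative query ranks the two smallest first, so the "outer" pair {0, 2} is unreachable; higher dimensions raise this ceiling but never remove it.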
What's the solution?
The researchers connected the problem to known results in learning theory, proving that the number of distinct top-k document sets an embedding model can return for any query is bounded by the embedding dimension. They then built a deliberately simple dataset, called LIMIT, that stress tests models against exactly this bound, and found that even state-of-the-art embedding models struggled on it despite the straightforward queries. As a best-case check, they also optimized free, unconstrained embeddings directly on the test set, and even that did not fully overcome the limitation; a sketch of that experiment appears below.
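The following is a minimal sketch of the free-embedding stress test, assuming a PyTorch setup: the query and document vectors are unconstrained parameters trained directly against the target relevance pattern, so the embedding dimension is the only remaining bottleneck. The corpus size, dimension, loss, and optimizer settings are illustrative assumptions, not the paper's exact configuration.

```python
# Free-embedding sketch: can d-dimensional vectors realize every pair of documents
# as some query's exact top-2? Train unconstrained vectors and check.
import itertools
import torch

n_docs, d = 8, 4
pairs = list(itertools.combinations(range(n_docs), 2))   # every pair must be some query's top-2
n_queries = len(pairs)

Q = torch.nn.Parameter(0.1 * torch.randn(n_queries, d))  # one free query vector per target pair
D = torch.nn.Parameter(0.1 * torch.randn(n_docs, d))     # free document vectors
opt = torch.optim.Adam([Q, D], lr=0.05)

targets = torch.zeros(n_queries, n_docs)
for qi, (a, b) in enumerate(pairs):
    targets[qi, a] = targets[qi, b] = 1.0

for step in range(2000):
    scores = Q @ D.T
    # binary cross-entropy as a surrogate for "relevant docs should outscore the rest"
    loss = torch.nn.functional.binary_cross_entropy_with_logits(scores, targets)
    opt.zero_grad()
    loss.backward()
    opt.step()

with torch.no_grad():
    top2 = torch.topk(Q @ D.T, k=2, dim=1).indices
solved = sum(set(top2[qi].tolist()) == set(pair) for qi, pair in enumerate(pairs))
print(f"{solved}/{n_queries} target pairs recovered as exact top-2 at d={d}")
```

Because the vectors are optimized directly on the test pattern, any remaining failures cannot be blamed on training data or model capacity; they reflect the dimensional limit itself.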
Why it matters?
This work is important because it shows that there are inherent limits to what current embedding models can achieve. It suggests that simply making models bigger or training them on more data won't necessarily fix the problem. This means researchers need to explore new approaches to information retrieval and reasoning that go beyond the current 'single vector' paradigm, potentially developing new ways to represent and access information.
Abstract
Vector embeddings have been tasked with an ever-increasing set of retrieval tasks over the years, with a nascent rise in using them for reasoning, instruction-following, coding, and more. These new benchmarks push embeddings to work for any query and any notion of relevance that could be given. While prior works have pointed out theoretical limitations of vector embeddings, there is a common assumption that these difficulties are exclusively due to unrealistic queries, and those that are not can be overcome with better training data and larger models. In this work, we demonstrate that we may encounter these theoretical limitations in realistic settings with extremely simple queries. We connect known results in learning theory, showing that the number of top-k subsets of documents capable of being returned as the result of some query is limited by the dimension of the embedding. We empirically show that this holds true even if we restrict to k=2, and directly optimize on the test set with free parameterized embeddings. We then create a realistic dataset called LIMIT that stress tests models based on these theoretical results, and observe that even state-of-the-art models fail on this dataset despite the simple nature of the task. Our work shows the limits of embedding models under the existing single vector paradigm and calls for future research to develop methods that can resolve this fundamental limitation.
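For readers who want to probe this behavior with an off-the-shelf model, the sketch below computes recall@2 on a tiny LIMIT-style corpus in which each query has exactly two relevant documents. The toy queries and documents are hypothetical, written here only to show the shape of the evaluation; the model name is one common single-vector embedder, not the paper's choice.

```python
# Recall@2 on a toy LIMIT-style corpus with an off-the-shelf single-vector embedder.
import numpy as np
from sentence_transformers import SentenceTransformer

docs = [
    "Alice likes apples and kites.",
    "Bob likes apples and jazz.",
    "Carol likes kites and jazz.",
]
queries = ["Who likes apples?", "Who likes kites?", "Who likes jazz?"]
qrels = {0: {0, 1}, 1: {0, 2}, 2: {1, 2}}     # query index -> relevant doc indices

model = SentenceTransformer("all-MiniLM-L6-v2")  # any single-vector embedding model
q_emb = model.encode(queries, normalize_embeddings=True)
d_emb = model.encode(docs, normalize_embeddings=True)

scores = q_emb @ d_emb.T                       # cosine similarity matrix
top2 = np.argsort(-scores, axis=1)[:, :2]      # top-2 documents per query

recall_at_2 = np.mean([
    len(set(top2[i]) & qrels[i]) / len(qrels[i]) for i in range(len(queries))
])
print(f"recall@2 = {recall_at_2:.2f}")
```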