NoLiMa: Long-Context Evaluation Beyond Literal Matching

Ali Modarressi, Hanieh Deilamsalehy, Franck Dernoncourt, Trung Bui, Ryan A. Rossi, Seunghyun Yoon, Hinrich Schütze

2025-02-13

Summary

This paper introduces NoLiMa, a new way to test how well large language models (LLMs) can understand and work with long pieces of text. It's like creating a harder version of a word search puzzle: instead of letting the AI find matching letters, the test checks whether it truly understands the meaning behind the words.

What's the problem?

Current tests for LLMs are too easy because they let the AI find answers by simply matching words between questions and the text. This doesn't really show if the AI understands the meaning or can make connections between ideas. It's like letting someone ace a test by just circling words that appear in both the question and the answer key, without actually understanding the subject.

What's the solution?

The researchers created NoLiMa, which makes the test harder by removing obvious word matches between questions and answers. This forces the AI to actually understand the meaning and make connections between ideas to find the right information in a long text. They tested 12 popular AI models that claim to handle really long texts, and found that most of them struggle badly once the text gets longer and they can no longer rely on simple word matching.
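The core idea, removing literal word matches between the question and the "needle", can be sketched with a small lexical-overlap check. This is an illustrative sketch only; the example sentences and the `lexical_overlap` helper are invented for this explanation and are not taken from the actual benchmark.

```python
# Illustrative sketch: why literal matching makes classic needle-in-a-haystack
# tests easy, and how a NoLiMa-style question removes that shortcut.
# All example sentences are hypothetical, not from the real benchmark.

def lexical_overlap(question: str, needle: str) -> float:
    """Fraction of question words that literally appear in the needle."""
    q_words = {w.strip(".,?").lower() for w in question.split()}
    n_words = {w.strip(".,?").lower() for w in needle.split()}
    return len(q_words & n_words) / len(q_words)

needle = "Yuki lives next to the Semper Opera House."

# Classic NIAH-style question: most words appear verbatim in the needle,
# so a model can locate it by surface matching alone.
literal_q = "Who lives next to the Semper Opera House?"

# NoLiMa-style question: answering requires the latent association
# "the Semper Opera House is in Dresden" — almost no shared content words.
nolima_q = "Which character has been to Dresden?"

print(lexical_overlap(literal_q, needle))  # 0.875 — high overlap
print(lexical_overlap(nolima_q, needle))   # ~0.17 — only "to" matches
```

The second question can only be answered by a model that connects "Semper Opera House" with "Dresden", which is exactly the kind of latent-association reasoning the benchmark is designed to require.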

Why it matters?

This matters because as we use AI for more complex tasks, we need to know if it truly understands long documents or conversations. NoLiMa shows that current AI models might not be as good at handling long texts as we thought, which is important for things like analyzing big documents, answering questions about long stories, or understanding complex situations. It helps researchers know what they need to improve to make AI smarter and more useful for real-world tasks that involve lots of information.

Abstract

Recent large language models (LLMs) support long contexts ranging from 128K to 1M tokens. A popular method for evaluating these capabilities is the needle-in-a-haystack (NIAH) test, which involves retrieving a "needle" (relevant information) from a "haystack" (long irrelevant context). Extensions of this approach include increasing distractors, fact chaining, and in-context reasoning. However, in these benchmarks, models can exploit existing literal matches between the needle and haystack to simplify the task. To address this, we introduce NoLiMa, a benchmark extending NIAH with a carefully designed needle set, where questions and needles have minimal lexical overlap, requiring models to infer latent associations to locate the needle within the haystack. We evaluate 12 popular LLMs that claim to support contexts of at least 128K tokens. While they perform well in short contexts (<1K), performance degrades significantly as context length increases. At 32K, for instance, 10 models drop below 50% of their strong short-length baselines. Even GPT-4o, one of the top-performing exceptions, experiences a reduction from an almost-perfect baseline of 99.3% to 69.7%. Our analysis suggests these declines stem from the increased difficulty the attention mechanism faces in longer contexts when literal matches are absent, making it harder to retrieve relevant information.