Quantifying Generalization Complexity for Large Language Models
Zhenting Qi, Hongyin Luo, Xuliang Huang, Zhuokai Zhao, Yibo Jiang, Xiangjun Fan, Himabindu Lakkaraju, James Glass
2024-10-03

Summary
This paper discusses a new evaluation framework called Scylla, which measures how well large language models (LLMs) can generalize their learning to new tasks instead of just memorizing answers.
What's the problem?
Large language models have shown they can handle complex questions and tasks, but it's often unclear how much of their performance comes from actually understanding the material versus just remembering it. This makes it difficult to evaluate their true reasoning abilities, especially when faced with new or different types of problems.
What's the solution?
To tackle this issue, the researchers developed Scylla, a framework that separates generalization from memorization by testing LLMs on two types of data: in-distribution (ID) data, which is similar to what the model was trained on, and out-of-distribution (OOD) data, which is different. They evaluated models on 20 tasks spanning five levels of complexity and found that the gap between ID and OOD performance does not grow steadily with difficulty. Instead, it follows a 'generalization valley': the gap peaks at an intermediate level of complexity, called the critical complexity, where models lean most heavily on memorized patterns rather than genuine reasoning. They also found that larger models reach this critical complexity at harder tasks, meaning they can handle more complex problems before falling back on memorization. A minimal sketch of this analysis follows below.
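The core measurement behind the generalization valley can be summarized in a few lines of code. The snippet below is a minimal sketch, not the authors' released implementation: it assumes you already have a model's accuracy at each complexity level on ID and OOD task variants (the `id_acc` and `ood_acc` numbers are made up for illustration), and it simply locates the level where the ID-OOD gap peaks, i.e. the critical complexity.

```python
# Minimal sketch of a Scylla-style analysis: given per-level accuracies on
# in-distribution (ID) and out-of-distribution (OOD) task variants, compute
# the ID-OOD performance gap at each complexity level and locate the level
# where the gap peaks (the "critical complexity").
# Note: the accuracy dictionaries below are hypothetical placeholders,
# not results from the paper.

COMPLEXITY_LEVELS = [1, 2, 3, 4, 5]  # the paper uses 5 levels of complexity

def critical_complexity(id_accuracy: dict[int, float],
                        ood_accuracy: dict[int, float]) -> int:
    """Return the complexity level where the ID-OOD gap is largest."""
    gaps = {level: id_accuracy[level] - ood_accuracy[level]
            for level in COMPLEXITY_LEVELS}
    return max(gaps, key=gaps.get)

# Illustrative (made-up) numbers showing a generalization valley:
# the gap widens, peaks at level 3, then narrows again.
id_acc  = {1: 0.95, 2: 0.90, 3: 0.85, 4: 0.60, 5: 0.40}
ood_acc = {1: 0.93, 2: 0.82, 3: 0.55, 4: 0.45, 5: 0.32}

print(critical_complexity(id_acc, ood_acc))  # -> 3
```

Tracking how this peak shifts across model sizes is what lets the paper argue that larger models sustain genuine generalization up to higher task complexity.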
Why it matters?
This research is important because it helps improve our understanding of how LLMs learn and apply knowledge. By using Scylla to evaluate these models more effectively, developers can create better AI systems that truly understand language and reasoning, leading to more reliable applications in areas like education, customer service, and content creation.
Abstract
While large language models (LLMs) have shown exceptional capabilities in understanding complex queries and performing sophisticated tasks, their generalization abilities are often deeply entangled with memorization, necessitating more precise evaluation. To address this challenge, we introduce Scylla, a dynamic evaluation framework that quantitatively measures the generalization abilities of LLMs. Scylla disentangles generalization from memorization by assessing model performance on both in-distribution (ID) and out-of-distribution (OOD) data through 20 tasks across 5 levels of complexity. Through extensive experiments, we uncover a non-monotonic relationship between task complexity and the performance gap between ID and OOD data, which we term the generalization valley. Specifically, this phenomenon reveals a critical threshold - referred to as critical complexity - where reliance on non-generalizable behavior peaks, indicating the upper bound of LLMs' generalization capabilities. As model size increases, the critical complexity shifts toward higher levels of task complexity, suggesting that larger models can handle more complex reasoning tasks before over-relying on memorization. Leveraging Scylla and the concept of critical complexity, we benchmark 28 LLMs, including open-source models such as the LLaMA and Qwen families and closed-source models like Claude and GPT, providing a more robust evaluation and establishing a clearer understanding of LLMs' generalization capabilities.