CL-bench: A Benchmark for Context Learning
Shihan Dou, Ming Zhang, Zhangyue Yin, Chenhao Huang, Yujiong Shen, Junzhe Wang, Jiayi Chen, Yuchen Ni, Junjie Ye, Cheng Zhang, Huaibing Xie, Jianglu Hu, Shaolei Wang, Weichao Wang, Yanling Xiao, Yiting Liu, Zenan Xu, Zhen Guo, Pluto Zhou, Tao Gui, Zuxuan Wu, Xipeng Qiu
2026-02-05
Summary
This paper introduces a new way to test how well language models can truly *learn* from information given to them, rather than just relying on what they already know from their initial training.
What's the problem?
Current language models are good at using knowledge from their pre-training to answer questions, but real-world problems are often complex and depend on new, task-specific details supplied *with* the problem itself. Models struggle to learn and apply this new information effectively, something humans do easily: they adapt poorly when they must reason from a fresh set of rules or facts presented to them.
What's the solution?
The researchers created a challenging benchmark called CL-bench, built by experienced domain experts. It includes 500 complex scenarios, 1,899 individual tasks, and 31,607 verification rubrics for checking whether a model's answers are correct. Crucially, each task can only be solved using information found within its scenario – nothing the model could have known beforehand. The researchers then tested ten frontier language models on this benchmark.
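To make the setup concrete, here is a minimal sketch of how a rubric-based benchmark like this could be scored. The data schema (`context`, `tasks`, `rubrics` fields), the `judge` callable, and the all-rubrics-pass threshold are illustrative assumptions, not the paper's actual format or grading procedure.

```python
# Hypothetical sketch of rubric-based scoring for a CL-bench-style benchmark.
# Schema and pass criterion are assumptions for illustration only.

def score_task(response: str, rubrics: list[str], judge) -> float:
    """Fraction of rubrics the response satisfies, per an external judge."""
    passed = sum(1 for rubric in rubrics if judge(response, rubric))
    return passed / len(rubrics)

def evaluate(scenarios, model, judge, threshold=1.0) -> float:
    """Overall solve rate: a task counts as solved only if its rubric
    score meets the threshold. Each prompt pairs the scenario's full
    context with one task, since tasks are solvable only from context."""
    solved = total = 0
    for scenario in scenarios:
        for task in scenario["tasks"]:
            prompt = scenario["context"] + "\n\n" + task["question"]
            response = model(prompt)
            total += 1
            if score_task(response, task["rubrics"], judge) >= threshold:
                solved += 1
    return solved / total
```

With a strict threshold of 1.0, partial rubric credit does not count a task as solved, which mirrors how hard, multi-criterion tasks tend to be graded; a looser threshold would report partial success instead.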
Why it matters?
The results showed that the ten models solved only 17.2% of tasks on average, and even the best performer, GPT-5.1, solved just 23.7%, exposing a significant weakness in their ability to learn from context. This is a major hurdle for deploying these models in real-world applications, where they will constantly encounter new situations. CL-bench provides a tool to push the development of more intelligent language models that can truly learn and adapt, making them more useful and reliable.
Abstract
Current language models (LMs) excel at reasoning over prompts using pre-trained knowledge. However, real-world tasks are far more complex and context-dependent: models must learn from task-specific context and leverage new knowledge beyond what is learned during pre-training to reason and resolve tasks. We term this capability context learning, a crucial ability that humans naturally possess but has been largely overlooked. To this end, we introduce CL-bench, a real-world benchmark consisting of 500 complex contexts, 1,899 tasks, and 31,607 verification rubrics, all crafted by experienced domain experts. Each task is designed such that the new content required to resolve it is contained within the corresponding context. Resolving tasks in CL-bench requires models to learn from the context, ranging from new domain-specific knowledge, rule systems, and complex procedures to laws derived from empirical data, all of which are absent from pre-training. This goes far beyond long-context tasks that primarily test retrieval or reading comprehension, and in-context learning tasks, where models learn simple task patterns via instructions and demonstrations. Our evaluations of ten frontier LMs find that models solve only 17.2% of tasks on average. Even the best-performing model, GPT-5.1, solves only 23.7%, revealing that LMs have yet to achieve effective context learning, which poses a critical bottleneck for tackling real-world, complex context-dependent tasks. CL-bench represents a step towards building LMs with this fundamental capability, making them more intelligent and advancing their deployment in real-world scenarios.