ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning
Bill Yuchen Lin, Ronan Le Bras, Kyle Richardson, Ashish Sabharwal, Radha Poovendran, Peter Clark, Yejin Choi
2025-02-04

Summary
This paper introduces ZebraLogic, a new way to test how well large language models (LLMs) can reason logically and solve complex puzzles. It's like a scalable IQ test for AI: logic grid puzzles of controllable difficulty are used to see how these models cope as problems get harder.
What's the problem?
Even though AI models are getting very good at many tasks, they still struggle with complex logical reasoning. It's also hard to measure exactly how well these models can reason and where their limits lie as problems become more difficult.
What's the solution?
The researchers created ZebraLogic, a system that generates logic grid puzzles with controllable levels of difficulty. These puzzles are formulated as constraint satisfaction problems (CSPs). ZebraLogic lets researchers test popular AI models such as Llama, OpenAI's o1 models, and DeepSeek-R1 on these puzzles to see how they perform as the problems get harder. The researchers also tried different ways to help the models do better, like giving them multiple chances to solve a problem (Best-of-N sampling) or letting them check their own work (self-verification).
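To make the idea of a logic grid puzzle as a constraint satisfaction problem concrete, here is a minimal hand-made sketch in Python. It is not the authors' generator: the houses, people, drinks, and clues are invented for illustration. Each attribute is an assignment of values to houses, and the clues are boolean constraints a valid solution must satisfy; ZebraLogic puzzles are constructed so that exactly one assignment satisfies all clues.

```python
# Toy logic grid puzzle posed as a constraint satisfaction problem.
# Illustrative sketch only, not the ZebraLogic generator:
# the houses, people, drinks, and clues below are invented.
from itertools import permutations

def solve():
    # Each attribute (name, drink) is a permutation over 3 houses (0, 1, 2).
    for names in permutations(["Alice", "Bob", "Carol"]):
        for drinks in permutations(["tea", "coffee", "milk"]):
            if (
                # Clue 1: Alice lives immediately left of the coffee drinker.
                names.index("Alice") + 1 == drinks.index("coffee")
                # Clue 2: Bob drinks milk.
                and drinks[names.index("Bob")] == "milk"
                # Clue 3: Tea is drunk in the first house.
                and drinks.index("tea") == 0
            ):
                yield list(zip(range(3), names, drinks))

solutions = list(solve())
print(solutions)  # [[(0, 'Alice', 'tea'), (1, 'Carol', 'coffee'), (2, 'Bob', 'milk')]]
assert len(solutions) == 1  # a well-formed puzzle has a unique solution
```

Brute-force enumeration like this scales factorially in the number of houses for every added attribute, which hints at why larger puzzles become so hard (see the growth estimate after the abstract).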
Why it matters?
This research matters because it helps us understand the limits of current AI technology when it comes to logical reasoning. By showing that even the best AI models struggle as reasoning tasks grow more complex, it points out where improvement is needed. That could lead to smarter AI systems that can handle complex real-world problems in areas like planning and scheduling. It also reminds us that while AI is advancing quickly, there are still some kinds of thinking that humans do better.
Abstract
We investigate the logical reasoning capabilities of large language models (LLMs) and their scalability in complex non-monotonic reasoning. To this end, we introduce ZebraLogic, a comprehensive evaluation framework for assessing LLM reasoning performance on logic grid puzzles derived from constraint satisfaction problems (CSPs). ZebraLogic enables the generation of puzzles with controllable and quantifiable complexity, facilitating a systematic study of the scaling limits of models such as Llama, o1 models, and DeepSeek-R1. By encompassing a broad range of search space complexities and diverse logical constraints, ZebraLogic provides a structured environment to evaluate reasoning under increasing difficulty. Our results reveal a significant decline in accuracy as problem complexity grows -- a phenomenon we term the curse of complexity. This limitation persists even with larger models and increased inference-time computation, suggesting inherent constraints in current LLM reasoning capabilities. Additionally, we explore strategies to enhance logical reasoning, including Best-of-N sampling, backtracking mechanisms, and self-verification prompts. Our findings offer critical insights into the scalability of LLM reasoning, highlight fundamental limitations, and outline potential directions for improvement.
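To give a feel for the "curse of complexity", here is a back-of-the-envelope calculation of how fast the candidate search space grows with puzzle size. It assumes the usual combinatorial count for a grid with N houses and M attributes, (N!)^M possible assignments; the exact complexity measures used in the paper may differ in detail.

```python
# Rough search-space sizes for logic grid puzzles of increasing size.
# Assumption: with N houses and M attributes, each attribute can be
# arranged over the houses in N! ways, giving (N!)**M candidates overall.
from math import factorial

for n, m in [(2, 3), (3, 3), (4, 4), (5, 5), (6, 6)]:
    size = factorial(n) ** m
    print(f"{n} houses x {m} attributes: {size:.3e} candidate assignments")
```

Even a 6x6 grid already has on the order of 10^17 candidate assignments, which illustrates why accuracy drops sharply as puzzle complexity grows.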