PhD Knowledge Not Required: A Reasoning Challenge for Large Language Models
Carolyn Jane Anderson, Joydeep Biswas, Aleksander Boruch-Gruszecki, Federico Cassano, Molly Q Feldman, Arjun Guha, Francesca Lucchetti, Zixuan Wu
2025-02-04
Summary
This paper introduces a new benchmark that uses puzzles from NPR's Sunday Puzzle Challenge to test how well AI models can reason and solve problems. The puzzles are tricky to solve but require only general knowledge, making them a good way to evaluate reasoning abilities.
What's the problem?
Most existing benchmarks for AI focus on highly specialized knowledge, like advanced math or science, which doesn't reflect how AI might be used in everyday situations. These tests also make it hard to spot reasoning mistakes or to measure progress as AI improves. There's a need for benchmarks that are easier to understand and that focus on general problem-solving skills.
What's the solution?
The researchers created a benchmark of roughly 600 puzzles from the NPR Sunday Puzzle Challenge. These puzzles reward insight and elimination rather than memorized facts, making them ideal for testing reasoning. They found that reasoning-focused AI models like OpenAI's o1 performed better than others but still showed weaknesses, such as giving up before answering or expressing uncertainty about their answers. The benchmark is updated weekly with new puzzles to keep it fresh and prevent AI models from simply memorizing solutions.
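Because each puzzle has a short, unambiguous answer, grading can be fully automatic. Here is a minimal sketch of how such verification might work; the function names and the example answers are invented for illustration and are not taken from the paper's actual evaluation harness:

```python
import re

def normalize(answer: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    answer = answer.lower()
    answer = re.sub(r"[^a-z0-9\s]", "", answer)
    return " ".join(answer.split())

def is_correct(model_answer: str, gold_answer: str) -> bool:
    """A puzzle's answer is short and unambiguous, so a normalized
    exact match is enough to grade a model's final answer."""
    return normalize(model_answer) == normalize(gold_answer)

# Hypothetical word puzzle whose answer is the anagram pair "stone / notes"
print(is_correct("Stone / Notes", "stone / notes"))   # matches
print(is_correct("granite", "stone / notes"))          # does not match
```

This simplicity is the point of the benchmark's design: unlike open-ended or "PhD-level" tasks, no expert judge is needed to decide whether a response is right.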
Why does it matter?
This research is important because it provides a more accessible way to test AI reasoning skills, focusing on tasks that humans can solve with general knowledge. It helps identify gaps in AI capabilities and shows where improvements are needed. By using human-like challenges, this benchmark pushes AI development toward solving real-world problems more effectively.
Abstract
Existing benchmarks for frontier models often test specialized, "PhD-level" knowledge that is difficult for non-experts to grasp. In contrast, we present a benchmark based on the NPR Sunday Puzzle Challenge that requires only general knowledge. Our benchmark is challenging for both humans and models; however, correct solutions are easy to verify, and models' mistakes are easy to spot. Our work reveals capability gaps that are not evident in existing benchmarks: OpenAI o1 significantly outperforms other reasoning models that are on par with it on benchmarks that test specialized knowledge. Furthermore, our analysis of reasoning outputs uncovers new kinds of failures. DeepSeek R1, for instance, often concedes with "I give up" before providing an answer that it knows is wrong. R1 can also be remarkably "uncertain" in its output and, in rare cases, does not "finish thinking," which suggests the need for an inference-time technique to "wrap up" before the context window limit is reached. We also quantify the effectiveness of reasoning longer with R1 and Gemini Thinking to identify the point beyond which more reasoning is unlikely to improve accuracy on our benchmark.