PuzzlePlex: Benchmarking Foundation Models on Reasoning and Planning with Puzzles
Yitao Long, Yuru Jiang, Hongjun Liu, Yilun Zhao, Jingchen Sun, Yiqiu Shen, Chen Zhao, Arman Cohan, Dennis Shasha
2025-10-09
Summary
This research explores how well large AI models, called foundation models, can think through problems and make plans, especially when things are complicated and change over time.
What's the problem?
Currently, it's hard to reliably test if these powerful AI models can actually *reason* and *plan* effectively. Existing tests aren't diverse enough or challenging enough to really push their limits, and it's unclear how well they'll perform as they get even bigger and more complex. We need a way to measure these skills in a standardized and scalable way.
What's the solution?
The researchers created a new benchmark called PuzzlePlex. It includes 15 types of puzzles that range from simple to very difficult, cover both deterministic and stochastic games, and include single-player scenarios as well as two-player scenarios where models compete against each other. They also built fine-grained metrics to measure how well an AI does on these puzzles, and compared two ways of playing: following instructions directly versus writing and running code. Finally, they tested how performance changes as the models get larger.
Why does it matter?
This work is important because it gives us a better way to evaluate and improve the reasoning and planning abilities of AI. By identifying the strengths and weaknesses of different AI approaches, we can build smarter and more reliable AI systems that can handle real-world problems that require complex thought and adaptation.
Abstract
This work investigates the reasoning and planning capabilities of foundation models and their scalability in complex, dynamic environments. We introduce PuzzlePlex, a benchmark designed to assess these capabilities through a diverse set of puzzles. PuzzlePlex consists of 15 types of puzzles, including deterministic and stochastic games of varying difficulty, as well as single-player and two-player scenarios. The PuzzlePlex framework provides a comprehensive environment for each game, and supports extensibility to generate more challenging instances as foundation models evolve. Additionally, we implement customized game-playing strategies for comparison. Building on this benchmark, we develop fine-grained metrics to measure performance and conduct an in-depth analysis of frontier foundation models across two settings: instruction-based and code-based. Furthermore, we systematically investigate their scaling limits. Our findings show that reasoning models outperform others in instruction-based settings, while code-based execution presents greater challenges but offers a scalable and efficient alternative. PuzzlePlex enables targeted evaluation and guides future improvements in reasoning, planning, and generalization for foundation models.
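The abstract mentions that each game comes with a comprehensive environment and that models can play either by following instructions or by writing and running code. The paper's actual interfaces are not shown here, but as a rough illustration of what a puzzle environment in such a benchmark might look like, below is a minimal sketch of a deterministic single-player take-away game plus a scripted strategy; all class and function names are invented for this example and are not taken from PuzzlePlex.

```python
# Hypothetical sketch of a deterministic single-player puzzle environment.
# Names (TakeAwayPuzzle, greedy_strategy, play) are invented, not from the paper.

class TakeAwayPuzzle:
    """Remove 1-3 stones per move; goal is to empty the pile in as few moves as possible."""

    def __init__(self, stones: int):
        self.stones = stones
        self.moves = 0

    def legal_moves(self) -> list[int]:
        # Only moves that do not take more stones than remain are legal.
        return [k for k in (1, 2, 3) if k <= self.stones]

    def step(self, take: int) -> None:
        if take not in self.legal_moves():
            raise ValueError(f"illegal move: {take}")
        self.stones -= take
        self.moves += 1

    def done(self) -> bool:
        return self.stones == 0


def greedy_strategy(puzzle: TakeAwayPuzzle) -> int:
    # Always take the maximum allowed, which minimizes the move count here.
    return max(puzzle.legal_moves())


def play(puzzle: TakeAwayPuzzle, strategy) -> int:
    """Run a strategy until the puzzle is solved; return the number of moves used."""
    while not puzzle.done():
        puzzle.step(strategy(puzzle))
    return puzzle.moves
```

For example, `play(TakeAwayPuzzle(10), greedy_strategy)` solves a 10-stone pile in 4 moves (3 + 3 + 3 + 1). A scripted baseline like this is the kind of customized game-playing strategy a benchmark can compare foundation models against.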