TextQuests: How Good are LLMs at Text-Based Video Games?
Long Phan, Mantas Mazeika, Andy Zou, Dan Hendrycks
2025-08-12
Summary
This paper introduces TextQuests, a benchmark for testing how well large language models (LLMs) play long, complex text-based video games. These games are interactive stories in which the player types commands to explore, solve puzzles, and progress. The goal is to measure how well an AI can reason, plan, and solve problems on its own, without any outside help.
What's the problem?
Most current AI benchmarks focus on narrow skills, or on tasks where the model can call tools or look things up, which reveals little about whether it can reason deeply on its own over a long horizon. Models often struggle in environments that require them to remember everything they have done, learn by trial and error, and make detailed plans without extra aids.
What's the solution?
The researchers built TextQuests from 25 classic text-based adventure games that take humans many hours to finish and demand long sequences of precise steps. The benchmark requires AI agents to solve each game using only their own reasoning over a single long session, without external tools or shortcuts. This isolates how well an AI can explore, learn from its mistakes, and solve complex puzzles by itself.
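The interaction pattern this describes can be sketched as a simple loop: the agent reads the entire text transcript so far, chooses the next typed command, and appends the game's text response to the transcript. The `ToyGame` environment and scripted agent below are hypothetical stand-ins for illustration only; they are not part of the actual benchmark or its games.

```python
# Minimal sketch of the agent-environment loop implied by a text-adventure
# benchmark. ToyGame and scripted_agent are hypothetical stand-ins: a real
# run would use one of the 25 games and an LLM choosing the next command.

class ToyGame:
    """A two-step text adventure: take the key, then open the door."""

    def __init__(self):
        self.has_key = False
        self.done = False

    def step(self, command):
        # Return a text observation for the typed command.
        if command == "take key":
            self.has_key = True
            return "You pick up a small brass key."
        if command == "open door":
            if self.has_key:
                self.done = True
                return "The door swings open. You win!"
            return "The door is locked."
        return "Nothing happens."

def scripted_agent(history):
    """Stand-in for an LLM: picks the next command from the transcript."""
    transcript = "\n".join(history)
    if "brass key" not in transcript:
        return "take key"
    return "open door"

def play(game, agent, max_steps=10):
    history = ["You are in a stone room. A key lies on the floor."]
    for _ in range(max_steps):
        command = agent(history)          # agent sees the entire history
        observation = game.step(command)  # text in, text out; no tools
        history += [f"> {command}", observation]
        if game.done:
            break
    return history
```

The point of the sketch is the constraint, not the toy game: the agent's only input is the growing transcript, so success over hundreds of steps depends on remembering earlier observations and planning from them alone.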
Why does it matter?
This matters because real-world AI systems need to think through complicated problems entirely on their own, not just follow simple commands or lean on tools. TextQuests shows researchers where AI currently stands on this tough challenge, pushing progress toward smarter, more independent systems that could become better assistants and problem solvers in everyday life.
Abstract
TextQuests evaluates AI agents' intrinsic reasoning and problem-solving capabilities in long, exploratory, text-based interactive fiction environments without external tools.