TowerMind: A Tower Defence Game Learning Environment and Benchmark for LLM as Agents
Dawei Wang, Chengming Zhou, Di Zhao, Xinyuan Liu, Marci Chi Ma, Gary Ushaw, Richard Davison
2026-01-12
Summary
This paper introduces TowerMind, a new game-based environment designed to test how well large language models (LLMs) can plan and make decisions, specifically by having them play a tower defense game.
What's the problem?
Currently, testing LLMs’ ability to think strategically and adapt in real time is difficult. Existing game environments that could serve this purpose are either too computationally demanding to run easily, or they don’t expose the game in forms an LLM can readily work with, such as text descriptions alongside visual observations. This limits our ability to properly evaluate these models.
What's the solution?
The researchers created TowerMind, a tower defense game environment that is lightweight to run and gives LLMs multiple ways to observe the game: through pixels (like a screenshot), text descriptions, and structured game data. This allows for a more comprehensive evaluation. They then tested several widely used LLMs, along with two classic reinforcement learning methods (Ape-X DQN and PPO), on five benchmark levels, looking at how well the models planned, how often they hallucinated or made mistakes, and how efficiently they used their actions.
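To make the multimodal observation idea concrete, here is a minimal, hypothetical sketch of what a single observation might contain and how the structured game data could be turned into a text description for an LLM prompt. The field names (pixels, state, text) and the describe_state helper are illustrative assumptions, not TowerMind's actual API.

```python
# Hypothetical sketch of a multimodal observation; the field names and the
# describe_state() helper are illustrative assumptions, not TowerMind's API.
import numpy as np

def describe_state(state: dict) -> str:
    """Serialize a structured game state into a short textual description
    suitable for inclusion in an LLM prompt."""
    towers = ", ".join(f"{t['type']} at {t['pos']}" for t in state["towers"]) or "none"
    return (
        f"Wave {state['wave']}: {state['enemies_alive']} enemies alive, "
        f"gold={state['gold']}, base HP={state['base_hp']}. Towers: {towers}."
    )

# One observation could expose all three modalities side by side.
observation = {
    "pixels": np.zeros((180, 320, 3), dtype=np.uint8),  # rendered frame (screenshot-like)
    "state": {                                           # structured game state
        "wave": 3,
        "gold": 120,
        "base_hp": 18,
        "enemies_alive": 7,
        "towers": [{"type": "arrow", "pos": (4, 2)}],
    },
}
observation["text"] = describe_state(observation["state"])  # textual description
print(observation["text"])
```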
Why it matters?
TowerMind is important because it provides a new, accessible way to test and improve the strategic thinking abilities of LLMs. The results show that current LLMs still have significant limitations in planning, adapting to different situations, and avoiding errors, highlighting areas where further research is needed to create more intelligent AI agents. It also offers a standard benchmark for comparing different AI approaches in a complex, dynamic environment.
Abstract
Recent breakthroughs in Large Language Models (LLMs) have positioned them as a promising paradigm for agents, with long-term planning and decision-making emerging as core general-purpose capabilities for adapting to diverse scenarios and tasks. Real-time strategy (RTS) games serve as an ideal testbed for evaluating these two capabilities, as their inherent gameplay requires both macro-level strategic planning and micro-level tactical adaptation and action execution. Existing RTS game-based environments either suffer from relatively high computational demands or lack support for textual observations, which has constrained the use of RTS games for LLM evaluation. Motivated by this, we present TowerMind, a novel environment grounded in the tower defense (TD) subgenre of RTS games. TowerMind preserves the key evaluation strengths of RTS games for assessing LLMs, while featuring low computational demands and a multimodal observation space, including pixel-based, textual, and structured game-state representations. In addition, TowerMind supports the evaluation of model hallucination and provides a high degree of customizability. We design five benchmark levels to evaluate several widely used LLMs under different multimodal input settings. The results reveal a clear performance gap between LLMs and human experts across both capability and hallucination dimensions. The experiments further highlight key limitations in LLM behavior, such as inadequate planning validation, a lack of multifinality in decision-making, and inefficient action use. We also evaluate two classic reinforcement learning algorithms: Ape-X DQN and PPO. By offering a lightweight and multimodal design, TowerMind complements the existing RTS game-based environment landscape and introduces a new benchmark for the AI agent field. The source code is publicly available on GitHub (https://github.com/tb6147877/TowerMind).
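As a rough illustration of how agents could be benchmarked in such an environment, the following sketch shows a generic episode loop over one level. The TowerMindEnv interface, the level identifier, and the llm_choose_action placeholder are assumptions for illustration and do not reflect the repository's actual code.

```python
# Hypothetical benchmark loop; TowerMindEnv, the level id, and
# llm_choose_action() are illustrative placeholders, not TowerMind's real API.
from typing import Any

def llm_choose_action(text_obs: str) -> dict[str, Any]:
    """Placeholder for an LLM call that maps a textual observation to an
    in-game action, e.g. building or upgrading a tower at a grid cell."""
    return {"op": "build", "tower": "arrow", "pos": (4, 2)}

def run_episode(env, max_steps: int = 500) -> float:
    """Play one level up to a step cap and return the accumulated score."""
    obs = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = llm_choose_action(obs["text"])      # decide from the textual view
        obs, reward, done, info = env.step(action)   # advance the game one step
        total_reward += reward
        if done:
            break
    return total_reward

# Example usage (assuming a Gym-style TowerMindEnv exists):
# score = run_episode(TowerMindEnv(level="level_1"))
```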