Game-TARS: Pretrained Foundation Models for Scalable Generalist Multimodal Game Agents
Zihao Wang, Xujing Li, Yining Ye, Junjie Fang, Haoming Wang, Longxiang Liu, Shihao Liang, Junting Lu, Zhiyong Wu, Jiazhan Feng, Wanjun Zhong, Zili Li, Yu Wang, Yu Miao, Bo Zhou, Yuanfan Li, Hao Wang, Zhongkai Zhao, Faming Wu, Zhengxuan Jiang, Weihao Tan, Heyuan Yao
2025-10-29
Summary
This paper introduces Game-TARS, a new artificial intelligence agent designed to be good at playing a wide variety of games, and even using computers in general, much like a person would.
What's the problem?
Existing game-playing AI often struggles to adapt to new games or environments because it relies on very specific ways of interacting with each game, such as calling the game's internal code (API) or simulating mouse clicks and keyboard presses in a narrowly scripted way. This makes it hard to train an AI that can handle the huge diversity of games and computer tasks out there, and it limits how well the AI can transfer what it learns from one situation to another.
What's the solution?
The researchers created Game-TARS, which takes a more natural approach: it controls games directly through standard keyboard and mouse inputs, just like a human player. They then trained this agent on a massive amount of data, over 500 billion tokens, drawn from many different sources, including computer operating systems, the web, and simulated game environments. To help the AI learn effectively, they used a decaying loss that gradually down-weights repeated training data so the model does not simply memorize action sequences (a problem called causal confusion), and a "Sparse-Thinking" strategy that lets the agent reason in depth only when needed, keeping inference costs down.
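The decaying loss idea can be illustrated with a minimal sketch. The paper's exact schedule is not given here, so the exponential decay and the function names below are assumptions for illustration: each additional pass over the same trajectory contributes less to the loss.

```python
def decayed_loss_weight(revisit: int, base_decay: float = 0.5) -> float:
    """Hypothetical exponential schedule: the weight applied to a sample's
    loss on its `revisit`-th repeated pass. Repeated data contributes
    exponentially less, discouraging rote memorization of action sequences."""
    return base_decay ** revisit

def continual_loss(token_losses: list[float], revisit: int) -> float:
    """Mean token loss for one sample, scaled by its decay weight."""
    weight = decayed_loss_weight(revisit)
    return weight * sum(token_losses) / len(token_losses)

# First pass: full weight. Second pass over the same data: half weight.
print(decayed_loss_weight(0))            # 1.0
print(continual_loss([1.0, 3.0], 1))     # 0.5 * mean(1.0, 3.0) = 1.0
```

The design intent is that fresh trajectories dominate the gradient while stale, already-seen ones fade, which is one simple way to reduce causal confusion during continual pre-training.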
Why it matters?
Game-TARS performs significantly better than previous AI models on challenging tasks like open-world Minecraft, and it surpasses powerful models such as GPT-5, Gemini-2.5-Pro, and Claude-4-Sonnet on first-person shooter benchmarks. This shows that a simple, human-like control scheme combined with large-scale training is a promising way to build AI agents that can handle a wide range of computer-based tasks, bringing us closer to truly general-purpose AI.
Abstract
We present Game-TARS, a generalist game agent trained with a unified, scalable action space anchored to human-aligned native keyboard-mouse inputs. Unlike API- or GUI-based approaches, this paradigm enables large-scale continual pre-training across heterogeneous domains, including OS, web, and simulation games. Game-TARS is pre-trained on over 500B tokens with diverse trajectories and multimodal data. Key techniques include a decaying continual loss to reduce causal confusion and an efficient Sparse-Thinking strategy that balances reasoning depth and inference cost. Experiments show that Game-TARS achieves about twice the success rate of the previous state-of-the-art model on open-world Minecraft tasks, approaches the generality of novice human players on unseen web 3D games, and outperforms GPT-5, Gemini-2.5-Pro, and Claude-4-Sonnet on FPS benchmarks. Scaling results at training time and test time confirm that the unified action space sustains improvements when scaled to cross-game and multimodal data. Our results demonstrate that simple, scalable action representations combined with large-scale pre-training provide a promising path toward generalist agents with broad computer-use abilities.
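To make the "unified action space" concrete, here is a minimal sketch of how raw keyboard-mouse actions might be serialized into flat text tokens. The format and function name are hypothetical (the paper's actual vocabulary is not reproduced here); the point is that one shared, human-aligned action encoding can cover OS, web, and game environments alike.

```python
def encode_action(keys=(), mouse_move=None, click=None) -> str:
    """Serialize one low-level keyboard/mouse action as a token string.

    keys:       iterable of key names held or pressed, e.g. ["w"]
    mouse_move: optional (dx, dy) cursor delta in pixels
    click:      optional button name, e.g. "left"
    """
    parts = []
    for key in keys:
        parts.append(f"key({key})")
    if mouse_move is not None:
        dx, dy = mouse_move
        parts.append(f"move({dx},{dy})")
    if click is not None:
        parts.append(f"click({click})")
    return " ".join(parts)

# The same encoding works for a game (WASD + camera) and a desktop UI (click):
print(encode_action(keys=["w"], mouse_move=(10, -5)))  # key(w) move(10,-5)
print(encode_action(click="left"))                     # click(left)
```

Because every environment is driven through the same textual action vocabulary, trajectories from very different domains can be mixed in a single pre-training corpus, which is what allows the scaling results to hold across games and modalities.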