Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning

Chengshuai Shi, Wenzhe Li, Xinran Liang, Yizhou Lu, Wenjia Yang, Ruirong Feng, Seth Karten, Ziran Yang, Zihan Ding, Gabriel Sarch, Danqi Chen, Karthik Narasimhan, Chi Jin

2026-05-04

Summary

This paper explores how to teach powerful vision-language models, which are good at understanding both images and text, to play video games like Super Mario Land. The goal is to get these models to make smart decisions over a long period of gameplay, requiring them to see, think, and act effectively.

What's the problem?

Teaching these models to play games is hard because existing methods either need huge amounts of example gameplay from humans, or they only work well for short bursts of action (roughly 20-30 turns). Super Mario Land is particularly challenging because it requires planning and consistent action over many turns, more than 100, to succeed. Simply applying existing techniques doesn't work well for this kind of long-horizon, visually complex task.

What's the solution?

The researchers adapted a reinforcement learning technique called PPO, adding a lightweight critic that scores each turn of the game, which made training more stable and sample-efficient than critic-free alternatives (a rough sketch of the idea appears below). They also found that starting from a model already pretrained on images and language gave a big head start, so the agent needed far less practice to learn to play. They packaged these ideas into an open training framework called Odysseus, which significantly outperformed other approaches and made much better progress in the game.
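To make the turn-level critic idea concrete, here is a minimal sketch in PyTorch. It is an illustration under assumptions, not the paper's implementation: the pooled per-turn state features, the TurnCritic module, and the helper names (turn_level_gae, ppo_loss) are all hypothetical.

```python
import torch
import torch.nn as nn

class TurnCritic(nn.Module):
    """Hypothetical turn-level critic: a small value head that scores
    one pooled state per game turn, not one value per generated token."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.value_head = nn.Linear(hidden_dim, 1)

    def forward(self, turn_states: torch.Tensor) -> torch.Tensor:
        # turn_states: (num_turns, hidden_dim) pooled features, one per turn
        return self.value_head(turn_states).squeeze(-1)

def turn_level_gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation computed per turn. Assumes the
    episode terminates, so the value after the final turn is zero."""
    advantages = torch.zeros_like(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages

def ppo_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Standard clipped PPO objective over turn-level log-probs;
    advantages are assumed to be detached from the graph."""
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```

The design choice worth noting is that values and advantages are computed once per game turn rather than per generated token, which keeps the critic lightweight even over episodes lasting 100+ turns.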

Why it matters?

This work matters because it shows how to make reinforcement learning stable and effective when applied to vision-language models on complex, long-horizon tasks. It offers practical guidance for building AI agents that interact with rich environments rather than perform only simple tasks, and it opens the door to agents that learn a wide variety of games and, potentially, other real-world problems demanding both visual understanding and strategic decision-making.

Abstract

Given the rapidly growing capabilities of vision-language models (VLMs), extending them to interactive decision-making tasks such as video games has emerged as a promising frontier. However, existing approaches either rely on large-scale supervised fine-tuning (SFT) on human trajectories or apply reinforcement learning (RL) only in relatively short-horizon settings (typically around 20-30 turns). In this work, we study RL-based training of VLMs for long-horizon decision-making in Super Mario Land, a visually grounded environment requiring 100+ turns of interaction with coordinated perception, reasoning, and action. We begin with a systematic investigation of key algorithmic components and propose an adapted variant of PPO with a lightweight turn-level critic, which substantially improves training stability and sample efficiency over critic-free methods such as GRPO and Reinforce++. We further show that pretrained VLMs provide strong action priors, significantly improving sample efficiency during RL training and reducing the need for manual design choices such as action engineering, compared to classical deep RL trained from scratch. Building on these insights, we introduce Odysseus, an open training framework for VLM agents, achieving substantial gains across multiple levels of the game and at least 3 times the average game progress of frontier models. Moreover, the trained models exhibit consistent improvements under both in-game and cross-game generalization settings, while maintaining general-domain capabilities. Overall, our results identify key ingredients for making RL stable and effective in long-horizon, multi-modal settings, and provide practical guidance for developing VLMs as embodied agents.
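As a complement, here is a minimal sketch of what a long-horizon rollout loop for such an agent could look like. The env and agent interfaces and the collect_episode helper are assumptions for illustration; the paper's actual framework is not shown here.

```python
def collect_episode(env, agent, max_turns=120):
    """Illustrative long-horizon rollout: the agent must sustain coherent
    behavior for 100+ turns, observing a game frame each turn and choosing
    an action with its policy. Interfaces here are assumed, not the paper's."""
    trajectory = []
    obs = env.reset()  # e.g., a game frame (image) plus textual context
    for _ in range(max_turns):
        action, logp = agent.act(obs)  # VLM maps (image, text) -> action
        next_obs, reward, done = env.step(action)
        trajectory.append((obs, action, logp, reward))
        obs = next_obs
        if done:
            break
    return trajectory
```

Trajectories collected this way would feed the turn-level critic and clipped PPO update sketched earlier, with one reward, value, and log-probability recorded per turn.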