
FlashAdventure: A Benchmark for GUI Agents Solving Full Story Arcs in Diverse Adventure Games

Jaewoo Ahn, Junseo Kim, Heeseung Yun, Jaehyeon Son, Dongmin Park, Jaewoong Cho, Gunhee Kim

2025-09-03


Summary

This paper explores how well artificial intelligence agents, specifically those powered by large language models, can play adventure video games. It focuses on the difficulty these agents have with games that require remembering information early on and acting on it much later to complete a full storyline.

What's the problem?

Current AI benchmarks for testing agents in games fall short because they only test small parts of games rather than an agent's ability to finish *entire* storylines. Adventure games are particularly challenging because they rely on complex stories and require the AI to remember clues and events from earlier in the game to solve puzzles and progress. The paper calls this the observation-behavior gap: the difference between what an agent *sees* happening in the game and what it later *does* with that information.

What's the solution?

The researchers created a new benchmark called FlashAdventure, which includes 34 older, Flash-based adventure games designed to test whether an AI can complete full story arcs. They also developed CUA-as-a-Judge, an automated evaluator that judges gameplay progress, and COAST, an agentic framework that keeps a long-term memory of important clues so the agent can plan ahead and solve problems that depend on events from earlier in the game.
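To make the idea of "long-term clue memory" more concrete, here is a minimal, hypothetical sketch of how such a memory might work, not the paper's actual COAST implementation. The names (`Clue`, `ClueMemory`, `recall`) and the keyword-matching retrieval are illustrative assumptions: the agent records clues as it observes them and looks them up later when a puzzle calls for them.

```python
# Hypothetical sketch of a long-term clue memory for a game-playing agent.
# Names and the simple keyword retrieval are illustrative, not from the paper.

from dataclasses import dataclass, field


@dataclass
class Clue:
    """A piece of information observed earlier in the game."""
    description: str   # e.g. "safe code 4-7-1 written on a sticky note"
    source_scene: str  # where the clue was seen


@dataclass
class ClueMemory:
    """Stores clues for the whole play session, outside the model's short context."""
    clues: list[Clue] = field(default_factory=list)

    def add(self, clue: Clue) -> None:
        self.clues.append(clue)

    def recall(self, keyword: str) -> list[Clue]:
        """Return clues whose description mentions the keyword."""
        return [c for c in self.clues if keyword.lower() in c.description.lower()]


# Example: the agent noted a code in an earlier scene and recalls it at a locked safe.
memory = ClueMemory()
memory.add(Clue("safe code 4-7-1 written on a sticky note", source_scene="office"))

relevant = memory.recall("safe")
print([c.description for c in relevant])
```

The point of the sketch is simply that clues persist across the whole game session and can be retrieved when needed, which is the kind of capability the observation-behavior gap demands; a real system would use far richer observation parsing and retrieval.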

Why it matters?

This research shows that current AI agents still struggle to complete complex, story-driven games. While COAST improves performance, there is still a significant gap between human players and the best AI agents. This highlights the need for further research toward AI that can truly understand and interact with complex digital environments like video games, and ultimately, the real world.

Abstract

GUI agents powered by LLMs show promise in interacting with diverse digital environments. Among these, video games offer a valuable testbed due to their varied interfaces, with adventure games posing additional challenges through complex, narrative-driven interactions. Existing game benchmarks, however, lack diversity and rarely evaluate agents on completing entire storylines. To address this, we introduce FlashAdventure, a benchmark of 34 Flash-based adventure games designed to test full story arc completion and tackle the observation-behavior gap: the challenge of remembering and acting on earlier gameplay information. We also propose CUA-as-a-Judge, an automated gameplay evaluator, and COAST, an agentic framework leveraging long-term clue memory to better plan and solve sequential tasks. Experiments show current GUI agents struggle with full story arcs, while COAST improves milestone completion by bridging the observation-behavior gap. Nonetheless, a marked discrepancy between humans and best-performing agents warrants continued research efforts to narrow this divide.