Can Agent Conquer Web? Exploring the Frontiers of ChatGPT Atlas Agent in Web Games

Jingran Zhang, Ning Li, Justin Cui

2025-10-31

Summary

This paper investigates how well OpenAI's ChatGPT Atlas, which can analyze webpages and issue cursor and keyboard inputs directly in the browser, can play web-based games.

What's the problem?

Atlas has already been shown to handle information retrieval on the web, but it is less clear how well it can *do* things on websites, especially in situations that require quick reactions and precise movements, like playing games. The researchers wanted to see if Atlas could handle both thinking-based games and games that need fast reflexes.

What's the solution?

The researchers tested Atlas on four browser games: Google's T-Rex Runner, Sudoku, Flappy Bird, and Stein.world. They used each game's own scoring system to measure how well Atlas did and how it compared to a human player. They specifically looked at how Atlas performed on tasks needing logic versus those needing quick timing.

Why does it matter?

The results show Atlas is strong at games that require logical thinking, like Sudoku, where it completes puzzles significantly faster than human baselines. However, it struggles with games that need precise timing and control, like Flappy Bird, often failing to get past the first obstacles. This tells us that while Atlas can analyze and process information from the web effectively, it still has trouble with real-time interaction, which is important for building AI that can truly assist us with tasks on the internet.

Abstract

OpenAI's ChatGPT Atlas introduces new capabilities for web interaction, enabling the model to analyze webpages, process user intents, and execute cursor and keyboard inputs directly within the browser. While its capacity for information retrieval tasks has been demonstrated, its performance in dynamic, interactive environments remains less explored. In this study, we conduct an early evaluation of Atlas's web interaction capabilities using browser-based games as test scenarios, including Google's T-Rex Runner, Sudoku, Flappy Bird, and Stein.world. We employ in-game performance scores as quantitative metrics to assess performance across different task types. Our results show that Atlas performs strongly in logical reasoning tasks like Sudoku, completing puzzles significantly faster than human baselines, but struggles substantially in real-time games requiring precise timing and motor control, often failing to progress beyond initial obstacles. These findings suggest that while Atlas demonstrates capable analytical processing, there remain notable limitations in dynamic web environments requiring real-time interaction. The website of our project can be found at https://atlas-game-eval.github.io.