WebGym: Scaling Training Environments for Visual Web Agents with Realistic Tasks

Hao Bai, Alexey Taymanov, Tong Zhang, Aviral Kumar, Spencer Whitehead

2026-01-07

Summary

This paper introduces WebGym, a large, diverse environment of realistic tasks for training AI agents to interact with real websites, and shows that training on it significantly improves an agent's ability to complete tasks on websites it has never seen before.

What's the problem?

Training AI agents to reliably use websites is difficult because real websites are incredibly varied and constantly changing. Existing methods often train on artificial or small task sets, which don't prepare agents for the real web. Building a training environment that reflects the complexity and non-stationarity of the internet is hard, and gathering enough agent experience to train on is computationally expensive.

What's the solution?

The researchers created WebGym, a collection of nearly 300,000 tasks on real websites, each with a rubric-based evaluation that scores whether the agent succeeded. They trained agents with a standard reinforcement learning (RL) approach, in which the agent learns from its own interaction traces (rollouts), using task rewards as feedback. Because collecting rollouts on live websites is the main bottleneck, they built a high-throughput asynchronous rollout system that gathers experience 4-5x faster than a naive implementation. They then fine-tuned a strong vision-language model, Qwen-3-VL-8B-Instruct, on WebGym; a minimal sketch of the asynchronous rollout idea is shown below.
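
To make the rollout-throughput idea concrete, here is a minimal sketch of an asynchronous rollout loop in Python. Everything in it is a hypothetical stand-in: `WebEnv`, `rollout`, and `collect` are not WebGym's actual API, and the real system manages browser pools and rubric-based reward evaluation rather than the stubs used here.

```python
# Minimal sketch of asynchronous rollout collection (hypothetical API).
import asyncio
import random

class WebEnv:
    """Stand-in for a browser-backed web environment; real steps are slow I/O."""
    async def reset(self, task):
        await asyncio.sleep(0.01)                       # pretend to load a page
        return {"screenshot": None, "task": task}

    async def step(self, action):
        await asyncio.sleep(0.01)                       # pretend to click/type
        done = random.random() < 0.2                    # episode ends stochastically
        reward = float(done and random.random() < 0.5)  # stub for a rubric reward
        return {"screenshot": None}, reward, done

async def rollout(env, policy, task, max_steps=15):
    """Collect one trajectory of (observation, action, reward) tuples."""
    obs = await env.reset(task)
    traj = []
    for _ in range(max_steps):
        action = policy(obs)
        next_obs, reward, done = await env.step(action)
        traj.append((obs, action, reward))
        obs = next_obs
        if done:
            break
    return traj

async def collect(tasks, policy, num_workers=8):
    """Run many rollouts concurrently so slow page interactions overlap
    across episodes instead of serializing behind a single browser."""
    queue = asyncio.Queue()
    for t in tasks:
        queue.put_nowait(t)
    trajectories = []

    async def worker():
        env = WebEnv()
        while True:
            try:
                task = queue.get_nowait()
            except asyncio.QueueEmpty:
                return
            trajectories.append(await rollout(env, policy, task))

    await asyncio.gather(*(worker() for _ in range(num_workers)))
    return trajectories

if __name__ == "__main__":
    policy = lambda obs: "click"  # placeholder for the vision-language policy
    trajs = asyncio.run(collect([f"task-{i}" for i in range(32)], policy))
    print(f"collected {len(trajs)} trajectories")
```

The speedup in a design like this comes from overlapping slow, I/O-bound page loads and clicks across many concurrent episodes, which is the same bottleneck the paper's asynchronous rollout system targets.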

Why it matters?

This work is important because it shows that it's possible to build AI agents that can effectively use the web, even on websites they haven't been specifically trained on. The fine-tuned 8B agent reached a 42.9% success rate on never-before-seen websites, up from 26.2% for the base model, and actually performed *better* than agents powered by closed-source models like GPT-4o (27.1%) and GPT-5-Thinking (29.8%), demonstrating the value of training on a large, realistic, and diverse dataset like WebGym. This could lead to more useful and adaptable AI assistants that help people with tasks online.

Abstract

We present WebGym, the largest-to-date open-source environment for training realistic visual web agents. Real websites are non-stationary and diverse, making artificial or small-scale task sets insufficient for robust policy learning. WebGym contains nearly 300,000 tasks with rubric-based evaluations across diverse, real-world websites and difficulty levels. We train agents with a simple reinforcement learning (RL) recipe, which trains on the agent's own interaction traces (rollouts), using task rewards as feedback to guide learning. To scale RL, we first speed up trajectory sampling in WebGym by developing a high-throughput asynchronous rollout system designed specifically for web agents; our system achieves a 4-5x rollout speedup compared to naive implementations. Second, we scale the task set's breadth, depth, and size, which results in continued performance improvements. Fine-tuning a strong base vision-language model, Qwen-3-VL-8B-Instruct, on WebGym improves the success rate on an out-of-distribution test set from 26.2% to 42.9%, significantly outperforming agents based on proprietary models such as GPT-4o and GPT-5-Thinking, which achieve 27.1% and 29.8%, respectively. This improvement is substantial because our test set consists only of tasks on websites never seen during training, unlike in much prior work on training visual web agents.
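
To illustrate what the RL recipe in the abstract can look like in code, here is a minimal, hypothetical REINFORCE-style update over trajectories of (observation, action, reward) tuples like those collected in the sketch above. The `ToyPolicy`, tensor shapes, and hyperparameters are illustrative stand-ins; the paper fine-tunes Qwen-3-VL-8B-Instruct with its own recipe, which this sketch does not reproduce.

```python
# Minimal REINFORCE-style update on the agent's own rollouts (illustrative).
import torch
import torch.nn as nn

class ToyPolicy(nn.Module):
    """Stand-in for the vision-language policy: scores a small action set."""
    def __init__(self, obs_dim=16, num_actions=4):
        super().__init__()
        self.net = nn.Linear(obs_dim, num_actions)

    def forward(self, obs):
        return torch.log_softmax(self.net(obs), dim=-1)  # log pi(a|s)

def reinforce_update(policy, optimizer, trajectories, gamma=0.99):
    """Task rewards are the only learning signal: log-probabilities of the
    actions the agent took are weighted by the discounted return-to-go."""
    losses = []
    for traj in trajectories:
        ret = 0.0
        for obs, action, reward in reversed(traj):
            ret = reward + gamma * ret        # discounted return-to-go
            logp = policy(obs)[action]        # log-prob of the taken action
            losses.append(-logp * ret)        # policy-gradient loss term
    loss = torch.stack(losses).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage with synthetic rollouts (observations as random feature vectors):
policy = ToyPolicy()
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
trajs = [[(torch.randn(16), torch.randint(4, ()).item(), 1.0)] for _ in range(8)]
print(reinforce_update(policy, optimizer, trajs))
```

The property this sketch shares with the paper's recipe is that the only supervision is the task reward attached to the agent's own rollouts; no action-level labels are assumed.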