Surfer 2: The Next Generation of Cross-Platform Computer Use Agents

Mathieu Andreux, Märt Bakler, Yanael Barbier, Hamza Benchekroun, Emilien Biré, Antoine Bonnet, Riaz Bordie, Nathan Bout, Matthias Brunel, Aleix Cambray, Pierre-Louis Cedoz, Antoine Chassang, Gautier Cloix, Ethan Connelly, Alexandra Constantinou, Ramzi De Coster, Hubert de la Jonquiere, Aurélien Delfosse, Maxime Delpit, Alexis Deprez, Augustin Derupti, Mathieu Diaz

2025-10-31

Summary

This paper introduces Surfer 2, a unified agent architecture that carries out tasks across web, desktop, and mobile environments using only what it sees on the screen.

What's the problem?

Building AI agents that work well across platforms is difficult because most existing systems rely on interfaces specific to a single environment and cannot easily adapt to others. An agent built for websites will not necessarily work on a desktop application, which limits its usefulness.

What's the solution?

Surfer 2 addresses this with an agent that relies *only* on visual observations, that is, on what it sees on the screen. Its design combines three main parts: hierarchical context management, which keeps track of important context; a separation of the thinking (planning) from the doing (execution); and self-verification, which checks the agent's work and recovers when something goes wrong. Together, these allow it to operate reliably on tasks that require many steps.
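The three parts described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `Context`, `run_agent`, `plan`, `execute`, `verify`, and `max_retries` are hypothetical, the "observation" is a plain string standing in for a screenshot, and recovery is reduced to a simple retry. It only shows how a planner, an executor, and a verifier might be wired together.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Context:
    """Hypothetical two-level context: the overall goal plus finished subgoals."""
    goal: str
    completed: List[str] = field(default_factory=list)


def run_agent(
    goal: str,
    plan: Callable[[str], List[str]],       # planner: goal -> ordered subgoals
    execute: Callable[[str, int], str],     # executor: (subgoal, attempt) -> observation
    verify: Callable[[str, str], bool],     # verifier: (subgoal, observation) -> success?
    max_retries: int = 2,                   # assumed retry budget per subgoal
) -> Tuple[bool, Context]:
    ctx = Context(goal)
    for subgoal in plan(goal):
        for attempt in range(max_retries + 1):
            observation = execute(subgoal, attempt)
            if verify(subgoal, observation):
                ctx.completed.append(subgoal)
                break  # subgoal verified, move on to the next one
        else:
            # Every attempt failed verification: give up on the task.
            return False, ctx
    return True, ctx


# Toy stand-ins for the model-backed components (assumptions for illustration).
def toy_plan(goal: str) -> List[str]:
    return ["open settings", "enable dark mode"]


def toy_execute(subgoal: str, attempt: int) -> str:
    # Simulate a first-attempt failure on the second subgoal.
    if subgoal == "enable dark mode" and attempt == 0:
        return "error dialog"
    return f"done: {subgoal}"


def toy_verify(subgoal: str, observation: str) -> bool:
    return observation == f"done: {subgoal}"


ok, ctx = run_agent("turn on dark mode", toy_plan, toy_execute, toy_verify)
```

In this toy run the second subgoal fails verification once and succeeds on the retry, so the task completes. With `max_retries=0` the same run stops after the first subgoal and reports failure.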

Why it matters?

This work is important because it shows that general-purpose AI agents need not be limited to a single platform. Surfer 2 achieves 97.1% accuracy on WebVoyager, 69.6% on WebArena, 60.1% on OSWorld, and 87.1% on AndroidWorld, outperforming prior systems without task-specific fine-tuning, and it exceeds human performance when allowed multiple attempts. These results demonstrate the value of combining systematic orchestration with strong underlying foundation models. The work also points to the need for a next-generation vision language model to make such agents more cost-efficient.

Abstract

Building agents that generalize across web, desktop, and mobile environments remains an open challenge, as prior systems rely on environment-specific interfaces that limit cross-platform deployment. We introduce Surfer 2, a unified architecture operating purely from visual observations that achieves state-of-the-art performance across all three environments. Surfer 2 integrates hierarchical context management, decoupled planning and execution, and self-verification with adaptive recovery, enabling reliable operation over long task horizons. Our system achieves 97.1% accuracy on WebVoyager, 69.6% on WebArena, 60.1% on OSWorld, and 87.1% on AndroidWorld, outperforming all prior systems without task-specific fine-tuning. With multiple attempts, Surfer 2 exceeds human performance on all benchmarks. These results demonstrate that systematic orchestration amplifies foundation model capabilities and enables general-purpose computer control through visual interaction alone, while calling for a next-generation vision language model to achieve Pareto-optimal cost-efficiency.