OS-Symphony: A Holistic Framework for Robust and Generalist Computer-Using Agent

Bowen Yang, Kaiming Jin, Zhenyu Wu, Zhaoyang Liu, Qiushi Sun, Zehao Li, JingJing Xie, Zhoumianze Liu, Fangzhi Xu, Kanzhi Cheng, Qingyun Li, Yian Wang, Yu Qiao, Zun Wang, Zichen Ding

2026-01-13

Summary

This paper introduces OS-Symphony, a framework designed to make computer-using agents, which combine vision (what they 'see' on screen) and language (what they 'understand'), more reliable on long, multi-step tasks and better at handling unfamiliar situations.

What's the problem?

Current computer-using agents struggle with tasks that take many steps because they 'forget' important visual information from earlier steps. They also have trouble adapting to applications or situations they haven't been specifically trained for, which leads to errors and failures. In short, they cannot reliably correct themselves over long tasks or find visually relevant guidance in unfamiliar environments.

What's the solution?

OS-Symphony tackles these issues with two main ideas, coordinated by an Orchestrator. First, a 'Reflection-Memory Agent' keeps a long-term memory of key milestones in a task, so the agent can look back over its whole trajectory and correct mistakes using earlier visual context. Second, 'Versatile Tool Agents' include a Multimodal Searcher that browses the web inside a sandbox and assembles up-to-date, visually aligned tutorials when the agent faces an unfamiliar problem, much like looking up how a human would do the task before attempting it. The searcher follows a 'SeeAct' paradigm as it navigates, which lets the agent pick up new procedures on the fly.
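The memory idea above can be pictured with a small Python sketch. It is an illustration only, not the paper's implementation: the names `Step`, `MilestoneMemory`, `is_milestone`, and `looks_stuck` are hypothetical, screenshots are stood in for by file names, and the summary does not say how milestones are actually selected or how reflection is performed.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    index: int
    action: str         # textual action, e.g. "click(Save)"
    screenshot: str     # stand-in for image data (assumption: a file name)
    is_milestone: bool  # hypothetical flag; the selection rule is not given in the source


@dataclass
class MilestoneMemory:
    """Keeps every action as text but retains screenshots only for milestone steps."""
    actions: list[str] = field(default_factory=list)
    milestones: list[Step] = field(default_factory=list)

    def record(self, step: Step) -> None:
        self.actions.append(step.action)
        if step.is_milestone:
            self.milestones.append(step)

    def visual_context(self, max_items: int = 3) -> list[str]:
        """Return screenshots of the most recent milestones, oldest first."""
        return [s.screenshot for s in self.milestones[-max_items:]]

    def looks_stuck(self, window: int = 3) -> bool:
        """Crude trajectory-level check: the last `window` actions are identical."""
        recent = self.actions[-window:]
        return len(recent) == window and len(set(recent)) == 1


# Usage: one milestone followed by three repeated, non-milestone actions.
memory = MilestoneMemory()
memory.record(Step(0, "open(Settings)", "shot_0.png", True))
memory.record(Step(1, "click(Save)", "shot_1.png", False))
memory.record(Step(2, "click(Save)", "shot_2.png", False))
memory.record(Step(3, "click(Save)", "shot_3.png", False))
print(memory.visual_context())  # ['shot_0.png']
print(memory.looks_stuck())     # True
```

The point of the sketch is the trade-off: the visual context stays small because only milestone screenshots are kept, while the full action history remains available for spotting problems such as a repeated action.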

Why does it matter?

This research matters because it is a step towards more reliable and adaptable computer-using agents. Such agents could eventually automate a wide range of everyday computer tasks, from booking travel to managing finances. The reported gains hold across model sizes, and the framework sets new state-of-the-art results on three online benchmarks, including 65.84% on OSWorld.

Abstract

While Vision-Language Models (VLMs) have significantly advanced Computer-Using Agents (CUAs), current frameworks struggle with robustness in long-horizon workflows and generalization in novel domains. These limitations stem from a lack of granular control over historical visual context curation and the absence of visual-aware tutorial retrieval. To bridge these gaps, we introduce OS-Symphony, a holistic framework that comprises an Orchestrator coordinating two key innovations for robust automation: (1) a Reflection-Memory Agent that utilizes milestone-driven long-term memory to enable trajectory-level self-correction, effectively mitigating visual context loss in long-horizon tasks; (2) Versatile Tool Agents featuring a Multimodal Searcher that adopts a SeeAct paradigm to navigate a browser-based sandbox to synthesize live, visually aligned tutorials, thereby resolving fidelity issues in unseen scenarios. Experimental results demonstrate that OS-Symphony delivers substantial performance gains across varying model scales, establishing new state-of-the-art results on three online benchmarks, notably achieving 65.84% on OSWorld.