PC-Agent: A Hierarchical Multi-Agent Collaboration Framework for Complex Task Automation on PC
Haowei Liu, Xi Zhang, Haiyang Xu, Yuyang Wanyan, Junyang Wang, Ming Yan, Ji Zhang, Chunfeng Yuan, Changsheng Xu, Weiming Hu, Fei Huang
2025-02-21

Summary
This paper introduces PC-Agent, a hierarchical multi-agent framework that automates complex tasks on PCs by perceiving the screen and breaking user instructions down into subtasks and step-by-step actions. It improves how well MLLM-based GUI agents handle the intricate workflows found in desktop applications, both within a single app and across several apps.
What's the problem?
PC environments are harder for GUI agents than smartphones: the on-screen layout is denser and more complex, and real tasks often span multiple apps with interdependent steps. Current multimodal models also struggle to accurately perceive screenshot content, which limits their ability to make reliable decisions.
What's the solution?
The researchers built PC-Agent around two ideas. An Active Perception Module (APM) improves how the system reads screenshots, and a hierarchical architecture splits decision-making into Instruction-Subtask-Action levels handled by three agents: a Manager that decomposes the instruction, a Progress agent that tracks what has been completed, and a Decision agent that picks each step's action. A fourth Reflection agent sends error feedback back up the hierarchy so the plan can be adjusted. On PC-Eval, a new benchmark of 25 real-world complex instructions, PC-Agent improved the task success rate by 32 absolute percentage points over previous state-of-the-art methods.
Why it matters?
This matters because reliable PC automation could handle real multi-step work, like gathering information across several apps and compiling it into a document, which is more demanding than the single-app tasks most mobile agents tackle. By combining better screen perception with structured planning and timely error correction, this research makes GUI agents more practical and reliable for everyday desktop workflows.
Abstract
In the field of MLLM-based GUI agents, compared to smartphones, the PC scenario not only features a more complex interactive environment, but also involves more intricate intra- and inter-app workflows. To address these issues, we propose a hierarchical agent framework named PC-Agent. Specifically, from the perception perspective, we devise an Active Perception Module (APM) to overcome the inadequate abilities of current MLLMs in perceiving screenshot content. From the decision-making perspective, to handle complex user instructions and interdependent subtasks more effectively, we propose a hierarchical multi-agent collaboration architecture that decomposes decision-making processes into Instruction-Subtask-Action levels. Within this architecture, three agents (i.e., Manager, Progress and Decision) are set up for instruction decomposition, progress tracking and step-by-step decision-making respectively. Additionally, a Reflection agent is adopted to enable timely bottom-up error feedback and adjustment. We also introduce a new benchmark PC-Eval with 25 real-world complex instructions. Empirical results on PC-Eval show that our PC-Agent achieves a 32% absolute improvement of task success rate over previous state-of-the-art methods. The code will be publicly available.
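The Instruction-Subtask-Action hierarchy described in the abstract can be sketched as a simple control loop. This is a minimal, hypothetical illustration of the collaboration pattern only: all class and method names are assumptions for the sketch, and the placeholder logic stands in for the MLLM calls and screenshot perception the real system would use.

```python
from dataclasses import dataclass

# Illustrative sketch of the Manager / Progress / Decision / Reflection
# collaboration pattern. Names and interfaces are assumptions, not the
# paper's actual API; each agent here uses placeholder logic in place
# of an MLLM call.

@dataclass
class Subtask:
    description: str
    done: bool = False

class Manager:
    """Instruction level: decomposes a complex instruction into subtasks."""
    def decompose(self, instruction: str) -> list:
        # Placeholder: split on ";" instead of querying an MLLM.
        return [Subtask(part.strip()) for part in instruction.split(";")]

class Progress:
    """Subtask level: tracks completion and picks the next pending subtask."""
    def next_subtask(self, subtasks):
        return next((s for s in subtasks if not s.done), None)

class Decision:
    """Action level: chooses a concrete GUI action for the current subtask."""
    def act(self, subtask: Subtask) -> str:
        return f"action for: {subtask.description}"

class Reflection:
    """Bottom-up feedback: verifies an action's outcome before moving on."""
    def verify(self, action: str) -> bool:
        return True  # placeholder: would inspect a fresh screenshot

def run(instruction: str) -> list:
    manager, progress = Manager(), Progress()
    decision, reflection = Decision(), Reflection()
    subtasks = manager.decompose(instruction)
    trace = []
    while (current := progress.next_subtask(subtasks)) is not None:
        action = decision.act(current)
        if reflection.verify(action):  # on failure, a real agent would re-plan
            current.done = True
            trace.append(action)
    return trace

print(run("open the browser; search for weather; copy the result into Notepad"))
```

The point of the structure is separation of concerns: top-down decomposition (Manager), state tracking (Progress), per-step decisions (Decision), and bottom-up error feedback (Reflection), matching the levels the abstract names.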