InfantAgent-Next: A Multimodal Generalist Agent for Automated Computer Interaction
Bin Lei, Weitai Kang, Zijian Zhang, Winson Chen, Xi Xie, Shan Zuo, Mimi Xie, Ali Payani, Mingyi Hong, Yan Yan, Caiwen Ding
2025-05-27
Summary
This paper introduces InfantAgent-Next, a multimodal AI agent that can understand visual information and use digital tools to interact with computers in many different ways. It is designed to handle a wide range of computer tasks by combining models that can see with models that can use tools, all within a flexible, modular system.
What's the problem?
The problem is that most AI agents are good at only one thing, such as recognizing images or using software tools, but not both. This makes it hard for them to solve more complex tasks that require both skills, especially when working across different kinds of software or computer environments.
What's the solution?
The authors created InfantAgent-Next, which brings together vision models and tool-using models in a modular way, meaning the system can easily switch between or combine different abilities. They tested it on several challenging benchmarks that require both seeing and interacting with computer systems, including OSWorld, GAIA, and SWE-Bench, and it was able to solve a variety of tasks. A simple sketch of this modular dispatch idea follows below.
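To make the modular idea concrete, here is a minimal, hypothetical sketch in Python: a dispatcher routes each step of a plan either to a vision module (which would interpret a screenshot) or to a tool module (which would run a command). All class and method names here are illustrative assumptions, not the paper's actual API.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Step:
    kind: str      # "vision" or "tool"
    payload: str   # e.g. a screenshot path or a shell command


class Module(Protocol):
    def run(self, payload: str) -> str: ...


class VisionModule:
    def run(self, payload: str) -> str:
        # Placeholder: a real agent would call a vision-language model here.
        return f"described the screen contents of {payload}"


class ToolModule:
    def run(self, payload: str) -> str:
        # Placeholder: a real agent would execute a tool call (shell, editor, browser).
        return f"executed tool command: {payload}"


class ModularAgent:
    """Dispatches each step of a plan to the module that handles its modality."""

    def __init__(self) -> None:
        self.modules: dict[str, Module] = {
            "vision": VisionModule(),
            "tool": ToolModule(),
        }

    def solve(self, plan: list[Step]) -> list[str]:
        return [self.modules[step.kind].run(step.payload) for step in plan]


if __name__ == "__main__":
    agent = ModularAgent()
    plan = [
        Step("vision", "screenshot.png"),
        Step("tool", "ls ~/project"),
    ]
    for result in agent.solve(plan):
        print(result)
```

Because each capability sits behind its own module, swapping in a different vision model or adding a new tool only changes one entry in the dispatcher, which is the kind of flexibility the paper's modular design aims for.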
Why it matters?
This is important because it shows that AI can become more versatile and useful in real-world computer tasks, like operating software, navigating operating systems, or assisting with coding. It brings us closer to having digital assistants that can handle almost any computer-related job.
Abstract
InfantAgent-Next is a multimodal agent that integrates tool-based and vision models in a modular architecture to solve various benchmarks, including OSWorld, GAIA, and SWE-Bench.