
OmniGAIA: Towards Native Omni-Modal AI Agents

Xiaoxi Li, Wenxiang Jiao, Jiarui Jin, Shijian Wang, Guanting Dong, Jiajie Jin, Hao Wang, Yinuo Wang, Ji-Rong Wen, Yuan Lu, Zhicheng Dou

2026-02-27

Summary

This paper introduces a new way to test and build AI systems that can understand and interact with the world through multiple channels at once – sight, sound, and language – while also using external tools to complete tasks.

What's the problem?

Current AI models, even advanced ones like large language models, usually handle only two types of information at a time, such as images and text. They struggle to combine several input streams at once (say, a video together with its soundtrack and spoken instructions) and to use tools to complete complex tasks, something humans do naturally. This limits their ability to be truly helpful assistants in real-world situations.

What's the solution?

The researchers created OmniGAIA, a challenging set of tests that requires AI to reason across video, audio, and images, and to use tools. They also built OmniAtlas, an AI agent designed to excel at these tests. OmniAtlas learns by practicing on simulated scenarios and getting feedback on its mistakes, improving its ability to use tools effectively. It builds upon existing open-source AI models, making them more capable.
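The multi-turn tool use described above can be pictured as a simple perceive-act loop: the agent alternates between calling tools to gather cross-modal evidence and producing a final answer. The paper does not publish OmniAtlas's API, so every name below (`Tool`, `run_agent`, the scripted policy) is an illustrative assumption, not the authors' implementation.

```python
# Hypothetical sketch of a multi-turn tool-integrated reasoning loop of the
# kind OmniGAIA evaluates. All names here are illustrative assumptions.
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class Tool:
    name: str
    fn: Callable[[str], str]  # takes a query string, returns an observation

def run_agent(question: str,
              policy: Callable[[List[Tuple[str, str]]], Tuple[str, str]],
              tools: Dict[str, Tool],
              max_turns: int = 5) -> str:
    """Each turn, the policy either calls a tool or emits a final answer."""
    history: List[Tuple[str, str]] = [("question", question)]
    for _ in range(max_turns):
        action, arg = policy(history)        # e.g. ("call", "asr:clip.wav")
        if action == "answer":
            return arg
        tool_name, query = arg.split(":", 1)
        observation = tools[tool_name].fn(query)   # execute the tool
        history.append((tool_name, observation))   # feed the result back
    return "no answer"

# Toy "audio transcription" tool and a scripted two-turn policy for demonstration.
tools = {"asr": Tool("asr", lambda q: f"transcript of {q}")}

def scripted_policy(history):
    if len(history) == 1:                    # first turn: gather evidence
        return ("call", "asr:clip.wav")
    return ("answer", history[-1][1])        # then answer from the observation

print(run_agent("What is said in the clip?", scripted_policy, tools))
# prints "transcript of clip.wav"
```

In the actual system the policy would be the omni-modal LLM itself and the tools would span video, audio, and image processing; the loop structure is what "multi-turn tool execution" refers to.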

Why it matters?

This work is a step towards creating more advanced AI assistants that can understand the world as we do, combining information from different senses and using tools to solve problems. This could lead to AI that is much more helpful and versatile in everyday life, capable of handling complex tasks in real-world environments.

Abstract

Human intelligence naturally intertwines omni-modal perception -- spanning vision, audio, and language -- with complex reasoning and tool usage to interact with the world. However, current multi-modal LLMs are primarily confined to bi-modal interactions (e.g., vision-language), lacking the unified cognitive capabilities required for general AI assistants. To bridge this gap, we introduce OmniGAIA, a comprehensive benchmark designed to evaluate omni-modal agents on tasks necessitating deep reasoning and multi-turn tool execution across video, audio, and image modalities. Constructed via a novel omni-modal event graph approach, OmniGAIA synthesizes complex, multi-hop queries derived from real-world data that require cross-modal reasoning and external tool integration. Furthermore, we propose OmniAtlas, a native omni-modal foundation agent under a tool-integrated reasoning paradigm with active omni-modal perception. Trained on trajectories synthesized via a hindsight-guided tree exploration strategy and OmniDPO for fine-grained error correction, OmniAtlas effectively enhances the tool-use capabilities of existing open-source models. This work marks a step towards next-generation native omni-modal AI assistants for real-world scenarios.
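The abstract's OmniDPO is described only as a fine-grained error-correction variant; its exact objective is not given here. As background, methods in this family build on the standard Direct Preference Optimization (DPO) loss, which pushes the policy to prefer a winning trajectory over a losing one relative to a frozen reference model. The sketch below shows that standard loss; the log-probabilities and `beta` value are illustrative numbers, not values from the paper.

```python
import math

def dpo_loss(logp_w_policy: float, logp_l_policy: float,
             logp_w_ref: float, logp_l_ref: float,
             beta: float = 0.1) -> float:
    """Standard DPO objective: -log sigmoid(beta * margin), where the margin
    compares how much more the policy prefers the winning trajectory (w)
    over the losing one (l) than the reference model does."""
    margin = beta * ((logp_w_policy - logp_w_ref) - (logp_l_policy - logp_l_ref))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# If the policy already prefers the winning trajectory more strongly than the
# reference does, the margin is positive and the loss falls below log(2).
print(dpo_loss(-5.0, -9.0, -6.0, -8.0))
```

In a trajectory-level setting like OmniAtlas training, the winning and losing items would be full tool-use trajectories, with "fine-grained" correction plausibly operating on the erroneous steps rather than whole trajectories.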