OmniEVA: Embodied Versatile Planner via Task-Adaptive 3D-Grounded and Embodiment-aware Reasoning
Yuecheng Liu, Dafeng Chi, Shiguang Wu, Zhanguang Zhang, Yuzheng Zhuang, Bowen Yang, He Zhu, Lingfeng Zhang, Pengwei Xie, David Gamaliel Arcos Bravo, Yingxue Zhang, Jianye Hao, Xingyue Quan
2025-09-12
Summary
This paper introduces OmniEVA, a new system designed to make robots smarter at understanding and interacting with the world around them. It builds on recent advances in AI models that can process both text and visual information, aiming to improve how robots plan and carry out tasks in real-world environments.
What's the problem?
Current AI systems for robots struggle with two main issues. First, they often don't fully grasp 3D space, either because they're only trained on 2D images or because their understanding of 3D is too rigid. This makes it hard for them to adapt to new and different spatial situations. Second, these systems often ignore the physical limitations of the robot itself – like how far it can reach or what it can lift – leading to plans that sound good on paper but that the robot can't actually execute.
What's the solution?
The researchers developed OmniEVA, which tackles these problems with two key ideas. First, it uses a 'Task-Adaptive 3D Grounding' mechanism that smartly decides when and how to use 3D information, focusing on what's relevant for the specific task at hand. Think of it like the AI focusing its attention on the important parts of the 3D environment. Second, it incorporates an 'Embodiment-Aware Reasoning' framework that considers both the task goals *and* the robot's physical abilities during the planning process, ensuring the plans are actually possible for the robot to carry out.
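To make the first idea concrete, here is a minimal sketch of what a gated router for task-adaptive 3D fusion could look like. The module and parameter names (Gated3DFusion, dim, the router MLP) are illustrative assumptions, not the paper's actual architecture: the point is simply that a learned gate, conditioned on the task context, decides how much 3D geometry gets blended into the 2D visual token stream.

```python
# Minimal PyTorch sketch of a gated router for task-adaptive 3D fusion.
# All names and sizes here are hypothetical; the paper's architecture may differ.
import torch
import torch.nn as nn


class Gated3DFusion(nn.Module):
    """Fuse 2D visual tokens with 3D geometric tokens, scaled by a
    context-dependent gate so 3D information is injected only when the
    task actually calls for spatial reasoning."""

    def __init__(self, dim: int = 1024):
        super().__init__()
        # Router scores how much the current task context needs 3D grounding.
        self.router = nn.Sequential(
            nn.Linear(dim, dim // 4), nn.GELU(), nn.Linear(dim // 4, 1)
        )
        # Cross-attention pulls geometry into the 2D token stream.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, vis_tokens, geo_tokens, task_embedding):
        # vis_tokens:     (B, N, dim) 2D image tokens
        # geo_tokens:     (B, M, dim) 3D tokens (e.g., from a point-cloud encoder)
        # task_embedding: (B, dim)    pooled instruction embedding
        gate = torch.sigmoid(self.router(task_embedding))          # (B, 1)
        fused, _ = self.cross_attn(vis_tokens, geo_tokens, geo_tokens)
        # Gate near 0 keeps the pure 2D pathway; gate near 1 blends in 3D geometry.
        return vis_tokens + gate.unsqueeze(1) * fused
```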
Why it matters?
This work is important because it represents a significant step towards creating robots that can reliably perform complex tasks in the real world. By improving the robot’s spatial understanding and its ability to account for its own limitations, OmniEVA paves the way for more versatile and practical robots that can assist us in a wider range of situations.
Abstract
Recent advances in multimodal large language models (MLLMs) have opened new opportunities for embodied intelligence, enabling multimodal understanding, reasoning, and interaction, as well as continuous spatial decision-making. Nevertheless, current MLLM-based embodied systems face two critical limitations. First, Geometric Adaptability Gap: models trained solely on 2D inputs or with hard-coded 3D geometry injection suffer from either insufficient spatial information or restricted 2D generalization, leading to poor adaptability across tasks with diverse spatial demands. Second, Embodiment Constraint Gap: prior work often neglects the physical constraints and capacities of real robots, resulting in task plans that are theoretically valid but practically infeasible. To address these gaps, we introduce OmniEVA -- an embodied versatile planner that enables advanced embodied reasoning and task planning through two pivotal innovations: (1) a Task-Adaptive 3D Grounding mechanism, which introduces a gated router to perform explicit selective regulation of 3D fusion based on contextual requirements, enabling context-aware 3D grounding for diverse embodied tasks. (2) an Embodiment-Aware Reasoning framework that jointly incorporates task goals and embodiment constraints into the reasoning loop, resulting in planning decisions that are both goal-directed and executable. Extensive experimental results demonstrate that OmniEVA not only achieves state-of-the-art general embodied reasoning performance, but also exhibits a strong ability across a wide range of downstream scenarios. Evaluations of a suite of proposed embodied benchmarks, including both primitive and composite tasks, confirm its robust and versatile planning capabilities. Project page: https://omnieva.github.io
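The second innovation, embodiment-aware reasoning, amounts to keeping the robot's physical limits inside the planning loop rather than checking them after the fact. The toy sketch below illustrates the general idea only; the EmbodimentSpec fields, Action structure, and feasible() helper are made-up names for illustration and are not taken from the paper.

```python
# Illustrative sketch of folding embodiment constraints into planning:
# candidate actions are filtered against the robot's physical limits,
# so only goal-directed AND executable steps survive.
from dataclasses import dataclass


@dataclass
class EmbodimentSpec:
    max_reach_m: float      # maximum reachable distance from the base
    max_payload_kg: float   # maximum liftable mass


@dataclass
class Action:
    name: str
    target_distance_m: float
    payload_kg: float


def feasible(action: Action, spec: EmbodimentSpec) -> bool:
    """Reject plans that are theoretically valid but physically infeasible."""
    return (action.target_distance_m <= spec.max_reach_m
            and action.payload_kg <= spec.max_payload_kg)


robot = EmbodimentSpec(max_reach_m=0.8, max_payload_kg=2.0)
candidates = [Action("pick_mug", 0.5, 0.3), Action("lift_chair", 0.6, 7.0)]
plan = [a for a in candidates if feasible(a, robot)]
print([a.name for a in plan])   # only 'pick_mug' passes the constraint check
```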