VITA-E: Natural Embodied Interaction with Concurrent Seeing, Hearing, Speaking, and Acting
Xiaoyu Liu, Chaoyou Fu, Chi Yan, Chu Wu, Haihan Gao, Yi-Fan Zhang, Shaoqi Dong, Cheng Qian, Bin Luo, Xiuyong Yang, Guanwu Li, Yusheng Cai, Yunhang Shen, Deqiang Jiang, Haoyu Cao, Xing Sun, Caifeng Shan, Ran He
2025-10-28
Summary
This paper introduces VITA-E, a system that aims to make robots and virtual assistants more responsive and capable of handling multiple tasks at once, much as a human can.
What's the problem?
Current robots and AI assistants struggle to do multiple things at once, such as listening to you, responding, and acting in the real world simultaneously. They also handle unexpected interruptions poorly: if you try to change their instructions mid-task, they often get confused or stop working. This makes interacting with them feel unnatural and frustrating, because they aren't as flexible as a person would be.
What's the solution?
The researchers created VITA-E, which uses two AI 'brains' working in parallel: one actively performs the current task while the other stands by, ready to take over the moment there's an interruption or a new instruction. They also taught the model to emit special tokens that act as direct system commands, tightly coupling its reasoning with the robot's behavior. This lets the robot listen, speak, and act all at the same time and respond quickly to changes in plans.
Why it matters?
This work is a big step towards creating AI assistants that are truly helpful and easy to interact with. By allowing robots to handle multiple tasks and interruptions seamlessly, VITA-E makes them more reliable, responsive, and capable of working alongside people in real-world situations, ultimately leading to more natural and effective collaboration.
Abstract
Current Vision-Language-Action (VLA) models are often constrained by a rigid, static interaction paradigm, which lacks the ability to see, hear, speak, and act concurrently or to handle real-time user interruptions dynamically. This hinders seamless embodied collaboration, resulting in an inflexible and unresponsive user experience. To address these limitations, we introduce VITA-E, a novel embodied interaction framework designed for both behavioral concurrency and near-real-time interruption. The core of our approach is a dual-model architecture in which two parallel VLA instances operate as an "Active Model" and a "Standby Model", allowing the embodied agent to observe its environment, listen to user speech, provide verbal responses, and execute actions, all concurrently and interruptibly, mimicking human-like multitasking capabilities. We further propose a "model-as-controller" paradigm, where we fine-tune the VLM to generate special tokens that serve as direct system-level commands, coupling the model's reasoning with the system's behavior. Experiments conducted on a physical humanoid platform demonstrate that VITA-E can reliably handle complex interactive scenarios. Our framework is compatible with various dual-system VLA models, achieving an extremely high success rate on emergency stops and speech interruptions while also successfully performing concurrent speech and action. This represents a significant step towards more natural and capable embodied assistants.