Robix: A Unified Model for Robot Interaction, Reasoning and Planning

Huang Fang, Mengxi Zhang, Heng Dong, Wei Li, Zixuan Wang, Qifeng Zhang, Xueyun Tian, Yucheng Hu, Hang Li

2025-09-04

Summary

This paper introduces Robix, a new artificial intelligence system designed to give robots more human-like thinking and communication skills. It's a single system that allows a robot to understand what you want, plan how to do it, and talk to you naturally while it's working.

What's the problem?

Currently, robots struggle with complex tasks that require understanding instructions, planning multiple steps, and adapting to unexpected situations or interruptions. Existing systems often handle these things separately, leading to robots that aren't very flexible or easy to interact with. They also have trouble with common sense reasoning – things humans just *know* without being told.

What's the solution?

The researchers created Robix, a single vision-language model that handles reasoning, task planning, and conversation together. It uses 'chain-of-thought' reasoning, meaning it breaks down problems into smaller steps before deciding what to do. They trained Robix in three stages: first, continued pretraining to strengthen its basic embodied reasoning, such as 3D spatial understanding, matching words to objects in an image (visual grounding), and reasoning about tasks. Then, supervised finetuning on examples that treat human-robot interaction and task planning as one combined sequence of reasoning and actions. Finally, reinforcement learning to make its actions more consistent with its reasoning and to keep long, multi-step tasks coherent. As a result, Robix generates both commands for the robot to perform and responses to say to people.

Why does it matter?

Robix is a step towards more capable and user-friendly robots. In the authors' experiments, it outperforms both open-source and commercial AI systems, including GPT-4o and Gemini 2.5 Pro, on interactive tasks such as clearing a table (table bussing), grocery shopping, and filtering food by dietary needs. It also copes with many kinds of instructions, including open-ended, multi-stage, constrained, invalid, and interrupted ones. This means we're getting closer to robots that can assist us in everyday life, understand our needs, and handle unexpected situations without constant human intervention.

Abstract

We introduce Robix, a unified model that integrates robot reasoning, task planning, and natural language interaction within a single vision-language architecture. Acting as the high-level cognitive layer in a hierarchical robot system, Robix dynamically generates atomic commands for the low-level controller and verbal responses for human interaction, enabling robots to follow complex instructions, plan long-horizon tasks, and interact naturally with humans within an end-to-end framework. Robix further introduces novel capabilities such as proactive dialogue, real-time interruption handling, and context-aware commonsense reasoning during task execution. At its core, Robix leverages chain-of-thought reasoning and adopts a three-stage training strategy: (1) continued pretraining to enhance foundational embodied reasoning abilities including 3D spatial understanding, visual grounding, and task-centric reasoning; (2) supervised finetuning to model human-robot interaction and task planning as a unified reasoning-action sequence; and (3) reinforcement learning to improve reasoning-action consistency and long-horizon task coherence. Extensive experiments demonstrate that Robix outperforms both open-source and commercial baselines (e.g., GPT-4o and Gemini 2.5 Pro) in interactive task execution, demonstrating strong generalization across diverse instruction types (e.g., open-ended, multi-stage, constrained, invalid, and interrupted) and various user-involved tasks such as table bussing, grocery shopping, and dietary filtering.
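
To make the hierarchical setup in the abstract more concrete, here is a minimal Python sketch of a high-level loop that, at each step, produces a thought plus an optional atomic command for a low-level controller and an optional verbal response for the user. It is an illustration only, not the authors' implementation: `StepOutput`, `Observation`, `run_episode`, and `toy_policy` are hypothetical names, and the rule-based `toy_policy` merely stands in for the vision-language model.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Observation:
    image: str                            # placeholder for a camera frame
    user_utterance: Optional[str] = None  # what the user said, if anything


@dataclass
class StepOutput:
    thought: str              # chain-of-thought text (hypothetical format)
    action: Optional[str]     # atomic command for the low-level controller
    response: Optional[str]   # verbal reply to the user


# Hypothetical policy signature: full history in, one reasoning-action step out.
Policy = Callable[[List[Observation], List[StepOutput]], StepOutput]


def run_episode(policy: Policy,
                observations: List[Observation],
                execute: Callable[[str], None],
                speak: Callable[[str], None]) -> List[StepOutput]:
    """Feed observations to the policy one at a time and dispatch its outputs."""
    obs_history: List[Observation] = []
    out_history: List[StepOutput] = []
    for obs in observations:
        obs_history.append(obs)
        out = policy(obs_history, out_history)
        out_history.append(out)
        if out.response is not None:
            speak(out.response)    # natural-language interaction
        if out.action is not None:
            execute(out.action)    # hand off to the low-level controller
    return out_history


def toy_policy(obs_history: List[Observation],
               out_history: List[StepOutput]) -> StepOutput:
    """Rule-based stand-in for the model; handles a spoken interruption."""
    latest = obs_history[-1]
    if latest.user_utterance and "stop" in latest.user_utterance.lower():
        return StepOutput("The user interrupted, so halt.", None, "Okay, stopping.")
    return StepOutput("A cup is visible, so pick it up.", "pick up the cup", None)
```

Passing `list.append` callbacks as `execute` and `speak` is enough to trace which commands would be sent to the controller and which replies would be spoken; in a real system these would be replaced by the robot's controller and speech interfaces.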
