HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System

Tianshuo Yang, Guanyu Chen, Yutian Chen, Zhixuan Liang, Yitian Liu, Zanxin Chen, Chunpu Xu, Haotian Liang, Jiangmiao Pang, Yao Mu, Ping Luo

2026-04-17

Summary

This paper introduces a new system called HiVLA that helps robots understand instructions and perform complex tasks involving objects, combining vision, language, and action.

What's the problem?

Current robots built on 'end-to-end' systems, where perception, reasoning, and control are all learned together, face a trade-off: fine-tuning them on specific actions often erodes their general ability to understand instructions and reason about tasks. Basically, making them good at *doing* something makes them worse at *thinking* about what to do.

What's the solution?

HiVLA solves this by separating the 'thinking' part from the 'doing' part. First, a powerful language model plans the task, breaking it down into smaller steps and identifying the objects involved. Then, a separate system, using a technique called a Diffusion Transformer, focuses solely on *how* to physically move and manipulate those objects. This 'doing' system gets information from the 'thinking' system, but can be improved independently, meaning you can make the robot better at specific actions without messing up its overall understanding.
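The split described above can be sketched as two components that communicate only through a structured plan. This is a minimal illustrative sketch, not the paper's actual API: the `Plan` fields, function names, and canned outputs are all assumptions.

```python
from dataclasses import dataclass

# Hypothetical interfaces illustrating the decoupled design; names and
# outputs are assumptions for illustration, not HiVLA's real code.

@dataclass
class Plan:
    subtask: str    # e.g. "grasp the mug"
    bbox: tuple     # (x1, y1, x2, y2) target bounding box in the image

def vlm_planner(instruction: str) -> list:
    """High-level 'thinking': decompose the task and visually ground each
    target. A real planner would query a VLM; here we return a canned plan."""
    return [Plan("grasp the mug", (120, 80, 200, 160)),
            Plan("place it on the shelf", (300, 40, 420, 120))]

def action_expert(plan: Plan) -> str:
    """Low-level 'doing': a Diffusion Transformer policy would generate
    motor actions conditioned on the plan; here we just report the step."""
    return f"executing '{plan.subtask}' toward box {plan.bbox}"

# The two stages interact only through Plan objects, so either side can be
# retrained or swapped independently of the other.
logs = [action_expert(p) for p in vlm_planner("put the mug on the shelf")]
for line in logs:
    print(line)
```

The key design point is the narrow interface: because the action expert sees only the subtask text and bounding box, improving it on new manipulation data cannot degrade the planner's reasoning.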

Why it matters?

This is important because it allows robots to be both smart and skillful. They can understand complex instructions and adapt to new situations, while also being able to reliably perform the physical actions needed to complete tasks, even with small objects in messy environments. This is a big step towards robots that can truly help us in the real world.

Abstract

While end-to-end Vision-Language-Action (VLA) models offer a promising paradigm for robotic manipulation, fine-tuning them on narrow control data often compromises the profound reasoning capabilities inherited from their base Vision-Language Models (VLMs). To resolve this fundamental trade-off, we propose HiVLA, a visual-grounded-centric hierarchical framework that explicitly decouples high-level semantic planning from low-level motor control. At the high level, a VLM planner first performs task decomposition and visual grounding to generate structured plans, each comprising a subtask instruction and a precise target bounding box. Then, to translate this plan into physical actions, at the low level we introduce a flow-matching Diffusion Transformer (DiT) action expert equipped with a novel cascaded cross-attention mechanism. This design sequentially fuses global context, high-resolution object-centric crops, and skill semantics, enabling the DiT to focus purely on robust execution. Our decoupled architecture preserves the VLM's zero-shot reasoning while allowing independent improvement of both components. Extensive experiments in simulation and the real world demonstrate that HiVLA significantly outperforms state-of-the-art end-to-end baselines, particularly excelling in long-horizon skill composition and the fine-grained manipulation of small objects in cluttered scenes.
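The cascaded cross-attention described in the abstract can be sketched in numpy: action tokens attend in sequence to (1) global scene context, (2) high-resolution object-centric crop features, and (3) skill semantics. The single-head form, token counts, and feature dimension below are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

def cross_attention(queries, keys_values, d):
    """Single-head cross-attention in residual form: queries gather
    information from keys_values and return updated queries."""
    scores = queries @ keys_values.T / np.sqrt(d)            # (Nq, Nk)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)           # softmax
    return queries + weights @ keys_values                   # residual update

rng = np.random.default_rng(0)
d = 16
action_tokens = rng.normal(size=(8, d))   # noisy action chunk (DiT input)
global_ctx    = rng.normal(size=(32, d))  # full-image scene features
crop_ctx      = rng.normal(size=(16, d))  # object-centric crop features
skill_ctx     = rng.normal(size=(4, d))   # subtask / skill embedding

# Cascade: each stage refines the action tokens with one context source,
# moving from coarse scene layout to fine object detail to task intent.
x = cross_attention(action_tokens, global_ctx, d)
x = cross_attention(x, crop_ctx, d)
x = cross_attention(x, skill_ctx, d)
print(x.shape)  # the action tokens keep their shape through the cascade
```

Sequencing the three sources, rather than concatenating them into one key-value set, lets later stages (the crop and the skill) condition on an already scene-aware representation.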