DualVLA: Building a Generalizable Embodied Agent via Partial Decoupling of Reasoning and Action
Zhen Fang, Zhuoyang Liu, Jiaming Liu, Hao Chen, Yu Zeng, Shiting Huang, Zehui Chen, Lin Chen, Shanghang Zhang, Feng Zhao
2025-12-01
Summary
This paper introduces a new approach, called DualVLA, to building AI models that can understand both images and language, and then use that understanding to perform actions, like a robot completing a task. It focuses on creating a model that's good at *both* understanding instructions and actually carrying them out successfully.
What's the problem?
Typically, researchers build these 'Vision-Language-Action' (VLA) models by first training them to be really good at specific actions, like robot manipulation. Then, they try to broaden the model's understanding by mixing in more general multimodal data. However, the researchers found that when they did this, the model actually got *worse* at the actions it was originally good at, a problem they call 'action degeneration'. Essentially, learning more general knowledge made the model forget how to act well.
What's the solution?
To fix this, the researchers developed DualVLA, which combines two main ideas. First, they carefully pruned the data used for broader learning, removing redundant embodied reasoning that didn't help with actions and might actually confuse the model. Second, they used a training technique with two 'teacher' models, so the student model receives different supervision depending on whether a sample teaches actions or general reasoning, helping it maintain both skills. They also created a new evaluation method, VLA Score, which breaks performance down into reasoning, understanding intent, action execution, and how well the actions align with the instructions.
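The two-teacher idea can be sketched in a few lines. This is a minimal illustration, not the paper's actual implementation: the function names, the squared-error distillation term, and the domain labels are all assumptions made for clarity.

```python
# Hypothetical sketch of dual-teacher adaptive distillation:
# an action-specialist teacher supervises robot-action samples,
# while a general multimodal teacher supervises reasoning samples.
# Names and the loss form are illustrative, not the authors' code.

def distillation_loss(sample, student_out, action_teacher_out, vlm_teacher_out):
    """Pick the supervision signal based on the sample's data domain."""
    if sample["domain"] == "robot_action":
        # Match the action specialist on manipulation data.
        teacher_out = action_teacher_out
    else:
        # Match the general multimodal teacher on reasoning data.
        teacher_out = vlm_teacher_out
    # Simple mean-squared-error distillation term (a placeholder for
    # whatever divergence the real training objective uses).
    return sum((s - t) ** 2 for s, t in zip(student_out, teacher_out)) / len(student_out)

# The same student prediction is pulled toward different teachers
# depending on the data domain of the sample.
robot_loss = distillation_loss({"domain": "robot_action"}, [0.2, 0.8], [0.0, 1.0], [0.5, 0.5])
reason_loss = distillation_loss({"domain": "reasoning"}, [0.2, 0.8], [0.0, 1.0], [0.5, 0.5])
```

The point of the routing is that action data never receives supervision from the general-purpose teacher (which could erode manipulation skill), and reasoning data never receives supervision from the action specialist (which could erode understanding).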
Why it matters?
This work is important because it addresses a key challenge in building truly useful AI assistants. It’s not enough for an AI to understand what you want; it also needs to be able to reliably *do* it. DualVLA shows a way to build models that are both intelligent and capable, achieving a better balance between understanding and action, and the new evaluation method provides a more detailed way to measure progress in this field.
Abstract
To build a generalizable Vision-Language-Action (VLA) model with strong reasoning ability, a common strategy is to first train a specialist VLA on robot demonstrations to acquire reliable manipulation skills, and then incorporate mixed annotated robot data together with multimodal data to restore broader reasoning capabilities. However, we observe that the resulting reasoning VLA often suffers from degraded action performance compared to the specialist model before fine-tuning, a phenomenon we refer to as action degeneration. To address this issue, we propose DualVLA, which enhances action performance through carefully designed post-training while still preserving reasoning capability. We first introduce a dual-layer data pruning method that removes redundant embodied reasoning, preventing it from adversely influencing action learning. To further strengthen action generation, we design a dual-teacher adaptive distillation strategy that assigns different supervision signals to different data domains while maintaining reasoning ability. To fill the evaluation gap for generalist VLAs, we also propose VLA Score, which decouples VLA capability into reasoning, intention, action, and alignment dimensions for a more fine-grained assessment. Experiments show that DualVLA achieves an average success rate of 61.0 in SimplerEnv and an average score of 65.4 across eight competitive multimodal benchmarks, demonstrating a stronger balance between precise action execution and multimodal understanding. Project Website: https://costaliya.github.io/DualVLA/.
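The VLA Score described in the abstract decouples capability into four dimensions. As a rough illustration of what such a decomposed metric looks like, the sketch below averages the four dimensions with equal weight; the equal weighting and the 0-100 scale are assumptions, not the paper's actual formula.

```python
# Illustrative aggregation for a decoupled metric like VLA Score.
# The four dimension names come from the abstract; the equal-weight
# average and 0-100 scale are assumptions for this sketch.

DIMENSIONS = ("reasoning", "intention", "action", "alignment")

def vla_score(scores: dict) -> float:
    """Average per-dimension scores (each assumed in [0, 100]) into one number."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    return sum(scores[d] for d in DIMENSIONS) / len(DIMENSIONS)

# Hypothetical per-dimension results for one model.
example = {"reasoning": 70.0, "intention": 65.0, "action": 61.0, "alignment": 58.0}
overall = vla_score(example)
```

Reporting the four dimensions separately, rather than only the aggregate, is what makes the evaluation fine-grained: a model can score high on reasoning while its action execution lags, which a single success rate would hide.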