RoboAlign: Learning Test-Time Reasoning for Language-Action Alignment in Vision-Language-Action Models
Dongyoung Kim, Sumin Park, Woomin Song, Seungku Kim, Taeyoung Kim, Huiwon Jang, Jinwoo Shin, Jaehyung Kim, Younggyo Seo
2026-03-24
Summary
This paper focuses on improving how well large language models that understand both images and text can be used to control robots, making them better at translating what they understand into physical actions.
What's the problem?
Current methods for teaching these models to perform actions are often unreliable and don't lead to significant improvements in a robot's ability to complete tasks. Researchers have tried to improve these models by having them answer questions about what they 'see', but this hasn't consistently resulted in better robot performance, and sometimes it even hurts.
What's the solution?
The researchers developed a new training method called RoboAlign. The method first uses the model's existing language skills to reason, in plain natural language, about what actions a robot should take, and then uses reinforcement learning (a technique where the model learns through trial and error) to refine that reasoning so the resulting action predictions become more accurate. This helps the model better connect language instructions with the specific movements a robot needs to make.
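To make the two stages concrete, the sketch below refines a toy policy over discretized action tokens, standing in for the MLLM's language-reasoned action proposals, using a plain REINFORCE update toward a demonstrated action. The one-dimensional action space, the reward shape, and all names are illustrative assumptions, not the paper's actual algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the MLLM's action head: logits over K discretized
# action bins for a single control dimension. (Hypothetical simplification;
# the real model reasons in natural language before emitting action tokens.)
K = 32
logits = np.zeros(K)

def sample_action(logits):
    """Sample an action bin, mimicking reasoning -> action-token sampling."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    bin_idx = rng.choice(K, p=probs)
    return bin_idx, probs

def reward(bin_idx, target_bin):
    """Action-accuracy reward: higher when the sampled bin is near the
    demonstrated ground-truth bin (assumed reward shape, not the paper's)."""
    return -abs(bin_idx - target_bin) / K

target_bin = 20  # ground-truth action bin from a demonstration
lr = 0.5

# REINFORCE: refine the action distribution so sampled tokens earn reward.
for step in range(500):
    bin_idx, probs = sample_action(logits)
    r = reward(bin_idx, target_bin)
    grad_log_pi = -probs
    grad_log_pi[bin_idx] += 1.0     # gradient of log pi(a) w.r.t. logits
    logits += lr * r * grad_log_pi  # policy-gradient ascent step

# The distribution should now concentrate near target_bin (20).
print("most likely action bin after RL:", logits.argmax())
```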
Why it matters?
This work is important because it provides a more reliable way to build robots that can understand and respond to instructions. By improving the connection between what the model understands and what the robot does, RoboAlign allows for significant performance gains on challenging robotics tasks, even with a relatively small amount of training data, bringing us closer to robots that can effectively operate in the real world.
Abstract
Improving embodied reasoning in multimodal large language models (MLLMs) is essential for building vision-language-action models (VLAs) on top of them that readily translate multimodal understanding into low-level actions. Accordingly, recent work has explored enhancing embodied reasoning in MLLMs through vision-question-answering-style supervision. However, these approaches have been reported to yield unstable VLA performance, often with only marginal or even negative gains. In this paper, we propose RoboAlign, a more systematic MLLM training framework that reliably improves VLA performance. Our key idea is to sample action tokens via zero-shot natural language reasoning and to refine this reasoning using reinforcement learning (RL) to improve action accuracy. As a result, RoboAlign bridges the modality gap between language and low-level actions in MLLMs and facilitates knowledge transfer from the MLLM to the VLA. To validate the effectiveness of RoboAlign, we train VLAs by adding a diffusion-based action head on top of an MLLM backbone and evaluate them on major robotics benchmarks. Remarkably, by performing RL-based alignment after SFT using less than 1% of the data, RoboAlign achieves performance improvements of 17.5%, 18.9%, and 106.6% over SFT baselines on LIBERO, CALVIN, and real-world environments, respectively.
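The architecture mentioned in the abstract, a diffusion-based action head on top of an MLLM backbone, can be illustrated with a short self-contained sketch. The backbone below is a stand-in linear layer rather than a real MLLM, and the dimensions, noise schedule, and head design are illustrative assumptions; only the overall pattern (condition a denoising head on backbone features and train it with a standard DDPM noise-prediction loss) reflects the described setup.

```python
import torch
import torch.nn as nn

D_CTX, D_ACT, T = 256, 7, 100  # context dim, action dim, diffusion steps

# Stand-in for the MLLM backbone; in the real model this would produce
# features from vision + language tokens.
backbone = nn.Sequential(nn.Linear(512, D_CTX), nn.GELU())

class DiffusionActionHead(nn.Module):
    """Small denoising head conditioned on backbone features (illustrative)."""
    def __init__(self):
        super().__init__()
        self.t_embed = nn.Embedding(T, D_CTX)
        self.net = nn.Sequential(
            nn.Linear(D_ACT + 2 * D_CTX, 256), nn.GELU(),
            nn.Linear(256, D_ACT),  # predicts the noise added to the action
        )

    def forward(self, noisy_action, t, ctx):
        x = torch.cat([noisy_action, self.t_embed(t), ctx], dim=-1)
        return self.net(x)

head = DiffusionActionHead()
opt = torch.optim.AdamW(head.parameters(), lr=1e-4)
# Standard DDPM cumulative-alpha schedule (assumed, not from the paper).
alpha_bar = torch.cumprod(1 - torch.linspace(1e-4, 0.02, T), dim=0)

# One DDPM-style training step on a dummy batch.
obs = torch.randn(8, 512)       # stand-in for vision + language features
action = torch.randn(8, D_ACT)  # ground-truth demonstrated actions
ctx = backbone(obs)
t = torch.randint(0, T, (8,))
noise = torch.randn_like(action)
ab = alpha_bar[t].unsqueeze(-1)
noisy = ab.sqrt() * action + (1 - ab).sqrt() * noise
loss = nn.functional.mse_loss(head(noisy, t, ctx), noise)
loss.backward(); opt.step(); opt.zero_grad()
```

At inference time, the trained head would iteratively denoise a random action vector, conditioned on the backbone's features, to produce the low-level action; that loop is omitted here for brevity.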