
MobileVLA-R1: Reinforcing Vision-Language-Action for Mobile Robots

Ting Huang, Dongjian Li, Rui Yang, Zeyu Zhang, Zida Yang, Hao Tang

2025-11-27


Summary

This paper focuses on getting robots, specifically four-legged ones, to follow spoken or written instructions. It's about making robots understand what we *want* them to do, not just *how* to do it, and then actually carry out those instructions smoothly.

What's the problem?

Currently, robots struggle to connect what humans say with the specific movements they need to make. Existing methods have trouble bridging the high-level meaning of an instruction and the low-level control signals that actually move the robot, which leads to unstable behavior and poor generalization to new situations. Basically, they can't reliably take a high-level goal, like 'walk around the table,' and figure out all the little steps needed to actually do it successfully in the real world.

What's the solution?

The researchers created a system called MobileVLA-R1. They also built MobileVLA-CoT, a large dataset that pairs instructions and robot trajectories with chain-of-thought annotations at several levels of detail, so each task is broken down into a series of logical steps, kind of like how a person thinks through a task. They then trained the model in two stages: first, supervised training on these reasoning traces so it learns to reason about an instruction before acting, and second, a reinforcement learning technique called GRPO that rewards rollouts whose reasoning is consistent and whose actions complete the task. This makes the robot's behavior more consistent and stable, and lets it carry out longer, more complex tasks.
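
To make the two-stage recipe concrete, here is a minimal, self-contained PyTorch sketch of the idea: a supervised pass over chain-of-thought-annotated token sequences, followed by a GRPO-style update that samples a group of rollouts and weights their log-likelihood by a group-relative advantage. The TinyPolicy model, toy_reward function, and all sizes are illustrative placeholders, not the paper's actual architecture, reward, or hyperparameters; the real implementation is in the linked repository.

```python
import copy

import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy sizes so the sketch runs instantly on CPU.
VOCAB, HIDDEN, SEQ_LEN, BATCH = 64, 128, 12, 4


class TinyPolicy(nn.Module):
    """Hypothetical stand-in for the VLA policy: maps token prefixes to logits."""

    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, HIDDEN)
        self.rnn = nn.GRU(HIDDEN, HIDDEN, batch_first=True)
        self.head = nn.Linear(HIDDEN, VOCAB)

    def forward(self, tokens):
        hidden, _ = self.rnn(self.embed(tokens))
        return self.head(hidden)  # (batch, seq, vocab) logits


policy = TinyPolicy()
opt = torch.optim.AdamW(policy.parameters(), lr=1e-3)


# ---- Stage 1: supervised chain-of-thought alignment ----
def cot_alignment_step(tokens):
    """Next-token cross-entropy over interleaved reasoning + action tokens."""
    logits = policy(tokens[:, :-1])
    loss = F.cross_entropy(logits.reshape(-1, VOCAB), tokens[:, 1:].reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()


# ---- Stage 2: GRPO-style reinforcement learning ----
def sample_rollouts(model, prompt, group_size):
    """Autoregressively sample a group of completions for one prompt."""
    tokens = prompt.repeat(group_size, 1)
    for _ in range(SEQ_LEN):
        next_logits = model(tokens)[:, -1]
        next_tok = torch.multinomial(F.softmax(next_logits, dim=-1), 1)
        tokens = torch.cat([tokens, next_tok], dim=1)
    return tokens


def toy_reward(rollouts):
    """Placeholder reward; the paper uses task-specific signals instead."""
    return (rollouts == 7).float().mean(dim=1)  # reward producing token id 7


def grpo_step(prompt, group_size=8):
    old_policy = copy.deepcopy(policy)  # frozen behaviour policy for sampling
    with torch.no_grad():
        rollouts = sample_rollouts(old_policy, prompt, group_size)
        rewards = toy_reward(rollouts)
        # Group-relative advantage: normalize rewards within the sampled group.
        adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)

    logits = policy(rollouts[:, :-1])
    logp = F.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, rollouts[:, 1:].unsqueeze(-1)).squeeze(-1)
    # Simplified surrogate: advantage-weighted log-likelihood. Full GRPO also
    # uses a clipped probability ratio and a KL penalty to a reference model.
    loss = -(adv.unsqueeze(1) * token_logp).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()


# Usage: stage 1 on CoT-annotated sequences, then stage 2 on instruction prompts.
cot_batch = torch.randint(0, VOCAB, (BATCH, SEQ_LEN))
print("stage-1 loss:", cot_alignment_step(cot_batch))
prompt = torch.randint(0, VOCAB, (1, 4))
print("stage-2 loss:", grpo_step(prompt))
```

In the real system the policy is a vision-language model conditioned on camera observations and the reward reflects task success rather than a token-matching heuristic, but the group-relative weighting inside grpo_step is the core idea GRPO adds on top of the supervised stage.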

Why it matters?

This work is important because it's a step towards making robots more helpful and versatile. If robots can truly understand and follow our instructions, they can be used in a wider range of situations, like assisting people in their homes, exploring dangerous environments, or performing complex tasks in factories. Improving the connection between language and action is key to unlocking the full potential of robotics.

Abstract

Grounding natural-language instructions into continuous control for quadruped robots remains a fundamental challenge in vision-language-action (VLA) learning. Existing methods struggle to bridge high-level semantic reasoning and low-level actuation, leading to unstable grounding and weak generalization in the real world. To address these issues, we present MobileVLA-R1, a unified vision-language-action framework that enables explicit reasoning and continuous control for quadruped robots. We construct MobileVLA-CoT, a large-scale dataset of multi-granularity chain-of-thought (CoT) for embodied trajectories, providing structured reasoning supervision for alignment. Built upon this foundation, we introduce a two-stage training paradigm that combines supervised CoT alignment with GRPO reinforcement learning to enhance reasoning consistency, control stability, and long-horizon execution. Extensive evaluations on VLN and VLA tasks demonstrate superior performance over strong baselines, with approximately a 5% improvement. Real-world deployment on a quadruped robot validates robust performance in complex environments. Code: https://github.com/AIGeeksGroup/MobileVLA-R1. Website: https://aigeeksgroup.github.io/MobileVLA-R1.
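
For intuition about what "multi-granularity chain-of-thought for embodied trajectories" could look like in practice, here is a hypothetical MobileVLA-CoT record; every field name and value is invented for illustration and is not taken from the released dataset.

```python
# Hypothetical MobileVLA-CoT record (illustrative schema only, not the real one).
cot_trajectory = {
    "instruction": "Walk around the table and stop at the door.",
    "observations": ["frame_000.png", "frame_001.png"],  # egocentric camera frames
    "reasoning": {  # multi-granularity chain of thought
        "task_level": "Circle the table, then head to the door and stop.",
        "subgoal_level": [
            "Approach the table",
            "Keep the table on the left while turning around it",
            "Move toward the door",
            "Stop at the door",
        ],
        "step_level": ["turn left 30 deg", "walk forward 0.5 m", "..."],
    },
    "actions": [[0.4, 0.0, 0.3], [0.5, 0.0, 0.0]],  # e.g. (vx, vy, yaw-rate) commands
}
```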