Vlaser: Vision-Language-Action Model with Synergistic Embodied Reasoning
Ganlin Yang, Tianyi Zhang, Haoran Hao, Weiyun Wang, Yibin Liu, Dehui Wang, Guanzhou Chen, Zijian Cai, Junting Chen, Weijie Su, Wengang Zhou, Yu Qiao, Jifeng Dai, Jiangmiao Pang, Gen Luo, Wenhai Wang, Yao Mu, Zhi Hou
2025-10-14
Summary
This research focuses on making robots smarter by connecting their ability to 'think' about what they see and are told with their ability to actually *do* things. It introduces a new model called Vlaser that helps bridge the gap between understanding instructions and carrying them out.
What's the problem?
Currently, there's a disconnect between how well robots can understand language and reason about the world (thanks to advanced AI models) and how well they can actually perform physical tasks. Existing systems often treat the 'thinking' and 'doing' parts separately, leading to robots that struggle to translate instructions into actions. The data used to train these 'thinking' models is very different from the data needed to train robots to move and interact with objects, creating a challenge when trying to combine them.
What's the solution?
The researchers created Vlaser, a new Vision-Language-Action model, specifically designed to connect reasoning and action. They also built a large dataset called Vlaser-6M to train it. They then systematically compared different pre-trained starting points for the model (different VLM initializations), figuring out how best to adapt the 'thinking' part of the model to the 'doing' part. This involved carefully fine-tuning the model on data collected specifically for robot control.
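To make the idea concrete, here is a minimal, hypothetical sketch (not the authors' code) of the general recipe behind this kind of fine-tuning: a pre-trained 'thinking' backbone stays frozen while a small 'action head' is trained on robot-specific data. The backbone below is a stand-in random feature map; a real system would use a VLM such as Vlaser, and the data would be real robot trajectories rather than synthetic arrays.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for frozen, pre-trained VLM features: a fixed random projection.
W_backbone = rng.normal(size=(32, 16))  # "pre-trained" weights, not updated


def backbone(obs):
    # obs: (batch, 32) observations -> (batch, 16) features
    return np.tanh(obs @ W_backbone)


# Synthetic robot data: observations and target actions (e.g., 7-DoF commands).
obs = rng.normal(size=(256, 32))
true_head = rng.normal(size=(16, 7))
actions = backbone(obs) @ true_head + 0.01 * rng.normal(size=(256, 7))

# Supervised fine-tuning of the action head only; the backbone stays frozen.
W_head = np.zeros((16, 7))
lr = 0.1
for step in range(500):
    feats = backbone(obs)              # frozen features of the observations
    pred = feats @ W_head              # predicted actions
    grad = feats.T @ (pred - actions) / len(obs)  # gradient of mean squared error
    W_head -= lr * grad

mse = float(np.mean((backbone(obs) @ W_head - actions) ** 2))
print(mse)
```

The split mirrors the paper's framing at a toy scale: internet-scale pre-training determines the frozen features, while embodied policy data shapes only the action mapping, which is why the choice of initialization matters so much.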
Why it matters?
This work is important because it represents a step towards creating robots that can reliably follow complex instructions in real-world environments. By improving the connection between reasoning and action, robots can become more helpful and versatile, potentially assisting with tasks in homes, workplaces, and beyond. The insights gained about how to train these models can also be applied to other robotic systems, leading to further advancements in the field.
Abstract
While significant research has focused on developing embodied reasoning capabilities using Vision-Language Models (VLMs) or integrating advanced VLMs into Vision-Language-Action (VLA) models for end-to-end robot control, few studies directly address the critical gap between upstream VLM-based reasoning and downstream VLA policy learning. In this work, we take an initial step toward bridging embodied reasoning with VLA policy learning by introducing Vlaser - a Vision-Language-Action Model with synergistic embodied reasoning capability, which is a foundational vision-language model designed to integrate high-level reasoning with low-level control for embodied agents. Built upon the high-quality Vlaser-6M dataset, Vlaser achieves state-of-the-art performance across a range of embodied reasoning benchmarks - including spatial reasoning, embodied grounding, embodied QA, and task planning. Furthermore, we systematically examine how different VLM initializations affect supervised VLA fine-tuning, offering novel insights into mitigating the domain shift between internet-scale pre-training data and embodied-specific policy learning data. Based on these insights, our approach achieves state-of-the-art results on the WidowX benchmark and competitive performance on the Google Robot benchmark.