iFlyBot-VLA Technical Report
Yuan Zhang, Chenyu Xue, Wenjie Xu, Chao Ji, Jiajia Wu, Jia Pan
2025-11-05
Summary
This paper introduces iFlyBot-VLA, a new artificial intelligence model that understands both images and language in order to perform physical actions, such as a robot carrying out manipulation tasks. It is designed to be better at figuring out *what* to do and *how* to do it, combining high-level goals with the specific movements needed to achieve them.
What's the problem?
Current robots often struggle to understand instructions given in everyday language and to translate them into effective actions. They either lack a good understanding of the world around them, or they cannot connect what someone *wants* to happen with the precise motor controls needed to make it happen. Existing models often treat vision, language, and action as separate components, making it hard for a robot to reason about tasks in a human-like way.
What's the solution?
The researchers created iFlyBot-VLA by training a model on a huge amount of video showing humans and robots performing tasks. They used a clever system where the model learns two kinds of action information: a general idea of the goal (like 'pick up the object') and the specific movements needed to achieve it. They also trained the model using a mix of data, including robot movement recordings and question-answering datasets, to improve its understanding of both the physical world and language. This combined approach helps the model connect what it 'sees' and 'hears' with what it needs to 'do'.
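The report does not describe the exact data-mixing mechanism, but the idea of blending robot trajectory data with general and spatial QA data during training can be sketched as weighted sampling over heterogeneous sources. The source names and mixing weights below are hypothetical illustrations, not values from the paper:

```python
import random

# Hypothetical mixing weights (not from the paper): most of each batch
# comes from robot trajectories, the rest from QA corpora.
SOURCE_WEIGHTS = {
    "robot_trajectories": 0.6,
    "general_qa": 0.2,
    "spatial_qa": 0.2,
}

def sample_mixed_batch(datasets, weights, batch_size=8, rng=random):
    """Draw one training batch: each slot first picks a data source
    according to its weight, then a random example from that source,
    so every batch blends action data with language/spatial QA data."""
    names = list(weights)
    probs = [weights[n] for n in names]
    batch = []
    for _ in range(batch_size):
        src = rng.choices(names, probs)[0]
        batch.append((src, rng.choice(datasets[src])))
    return batch

# Toy stand-in datasets, just string identifiers for illustration.
datasets = {
    "robot_trajectories": ["traj_%d" % i for i in range(100)],
    "general_qa": ["qa_%d" % i for i in range(100)],
    "spatial_qa": ["sqa_%d" % i for i in range(100)],
}
batch = sample_mixed_batch(datasets, SOURCE_WEIGHTS, batch_size=8)
```

Co-training on QA data in this way keeps the VLM backbone's general perception and reasoning abilities from degrading while it learns action prediction.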
Why it matters?
This work is important because it brings us closer to robots that can truly assist humans in complex tasks. By better understanding language and vision, and by linking that understanding directly to action, iFlyBot-VLA can perform tasks more reliably and efficiently. The researchers are even sharing some of the data they created, which will help other scientists build even better robots in the future.
Abstract
We introduce iFlyBot-VLA, a large-scale Vision-Language-Action (VLA) model trained under a novel framework. The main contributions are as follows: (1) a latent action model thoroughly trained on large-scale human and robotic manipulation videos; (2) a dual-level action representation framework that jointly supervises both the Vision-Language Model (VLM) and the action expert during training; (3) a mixed training strategy that combines robot trajectory data with general QA and spatial QA datasets, effectively enhancing the 3D perceptual and reasoning capabilities of the VLM backbone. Specifically, the VLM is trained to predict two complementary forms of actions: latent actions, derived from our latent action model pretrained on cross-embodiment manipulation data, which capture implicit high-level intentions; and structured discrete action tokens, obtained through frequency-domain transformations of continuous control signals, which encode explicit low-level dynamics. This dual supervision aligns the representation spaces of language, vision, and action, enabling the VLM to directly contribute to action generation. Experimental results on the LIBERO Franka benchmark demonstrate the superiority of our framework, while real-world evaluations further show that iFlyBot-VLA achieves competitive success rates across diverse and challenging manipulation tasks. Furthermore, we plan to open-source a portion of our self-constructed dataset to support future research in the community.
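The abstract does not specify which frequency-domain transform is used to obtain the discrete action tokens. A minimal sketch, assuming a type-II DCT followed by uniform quantization of the coefficients (the scheme used by frequency-domain action tokenizers such as FAST), illustrates how a continuous control trajectory can become a short sequence of integer tokens; the quantization step size here is an arbitrary illustrative choice:

```python
import math

def dct(x):
    """Unnormalized type-II DCT of a 1-D sequence."""
    N = len(x)
    return [sum(x[n] * math.cos(math.pi / N * (n + 0.5) * k) for n in range(N))
            for k in range(N)]

def idct(X):
    """Exact inverse of dct() above (scaled type-III DCT)."""
    N = len(X)
    return [X[0] / N + (2.0 / N) * sum(X[k] * math.cos(math.pi / N * (n + 0.5) * k)
                                       for k in range(1, N))
            for n in range(N)]

def tokenize(traj, step=0.05):
    """Map a continuous 1-D action trajectory to discrete integer tokens
    by uniformly quantizing its DCT coefficients; smooth trajectories
    concentrate their energy in the low-frequency tokens."""
    return [round(c / step) for c in dct(traj)]

def detokenize(tokens, step=0.05):
    """Recover an approximate trajectory from quantized tokens."""
    return idct([t * step for t in tokens])

# Toy example: a smooth 16-step trajectory for one control dimension
# (e.g. a gripper position), round-tripped through the token space.
traj = [math.sin(0.3 * t) for t in range(16)]
tokens = tokenize(traj)
recon = detokenize(tokens)
err = max(abs(a - b) for a, b in zip(traj, recon))
```

In a VLA model, such tokens can share the VLM's discrete output vocabulary, which is what lets a single backbone be supervised on both language tokens and low-level action dynamics. A real system would tokenize each action dimension (or chunks of multi-dimensional actions) and typically entropy-code the quantized coefficients.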