A Pragmatic VLA Foundation Model
Wei Wu, Fan Lu, Yunnan Wang, Shuai Yang, Shi Liu, Fangjing Wang, Qian Zhu, He Sun, Yong Wang, Shuailei Ma, Yiyu Ren, Kejia Zhang, Hui Yu, Jingmei Zhao, Shuai Zhou, Zhenqi Qiu, Houlong Xiong, Ziyu Wang, Zechen Wang, Ran Cheng, Yong-Lu Li, Yongtao Huang
2026-01-28
Summary
This paper introduces LingBot-VLA, a new Vision-Language-Action (VLA) model designed to help robots understand instructions given in natural language and then carry out the corresponding physical actions. It aims to be a strong, adaptable, and efficient foundation for robotic manipulation tasks.
What's the problem?
Currently, teaching robots to reliably perform a variety of tasks is difficult and expensive. It requires a lot of data and computing power to adapt existing models to new robots or tasks. Existing models often struggle to generalize well, meaning they don't work consistently across different situations or robot types. The goal is to create a model that overcomes these limitations, offering both strong performance and cost-effectiveness.
What's the solution?
The researchers created LingBot-VLA by training it on a massive dataset of about 20,000 hours of real-world robot data collected from nine popular dual-arm robot configurations. They then thoroughly evaluated the model on three different robotic platforms, each completing 100 different tasks, using 130 post-training episodes per task to adapt the model. They also built an efficient training codebase that reaches 261 samples per second per GPU on an 8-GPU setup, a 1.5–2.8× speedup over comparable VLA codebases.
Why it matters?
LingBot-VLA is important because it represents a significant step towards more practical and versatile robots. Its strong performance, ability to work with different robots, and efficient training process make it a promising foundation for real-world robotic applications. Furthermore, the researchers are making the model, code, and data publicly available, which will help accelerate further research and development in the field of robot learning and encourage better standards for evaluating these systems.
Abstract
Offering great potential in robotic manipulation, a capable Vision-Language-Action (VLA) foundation model is expected to faithfully generalize across tasks and platforms while ensuring cost efficiency (e.g., data and GPU hours required for adaptation). To this end, we develop LingBot-VLA with around 20,000 hours of real-world data from 9 popular dual-arm robot configurations. Through a systematic assessment on 3 robotic platforms, each completing 100 tasks with 130 post-training episodes per task, our model achieves clear superiority over competitors, showcasing its strong performance and broad generalizability. We have also built an efficient codebase, which delivers a throughput of 261 samples per second per GPU with an 8-GPU training setup, representing a 1.5–2.8× speedup (depending on the VLM base model used) over existing VLA-oriented codebases. The above features ensure that our model is well-suited for real-world deployment. To advance the field of robot learning, we provide open access to the code, base model, and benchmark data, with a focus on enabling more challenging tasks and promoting sound evaluation standards.
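As a rough back-of-the-envelope check of the reported throughput figures, the sketch below scales the per-GPU rate to the 8-GPU setup and bounds the baseline codebases' implied throughput from the stated speedup range. This is illustrative only: the aggregate and baseline numbers are derived here, not reported in the paper, and the calculation assumes the speedup is measured per GPU under the same training configuration with near-linear multi-GPU scaling.

```python
# Illustrative throughput check (assumptions: per-GPU speedup comparison,
# near-linear scaling across 8 GPUs; derived numbers are not from the paper).

per_gpu_throughput = 261        # samples / second / GPU (reported)
num_gpus = 8                    # training setup size (reported)
speedup_range = (1.5, 2.8)      # speedup over existing VLA codebases (reported)

# Aggregate throughput across the 8-GPU setup.
aggregate = per_gpu_throughput * num_gpus
print(f"aggregate throughput ~ {aggregate} samples/s")  # ~2088 samples/s

# Implied per-GPU throughput of the baseline codebases, bounded by the speedup range.
baseline_low = per_gpu_throughput / speedup_range[1]    # ~93 samples/s/GPU
baseline_high = per_gpu_throughput / speedup_range[0]   # 174 samples/s/GPU
print(f"implied baseline range ~ {baseline_low:.0f}-{baseline_high:.0f} samples/s/GPU")
```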