VLANeXt: Recipes for Building Strong VLA Models
Xiao-Ming Wu, Bin Fan, Kang Liao, Jian-Jian Jiang, Runze Yang, Yihang Luo, Zhonghua Wu, Wei-Shi Zheng, Chen Change Loy
2026-02-24
Summary
This paper investigates how best to build Vision-Language-Action (VLA) models: AI systems that understand both images and language and then perform actions in the real world.
What's the problem?
Currently, there are many different ways researchers are building these models, but it's hard to know which approaches are actually effective because everyone uses different methods for training and testing. This makes it difficult to compare models and figure out what really makes a good VLA.
What's the solution?
The researchers created a standard framework and testing environment, then systematically evaluated different components and design choices within VLAs. Starting from a basic model, they changed one thing at a time to measure its effect on performance. Through this process, they distilled twelve key findings that contribute to a strong VLA, and used these insights to build a new, improved model called VLANeXt.
Why does it matter?
This work provides a clear guide for building better VLAs, offering a 'recipe' for success. The researchers are also sharing their code so other scientists can easily reproduce their results and continue to improve these models, ultimately helping to create more capable and reliable AI systems that can interact with the world around us.
Abstract
Following the rise of large foundation models, Vision-Language-Action models (VLAs) emerged, leveraging strong visual and language understanding for general-purpose policy learning. Yet, the current VLA landscape remains fragmented and exploratory. Although many groups have proposed their own VLA models, inconsistencies in training protocols and evaluation settings make it difficult to identify which design choices truly matter. To bring structure to this evolving space, we reexamine the VLA design space under a unified framework and evaluation setup. Starting from a simple VLA baseline similar to RT-2 and OpenVLA, we systematically dissect design choices along three dimensions: foundational components, perception essentials, and action modelling perspectives. From this study, we distill 12 key findings that together form a practical recipe for building strong VLA models. The outcome of this exploration is a simple yet effective model, VLANeXt. VLANeXt outperforms prior state-of-the-art methods on the LIBERO and LIBERO-plus benchmarks and demonstrates strong generalization in real-world experiments. We will release a unified, easy-to-use codebase that serves as a common platform for the community to reproduce our findings, explore the design space, and build new VLA variants on top of a shared foundation.