
InstructVLA: Vision-Language-Action Instruction Tuning from Understanding to Manipulation

Shuai Yang, Hao Li, Yilun Chen, Bin Wang, Yang Tian, Tai Wang, Hanqing Wang, Feng Zhao, Yiyi Liao, Jiangmiao Pang

2025-08-05

Summary

This paper introduces InstructVLA, an AI model that combines vision, language, and action in one system so robots can understand natural-language instructions and carry out manipulation tasks in the real world.

What's the problem?

The problem is that most existing models focus either on understanding language and images or on performing actions, but not both. When a model is trained mainly to act, it tends to lose its ability to reason about language and images, which makes the resulting robot less flexible.

What's the solution?

InstructVLA addresses this with a training recipe, vision-language-action instruction tuning, that preserves the model's ability to understand language and images while also teaching it to produce actions. It trains on language data and action data at the same time, using a mixture-of-experts adaptation to balance the two skills (a rough sketch of the mixture-of-experts idea is shown below).
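The summary above does not spell out the paper's exact adapter design, so the sketch below only illustrates the general mixture-of-experts idea: a small gating network decides, for each token, how much each expert network contributes, so one expert can specialize in reasoning and another in action. The class name, dimensions, and two-expert setup are illustrative assumptions, not the paper's implementation.

```python
# Minimal mixture-of-experts adapter sketch (illustrative, not the paper's code).
import torch
import torch.nn as nn


class MoEAdapter(nn.Module):
    def __init__(self, hidden_dim: int = 768, num_experts: int = 2):
        super().__init__()
        # One small feed-forward expert per skill (e.g. reasoning vs. action).
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.GELU(),
                          nn.Linear(hidden_dim, hidden_dim))
            for _ in range(num_experts)
        )
        # Gating network: decides how much each expert contributes per token.
        self.gate = nn.Linear(hidden_dim, num_experts)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_dim) features from the backbone.
        weights = torch.softmax(self.gate(hidden_states), dim=-1)        # (B, T, E)
        expert_outputs = torch.stack(
            [expert(hidden_states) for expert in self.experts], dim=-1
        )                                                                 # (B, T, D, E)
        # Weighted sum of expert outputs, added residually to the input.
        mixed = (expert_outputs * weights.unsqueeze(2)).sum(dim=-1)       # (B, T, D)
        return hidden_states + mixed


# Example: adapt a batch of 4 sequences of 16 tokens.
features = torch.randn(4, 16, 768)
adapted = MoEAdapter()(features)
print(adapted.shape)  # torch.Size([4, 16, 768])
```

Because the gate is learned jointly with both experts, training on language and action data together lets the model route each input to whichever skill it needs, rather than overwriting one skill with the other.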

Why it matters?

This matters because it allows robots to follow complex instructions more effectively and perform tasks that need both understanding and precise actions, improving human-robot interaction and making robots more useful in everyday life.

Abstract

InstructVLA is an end-to-end vision-language-action model that enhances manipulation performance while preserving vision-language reasoning through multimodal training and mixture-of-experts adaptation.