CogVLA: Cognition-Aligned Vision-Language-Action Model via Instruction-Driven Routing & Sparsification

Wei Li, Renshan Zhang, Rui Shao, Jie He, Liqiang Nie

2025-08-29

Summary

This paper introduces CogVLA, a new system for understanding instructions involving both vision and action, like telling a robot what to do with objects it sees.

What's the problem?

Current Vision-Language-Action (VLA) models perform well, but they require extensive post-training on top of pre-trained vision-language models, which demands enormous computing power and makes them hard to deploy in real-world settings or scale to more complex tasks. In short, they are too slow and expensive to train and run.

What's the solution?

CogVLA tackles this problem by processing information more like humans do, using a three-stage pipeline. First, it focuses on the parts of the camera view that matter for the given instruction, aggregating and compressing the visual tokens accordingly. Second, inside the language model it prunes away visual tokens that are irrelevant to the instruction, making the representation sparse. Finally, a special coupled attention mechanism ensures the robot can still connect what it sees, the instruction, and the action it needs to take. Together, these stages make the system more efficient without sacrificing accuracy.
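The first two stages can be illustrated with a toy numpy sketch. This is not the paper's implementation: the function names (`film_modulate`, `prune_by_relevance`), the random stand-in weights, and the dot-product relevance score are all illustrative assumptions; in CogVLA the modulation parameters and routing scores come from learned networks inside the encoder and LLM.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16        # feature dimension (illustrative)
N_VIS = 64    # visual tokens from the encoder
N_KEEP = 8    # tokens surviving sparsification

# Stand-ins for learned FiLM projection weights (random here).
W_gamma = rng.standard_normal((D, D)) / np.sqrt(D)
W_beta = rng.standard_normal((D, D)) / np.sqrt(D)

def film_modulate(tokens, instr):
    """Stage-1 sketch: FiLM-style conditioning scales and shifts each
    visual token using parameters derived from the instruction."""
    gamma = np.tanh(instr @ W_gamma)      # (D,) instruction-dependent scale
    beta = np.tanh(instr @ W_beta)        # (D,) instruction-dependent shift
    return tokens * (1.0 + gamma) + beta  # broadcast over all tokens

def prune_by_relevance(tokens, instr, keep):
    """Stage-2 sketch: score each token against the instruction and
    drop the instruction-irrelevant ones (token-level sparsity)."""
    scores = tokens @ instr               # (N,) similarity to instruction
    idx = np.argsort(scores)[-keep:]      # indices of the top-`keep` tokens
    return tokens[np.sort(idx)]           # keep original token order

vis = rng.standard_normal((N_VIS, D))     # mock vision-encoder output
instr = rng.standard_normal(D)            # mock instruction embedding

modulated = film_modulate(vis, instr)
sparse = prune_by_relevance(modulated, instr, N_KEEP)
print(sparse.shape)                       # only 8 of 64 tokens remain
```

The downstream action decoder then attends over these 8 tokens instead of all 64, which is where the training- and inference-cost savings in the paper come from.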

Why it matters?

CogVLA is important because it significantly improves the speed and reduces the cost of training and using VLA models. It achieves better results than previous models on standard tests and even works well with real robots, opening the door for more practical applications of AI in robotics and other fields where understanding vision and language together is crucial.

Abstract

Recent Vision-Language-Action (VLA) models built on pre-trained Vision-Language Models (VLMs) require extensive post-training, resulting in high computational overhead that limits scalability and deployment. We propose CogVLA, a Cognition-Aligned Vision-Language-Action framework that leverages instruction-driven routing and sparsification to improve both efficiency and performance. CogVLA draws inspiration from human multimodal coordination and introduces a 3-stage progressive architecture. 1) Encoder-FiLM based Aggregation Routing (EFA-Routing) injects instruction information into the vision encoder to selectively aggregate and compress dual-stream visual tokens, forming an instruction-aware latent representation. 2) Building upon this compact visual encoding, LLM-FiLM based Pruning Routing (LFP-Routing) introduces action intent into the language model by pruning instruction-irrelevant visually grounded tokens, thereby achieving token-level sparsity. 3) To ensure that compressed perception inputs can still support accurate and coherent action generation, we introduce V-L-A Coupled Attention (CAtten), which combines causal vision-language attention with bidirectional action parallel decoding. Extensive experiments on the LIBERO benchmark and real-world robotic tasks demonstrate that CogVLA achieves state-of-the-art performance with success rates of 97.4% and 70.0%, respectively, while reducing training costs by 2.5-fold and decreasing inference latency by 2.8-fold compared to OpenVLA. CogVLA is open-sourced and publicly available at https://github.com/JiuTian-VL/CogVLA.