Expertise need not monopolize: Action-Specialized Mixture of Experts for Vision-Language-Action Learning

Weijie Shen, Yitian Liu, Yuhao Wu, Zhixuan Liang, Sijia Gu, Dehui Wang, Tian Nian, Lei Xu, Yusen Qin, Jiangmiao Pang, Xinping Guan, Xiaokang Yang, Yao Mu

2025-10-17

Summary

This paper introduces a new way to build better AI systems for robots that understand both vision (what they see) and language (what they're told to do), and then use that understanding to perform actions. It focuses on making these systems bigger and more capable without requiring huge amounts of new data or slowing down the robot's response time.

What's the problem?

Building these vision-language-action (VLA) models is hard because training them from scratch needs a lot of computing power and data, and robot data in particular is scarce. Also, making these models powerful enough to be useful while still being fast enough for real-time control is a tricky balancing act. Simply making the model bigger doesn't always work well and can be inefficient.

What's the solution?

The researchers developed a system called AdaMoE. It builds on an idea called 'Mixture of Experts' (MoE), where different parts of the model specialize in different tasks. AdaMoE starts from a pretrained VLA model and scales it up by replacing the feedforward layers of its action expert with sparsely activated MoE layers. Importantly, AdaMoE doesn't route each input to a single winning expert: a router decides which experts are relevant to the task, while a separate 'scale adapter' independently controls how much each selected expert contributes. This lets multiple experts collaborate instead of competing winner-takes-all, which leads to better performance.
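To make the decoupling idea concrete, here is a minimal PyTorch sketch of an MoE feedforward layer where expert selection (a router picking the top-k relevant experts) is separated from expert weighting (an independent scale adapter). All names, dimensions, and the sigmoid scaling choice are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class DecoupledMoELayer(nn.Module):
    """Sketch: expert *selection* (router) decoupled from expert
    *weighting* (scale adapter), so selected experts contribute with
    independently controlled weights rather than winner-takes-all.
    Hypothetical structure -- not the authors' code."""

    def __init__(self, dim, hidden, num_experts=4, top_k=2):
        super().__init__()
        # Each expert is a small feedforward block (stand-in for the
        # dense model's feedforward layer being replaced).
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        )
        self.router = nn.Linear(dim, num_experts)         # which experts are relevant
        self.scale_adapter = nn.Linear(dim, num_experts)  # how much each contributes
        self.top_k = top_k

    def forward(self, x):  # x: (batch, dim)
        logits = self.router(x)                           # task-relevance scores
        topk_idx = logits.topk(self.top_k, dim=-1).indices
        scales = torch.sigmoid(self.scale_adapter(x))     # independent weights
        out = torch.zeros_like(x)
        for b in range(x.size(0)):
            for e in topk_idx[b]:                         # only top-k experts run
                out[b] += scales[b, e] * self.experts[e](x[b:b + 1]).squeeze(0)
        return out
```

The key design point this sketch illustrates: in a standard MoE, the same router softmax both picks the experts and sets their mixing weights, so one expert tends to dominate; here the scale adapter can assign each selected expert its own contribution regardless of how narrowly it won the selection.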

Why it matters?

This work is important because it shows how to build more effective robot control systems without needing massive amounts of new data or sacrificing speed. The improvements in both simulated and real-world robotic tasks, particularly the significant 21.5% improvement in real-world performance, demonstrate that AdaMoE is a practical solution for making robots more capable and adaptable.

Abstract

Vision-Language-Action (VLA) models are experiencing rapid development and demonstrating promising capabilities in robotic manipulation tasks. However, scaling up VLA models presents several critical challenges: (1) Training new VLA models from scratch demands substantial computational resources and extensive datasets. Given the current scarcity of robot data, it becomes particularly valuable to fully leverage well-pretrained VLA model weights during the scaling process. (2) Real-time control requires carefully balancing model capacity with computational efficiency. To address these challenges, we propose AdaMoE, a Mixture-of-Experts (MoE) architecture that inherits pretrained weights from dense VLA models and scales up the action expert by substituting the feedforward layers with sparsely activated MoE layers. AdaMoE employs a decoupling technique that separates expert selection from expert weighting through an independent scale adapter working alongside the traditional router. This enables experts to be selected based on task relevance while contributing with independently controlled weights, allowing collaborative expert utilization rather than winner-takes-all dynamics. Our approach demonstrates that expertise need not monopolize. Instead, through collaborative expert utilization, we can achieve superior performance while maintaining computational efficiency. AdaMoE consistently outperforms the baseline model across key benchmarks, delivering performance gains of 1.8% on LIBERO and 9.3% on RoboTwin. Most importantly, a substantial 21.5% improvement in real-world experiments validates its practical effectiveness for robotic manipulation tasks.