BayesianVLA: Bayesian Decomposition of Vision Language Action Models via Latent Action Queries

Shijie Lian, Bin Yu, Xiaopeng Lin, Laurence T. Yang, Zhaolong Shen, Changti Wu, Yuzhuo Miao, Cong Huang, Kai Chen

2026-01-23

Summary

This paper investigates why robots using both vision and language instructions sometimes fail to follow those instructions, especially when faced with new situations. It proposes a new method, BayesianVLA, to help robots better understand and act on language commands.

What's the problem?

Current robot systems trained with vision and language often develop a shortcut: they learn to perform tasks based on what they *see* rather than what they are *told* to do. This happens because, in typical goal-driven training data, the instruction can usually be guessed from the scene alone, so the language adds almost no extra information about the correct action. The paper calls this redundancy 'Information Collapse'. As a result, the robot effectively ignores the language and relies on vision alone, which breaks down when the situation differs from what it has seen before.

What's the solution?

The researchers created a new framework called BayesianVLA. It works by making the robot consider two possibilities: what it would do based on vision alone, and what it should do based on both vision *and* the language instruction. The system then learns to prioritize actions that are specifically explained by the language, essentially forcing it to pay attention to what it's being told. This is done without needing to collect any new training data, just by changing how the robot learns from the data it already has.
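To make this concrete, here is a minimal PyTorch-style sketch of a PMI-regularized imitation objective in the spirit of that description. The function name bc_loss_with_pmi, the discretized action bins, the pmi_weight coefficient, and the choice to detach the prior are illustrative assumptions for this sketch, not details taken from the paper.

```python
import torch
import torch.nn.functional as F


def bc_loss_with_pmi(posterior_logits, prior_logits, expert_actions, pmi_weight=0.5):
    """Behavior cloning plus a PMI bonus over a discretized action space.

    posterior_logits: (batch, num_bins) from the branch seeing vision + language.
    prior_logits:     (batch, num_bins) from the branch seeing vision only.
    expert_actions:   (batch,) integer indices of the demonstrated action bins.
    pmi_weight is an illustrative coefficient, not a value from the paper.
    """
    log_post = F.log_softmax(posterior_logits, dim=-1)
    log_prior = F.log_softmax(prior_logits, dim=-1)

    # Log-probability of the demonstrated action under each branch.
    idx = expert_actions.unsqueeze(-1)
    lp_post = log_post.gather(-1, idx).squeeze(-1)
    lp_prior = log_prior.gather(-1, idx).squeeze(-1)

    # Conditional pointwise mutual information:
    # PMI(a; l | v) = log pi(a | v, l) - log p(a | v).
    # Design choice for this sketch only: detach the prior so the bonus does
    # not simply push the prior's probability down (the paper may differ).
    pmi = lp_post - lp_prior.detach()

    # Imitation term, plus a reward for actions the instruction explains.
    nll = -lp_post.mean()
    return nll - pmi_weight * pmi.mean()
```

In practice, posterior_logits and prior_logits would come from the two branches of the dual-branch model described above: the language-conditioned posterior and the vision-only prior.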

Why it matters?

This research is important because it addresses a key limitation in robot learning. By preventing robots from ignoring language instructions, we can create systems that are more flexible, reliable, and able to handle unexpected situations. This is a big step towards robots that can truly assist us in complex, real-world tasks, and the 11.3% improvement on the challenging out-of-distribution SimplerEnv benchmark shows the idea works in practice.

Abstract

Vision-Language-Action (VLA) models have shown promise in robot manipulation but often struggle to generalize to new instructions or complex multi-task scenarios. We identify a critical pathology in current training paradigms where goal-driven data collection creates a dataset bias. In such datasets, language instructions are highly predictable from visual observations alone, causing the conditional mutual information between instructions and actions to vanish, a phenomenon we term Information Collapse. Consequently, models degenerate into vision-only policies that ignore language constraints and fail in out-of-distribution (OOD) settings. To address this, we propose BayesianVLA, a novel framework that enforces instruction following via Bayesian decomposition. By introducing learnable Latent Action Queries, we construct a dual-branch architecture to estimate both a vision-only prior p(a | v) and a language-conditioned posterior π(a | v, ℓ). We then optimize the policy to maximize the conditional Pointwise Mutual Information (PMI) between actions and instructions. This objective effectively penalizes the vision shortcut and rewards actions that explicitly explain the language command. Without requiring new data, BayesianVLA significantly improves generalization. Extensive experiments on SimplerEnv and RoboCasa demonstrate substantial gains, including an 11.3% improvement on the challenging OOD SimplerEnv benchmark, validating the ability of our approach to robustly ground language in action.
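Written out in the abstract's own notation, the quantity being maximized is the conditional pointwise mutual information between the action and the instruction given the observation (the paper's full objective may include additional terms):

PMI(a; ℓ | v) = log π(a | v, ℓ) − log p(a | v)

When the instruction is redundant given the image, π(a | v, ℓ) ≈ p(a | v) and this quantity goes to zero, which is exactly the Information Collapse described above; maximizing it rewards actions whose probability genuinely increases once the language command is taken into account.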