DiG-Flow: Discrepancy-Guided Flow Matching for Robust VLA Models

Wanpeng Zhang, Ye Wang, Hao Luo, Haoqi Yuan, Yicheng Feng, Sipeng Zheng, Qin Jin, Zongqing Lu

2025-12-03

Summary

This paper introduces a new method, DiG-Flow, to improve how well vision-language-action (VLA) models work when controlling robots. These models are good at understanding instructions and performing tasks, but they can struggle when things change slightly from what they were trained on, or when tasks are complicated and have many steps.

What's the problem?

VLA models, even though they're getting better at robotic tasks, often fail when faced with situations that are a little different from their training data. This means they aren't reliably understanding the core meaning of the task. Also, they have trouble with tasks that require a lot of different steps to complete, suggesting their internal understanding of the task isn't strong enough to handle complexity.

What's the solution?

DiG-Flow tackles this problem by looking at how the model internally represents what it 'sees' (vision) and what it plans to 'do' (action). It measures how well these two representations match. If the visual understanding and the planned action don't line up, DiG-Flow subtly adjusts the visual representation to be more consistent with the action. It does this without changing the model's core learning objective, only by applying small residual updates to the internal representations. The method uses a mathematical concept called 'transport cost' to measure this mismatch (lower cost means the representations are compatible), and the paper proves that these adjustments always improve the training objective.
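The overall loop described above can be sketched in a few lines. This is a toy illustration only: the discrepancy proxy (mean pairwise squared distance), the monotone map (`1 - exp(-cost)`), and the centroid-based residual direction are all simplifying assumptions for exposition, not the paper's exact formulation.

```python
import numpy as np

def transport_cost(obs_emb, act_emb):
    # Crude proxy for the transport cost between the two empirical
    # distributions: mean pairwise squared Euclidean distance.
    diffs = obs_emb[:, None, :] - act_emb[None, :, :]
    return float(np.mean(np.sum(diffs ** 2, axis=-1)))

def modulation_weight(cost, scale=1.0):
    # Monotone map from discrepancy to a weight in [0, 1):
    # higher cost -> stronger correction.
    return 1.0 - np.exp(-scale * cost)

def discrepancy_guided_update(obs_emb, act_emb, alpha=0.1):
    # Residual update nudging the observation embeddings toward the
    # action-embedding centroid, scaled by the discrepancy weight.
    # The flow-matching path and target vector field are untouched.
    cost = transport_cost(obs_emb, act_emb)
    w = modulation_weight(cost)
    direction = act_emb.mean(axis=0, keepdims=True) - obs_emb
    return obs_emb + alpha * w * direction
```

With a small step size, one such update strictly reduces the (proxy) transport cost between the observation and action embeddings, which mirrors the paper's claim that discrepancy-guided training decreases the objective.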

Why it matters?

This research is important because it makes VLA models more reliable and capable. By making them more robust to changes and better at handling complex tasks, we can build robots that are more adaptable and useful in real-world situations. It also achieves these improvements without requiring a lot of extra computing power or changes to existing model structures, making it easy to implement.

Abstract

Vision-Language-Action (VLA) models trained with flow matching have demonstrated impressive capabilities on robotic manipulation tasks. However, their performance often degrades under distribution shift and on complex multi-step tasks, suggesting that the learned representations may not robustly capture task-relevant semantics. We introduce DiG-Flow, a principled framework that enhances VLA robustness through geometric regularization. Our key insight is that the distributional discrepancy between observation and action embeddings provides a meaningful geometric signal: lower transport cost indicates compatible representations, while higher cost suggests potential misalignment. DiG-Flow computes a discrepancy measure between empirical distributions of observation and action embeddings, maps it to a modulation weight via a monotone function, and applies residual updates to the observation embeddings before flow matching. Crucially, this intervention operates at the representation level without modifying the flow matching path or target vector field. We provide theoretical guarantees showing that discrepancy-guided training provably decreases the training objective, and that guided inference refinement converges with contraction. Empirically, DiG-Flow integrates into existing VLA architectures with negligible overhead and consistently improves performance, with particularly pronounced gains on complex multi-step tasks and under limited training data.
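The abstract's claim that "guided inference refinement converges with contraction" can be illustrated with a toy fixed-point iteration. The update rule and step size `alpha` below are hypothetical choices for demonstration: any residual update of this form with `alpha` in (0, 1) shrinks the distance to the fixed point by a constant factor each step, which is what geometric (contraction) convergence means.

```python
import numpy as np

def guided_refinement(obs, target, alpha=0.3, steps=25):
    """Repeatedly apply a residual update pulling `obs` toward `target`.

    Each step multiplies the distance to the fixed point by (1 - alpha),
    so for alpha in (0, 1) the map is a contraction and the iterates
    converge geometrically.
    """
    errors = []
    for _ in range(steps):
        obs = obs + alpha * (target - obs)
        errors.append(float(np.linalg.norm(obs - target)))
    return obs, errors
```

Running this with any starting point shows the error shrinking by exactly `1 - alpha` per iteration, the signature of a contraction mapping.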