OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation

Jinghui Lu, Jiayi Guan, Zhijian Huang, Jinlong Li, Guang Li, Lingdong Kong, Yingyan Li, Han Wang, Shaoqing Xu, Yuechen Luo, Fang Li, Chenxu Dang, Junli Wang, Tao Xu, Jing Wu, Jianhua Wu, Xiaoshuai Hao, Wen Zhang, Tianyi Jiang, Lingfeng Zhang, Lei Zhou, Yingbo Tang

2026-04-21

OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation

Summary

This paper focuses on improving how self-driving cars predict what other vehicles and pedestrians will do in the future, specifically by making the reasoning process faster without sacrificing accuracy.

What's the problem?

A technique called Chain-of-Thought (CoT) reasoning is very good at predicting trajectories, but it is slow because it generates its reasoning step by step. Faster methods, called 'latent CoT', try to compress this reasoning into hidden states, but so far they have been less accurate than explicit CoT. The researchers argue this is because these faster methods compress only the words used to explain the reasoning, rather than the actual physics and rules that govern how things move in the real world.

What's the solution?

The researchers created a new system called OneVL, which combines language understanding with a visual world model. It routes reasoning through a compact set of 'latent' tokens, but crucially, these latents are trained not only to reconstruct the explanation in words, but also to predict what the next frame of video will look like. This forces the system to internalize the underlying rules of how roads, cars, and people behave. At inference time, the extra decoders used for training are removed, allowing for very fast predictions.
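The idea of compact latent tokens supervised by two auxiliary decoders can be pictured with a minimal numpy sketch. This is not the authors' implementation: all dimensions, weight matrices, and function names here are invented for illustration, and real models would use deep transformer decoders rather than single linear maps.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (all hypothetical; the paper does not specify them here).
D_IN, D_LAT, N_LAT = 16, 8, 4      # input dim, latent dim, number of latent tokens
V_TEXT, V_VIS = 32, 64             # text vocabulary size, visual token vocabulary size
D_FLAT = N_LAT * D_LAT

# Shared encoder: maps scene features to a compact block of latent tokens.
W_enc = rng.normal(scale=0.1, size=(D_IN, D_FLAT))

# Auxiliary decoders: used only during training, discarded at inference.
W_lang = rng.normal(scale=0.1, size=(D_FLAT, V_TEXT))    # reconstructs the text CoT
W_world = rng.normal(scale=0.1, size=(D_FLAT, V_VIS))    # predicts next-frame tokens
W_traj = rng.normal(scale=0.1, size=(D_FLAT, 2))         # trajectory head (x, y)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(logits, target_ids):
    probs = softmax(logits)
    return -np.mean(np.log(probs[np.arange(len(target_ids)), target_ids] + 1e-9))

def training_losses(x, cot_ids, frame_ids, traj_xy):
    """Joint objective: trajectory + text-CoT reconstruction + next-frame prediction."""
    latents = x @ W_enc                                   # one parallel pass -> latents
    lang_logits = np.tile(latents @ W_lang, (len(cot_ids), 1))
    world_logits = np.tile(latents @ W_world, (len(frame_ids), 1))
    traj_pred = latents @ W_traj
    return {
        "traj": float(np.mean((traj_pred - traj_xy) ** 2)),
        "lang": float(cross_entropy(lang_logits, cot_ids)),
        "world": float(cross_entropy(world_logits, frame_ids)),
    }

def infer(x):
    """At deployment the auxiliary decoders are dropped: one pass, answer only."""
    return (x @ W_enc) @ W_traj

x = rng.normal(size=(D_IN,))
losses = training_losses(x, cot_ids=np.array([1, 5, 3]),
                         frame_ids=np.array([10, 2]), traj_xy=np.zeros(2))
```

Note how `infer` touches only the encoder and trajectory head: the language and world-model decoders shape the latent space during training, then add zero cost at inference.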

Why it matters?

OneVL is the first 'latent CoT' method that performs *better* than the slower, step-by-step methods. This is a big deal because it means self-driving cars can potentially make accurate predictions about other agents' movements in real-time, which is crucial for safety and effective autonomous driving. It also shows that focusing on understanding the world's dynamics, not just the explanation, leads to better and more reliable AI.

Abstract

Chain-of-Thought (CoT) reasoning has become a powerful driver of trajectory prediction in VLA-based autonomous driving, yet its autoregressive nature imposes a latency cost that is prohibitive for real-time deployment. Latent CoT methods attempt to close this gap by compressing reasoning into continuous hidden states, but consistently fall short of their explicit counterparts. We suggest that this is due to purely linguistic latent representations compressing a symbolic abstraction of the world, rather than the causal dynamics that actually govern driving. Thus, we present OneVL (One-step latent reasoning and planning with Vision-Language explanations), a unified VLA and World Model framework that routes reasoning through compact latent tokens supervised by dual auxiliary decoders. Alongside a language decoder that reconstructs text CoT, we introduce a visual world model decoder that predicts future-frame tokens, forcing the latent space to internalize the causal dynamics of road geometry, agent motion, and environmental change. A three-stage training pipeline progressively aligns these latents with trajectory, language, and visual objectives, ensuring stable joint optimization. At inference, the auxiliary decoders are discarded and all latent tokens are prefilled in a single parallel pass, matching the speed of answer-only prediction. Across four benchmarks, OneVL becomes the first latent CoT method to surpass explicit CoT, delivering state-of-the-art accuracy at answer-only latency, and providing direct evidence that tighter compression, when guided by both language and world-model supervision, produces more generalizable representations than verbose token-by-token reasoning. Project Page: https://xiaomi-embodied-intelligence.github.io/OneVL
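The "three-stage training pipeline" that progressively aligns the latents with trajectory, language, and visual objectives can be sketched as a staged loss-weight schedule. The abstract does not specify stage ordering or weights, so the schedule below is purely an assumption for illustration; only the idea of objectives being introduced progressively comes from the paper.

```python
# Hypothetical schedule: each stage enables one more objective, mirroring the
# paper's progressive alignment of trajectory, language, and visual losses.
# Stage names, ordering, and weight values are invented.
STAGES = [
    {"name": "stage1_traj",  "weights": {"traj": 1.0, "lang": 0.0, "world": 0.0}},
    {"name": "stage2_lang",  "weights": {"traj": 1.0, "lang": 1.0, "world": 0.0}},
    {"name": "stage3_joint", "weights": {"traj": 1.0, "lang": 1.0, "world": 1.0}},
]

def total_loss(losses, weights):
    """Weighted sum of per-objective losses under the current stage."""
    return sum(weights[k] * losses[k] for k in losses)

# Example per-objective loss values (made up), combined per stage.
example_losses = {"traj": 0.5, "lang": 2.0, "world": 1.5}
per_stage = {s["name"]: total_loss(example_losses, s["weights"]) for s in STAGES}
# stage1_traj -> 0.5, stage2_lang -> 2.5, stage3_joint -> 4.0
```

Staging like this lets earlier objectives stabilize before later ones are switched on, which is one common way to keep a multi-objective optimization from diverging.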