
Recurrent-Depth VLA: Implicit Test-Time Compute Scaling of Vision-Language-Action Models via Latent Iterative Reasoning

Yalcin Tur, Jalal Naghiyev, Haoquan Fang, Wei-Chuan Tsai, Jiafei Duan, Dieter Fox, Ranjay Krishna

2026-02-10


Summary

This paper introduces a new way for robots to understand instructions and perform tasks, focusing on how they 'think' through a problem internally instead of reacting in a single pass. The idea is to let robots spend more computation on hard actions and less on easy ones, making them more efficient at complex tasks.

What's the problem?

Current robots that combine vision and language often struggle with tasks that require multiple steps. Existing models spend the same fixed amount of computation on every action, wasting processing power on simple adjustments while underthinking complex, multi-step manipulation. A technique called 'Chain-of-Thought' lets a model vary how much it computes by writing out its reasoning as text, but its memory use grows with every token it generates, and it doesn't work well when actions need to be continuous, like smoothly moving a robotic arm.

What's the solution?

The researchers developed a system called RD-VLA, which stands for Recurrent-Depth Vision-Language-Action. Instead of writing out step-by-step reasoning like Chain-of-Thought, RD-VLA refines its actions internally: a single, weight-tied action head is applied over and over to a latent representation, almost like the robot 'thinking' about how to improve its movement. Because the same weights are reused at every pass, memory stays constant no matter how many passes run, and the model stops refining once its internal representation stops changing, a sign the action is good enough. The system was trained with truncated backpropagation through time (TBPTT), which keeps learning manageable by only backpropagating through the last few refinement steps. A code sketch of the refinement loop follows below.
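To make the idea concrete, here is a minimal, illustrative sketch of a weight-tied refinement head with convergence-based stopping. The class name RecurrentActionHead, the residual update, the layer sizes, and the tolerance tol are assumptions for illustration, not the paper's actual implementation.

```python
# Minimal sketch (PyTorch assumed) of a weight-tied recurrent action head
# with convergence-based adaptive stopping. Class name, layer sizes, the
# residual update, and the tolerance are illustrative, not the paper's code.
import torch
import torch.nn as nn

class RecurrentActionHead(nn.Module):
    def __init__(self, latent_dim: int = 512, action_dim: int = 7):
        super().__init__()
        # One shared block reused at every depth step (weight tying), so
        # parameter and memory cost do not depend on how many passes run.
        self.refine = nn.Sequential(
            nn.Linear(latent_dim, latent_dim),
            nn.GELU(),
            nn.Linear(latent_dim, latent_dim),
        )
        self.decode = nn.Linear(latent_dim, action_dim)

    @torch.no_grad()
    def infer(self, z: torch.Tensor, max_iters: int = 16, tol: float = 1e-3):
        # Keep refining the latent until it converges (adaptive stopping),
        # so hard inputs get more compute and easy ones exit early.
        for step in range(max_iters):
            z_next = z + self.refine(z)          # one latent refinement pass
            done = (z_next - z).norm(dim=-1).max() < tol
            z = z_next
            if done:
                break
        return self.decode(z), step + 1          # action and depth used
```

The early-exit check is what makes the compute adaptive: a small correction may converge after one pass, while a long-horizon manipulation keeps iterating up to the cap.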

Why it matters?

This work is important because it lets robots spend processing power only where a task needs it. In the paper's experiments, tasks that failed completely with a single refinement pass succeeded more than 90 percent of the time with four passes, while simpler tasks saturated quickly. Because the reasoning happens in a fixed-size latent state rather than in generated text, memory stays constant and inference runs up to 80 times faster than prior reasoning-based models. This makes it more practical to build robots that can handle complex tasks in the real world, opening the door for more advanced robotics applications.

Abstract

Current Vision-Language-Action (VLA) models rely on fixed computational depth, expending the same amount of compute on simple adjustments and complex multi-step manipulation. While Chain-of-Thought (CoT) prompting enables variable computation, it scales memory linearly and is ill-suited for continuous action spaces. We introduce Recurrent-Depth VLA (RD-VLA), an architecture that achieves computational adaptivity via latent iterative refinement rather than explicit token generation. RD-VLA employs a recurrent, weight-tied action head that supports arbitrary inference depth with a constant memory footprint. The model is trained using truncated backpropagation through time (TBPTT) to efficiently supervise the refinement process. At inference, RD-VLA dynamically allocates compute using an adaptive stopping criterion based on latent convergence. Experiments on challenging manipulation tasks show that recurrent depth is critical: tasks that fail entirely (0 percent success) with single-iteration inference exceed 90 percent success with four iterations, while simpler tasks saturate rapidly. RD-VLA provides a scalable path to test-time compute in robotics, replacing token-based reasoning with latent reasoning to achieve constant memory usage and up to 80x inference speedup over prior reasoning-based VLA models. Project page: https://rd-vla.github.io/
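The abstract names TBPTT as the training method but does not show what that looks like for a refinement loop, so here is a hedged sketch, reusing the RecurrentActionHead from the code above. The split between no-gradient and gradient-tracked iterations (total_iters, k) and the MSE loss are illustrative assumptions, not details from the paper.

```python
# Hedged sketch of truncated backpropagation through time (TBPTT) over the
# refinement loop. `head` is an instance of the RecurrentActionHead sketched
# earlier; total_iters, k, and the MSE loss are illustrative choices.
import torch
import torch.nn.functional as F

def tbptt_loss(head, z0, target_action, total_iters: int = 8, k: int = 2):
    z = z0
    # Roll the recurrence forward without tracking gradients...
    with torch.no_grad():
        for _ in range(total_iters - k):
            z = z + head.refine(z)
    z = z.detach()
    # ...then backpropagate through only the last k refinement steps, so
    # training memory does not grow with the total refinement depth.
    for _ in range(k):
        z = z + head.refine(z)
    return F.mse_loss(head.decode(z), target_action)
```

In training, the loss on the final refined latent is backpropagated as usual; gradients reach the shared weights only through the last k steps, which is what keeps supervision of the refinement process efficient.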