TIC-VLA: A Think-in-Control Vision-Language-Action Model for Robot Navigation in Dynamic Environments

Zhiyu Huang, Yun Zhang, Johnson Liu, Rui Song, Chen Tang, Jiaqi Ma

2026-02-12

Summary

This paper introduces a new way for robots to understand and follow language instructions while moving around in the real world, even when there's a delay between receiving the instruction and being able to act on it.

What's the problem?

Currently, many robot control systems that use both vision and language assume the robot can instantly understand what's being said and react accordingly. However, understanding language takes time – the robot needs to 'think' about the instruction before it can execute it. This creates a mismatch between the time it takes to process information and the need for immediate, real-time control, especially in busy environments where quick reactions are important.

What's the solution?

The researchers developed a system called Think-in-Control (TIC)-VLA. This system acknowledges the delay in understanding language and builds it into the robot's decision-making process. It lets the robot act on its current observations *and* its (slightly stale) understanding of the instruction, along with explicit information about how long the delay is, to make better choices. They also trained the robot with imitation learning and reinforcement learning while injecting reasoning delays that mimic real-world deployment, and built a realistic simulation environment called DynaNav to test the system.
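The key idea above is an interface where the action policy sees three things at once: the current observation, the most recent *completed* language-understanding result (which was computed from an older observation), and a number saying how stale that result is. The following is a minimal, hypothetical sketch of that interface; the shapes, names, and the simple linear action head are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

OBS_DIM, SEM_DIM, ACT_DIM = 8, 4, 2
rng = np.random.default_rng(0)
# Placeholder for a learned action head (a single linear map here).
W = rng.standard_normal((ACT_DIM, OBS_DIM + SEM_DIM + 1))

def policy_step(obs, delayed_semantics, latency_steps):
    """Map (current obs, stale semantic state, staleness) -> action.

    `delayed_semantics` is the output of slow vision-language
    reasoning on an observation from `latency_steps` ago; exposing
    `latency_steps` lets the policy compensate for the lag.
    """
    x = np.concatenate([obs, delayed_semantics, [float(latency_steps)]])
    return W @ x

obs = rng.standard_normal(OBS_DIM)
sem = rng.standard_normal(SEM_DIM)  # reasoned from an observation 3 steps ago
action = policy_step(obs, sem, latency_steps=3)
print(action.shape)  # (2,)
```

In a real system the linear map would be a learned network, but the signature is the point: the policy never waits for reasoning to finish; it always acts on the freshest observation plus whatever semantics are available, tagged with their age.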

Why it matters?

This work is important because it makes robots more reliable and adaptable in everyday situations. By accounting for the time it takes to process language, robots can navigate complex environments and follow instructions more effectively, even when things don't happen perfectly on time. This is a step towards robots that can truly assist people in dynamic, real-world settings.

Abstract

Robots in dynamic, human-centric environments must follow language instructions while maintaining real-time reactive control. Vision-language-action (VLA) models offer a promising framework, but they assume temporally aligned reasoning and control, despite semantic inference being inherently delayed relative to real-time action. We introduce Think-in-Control (TIC)-VLA, a latency-aware framework that explicitly models delayed semantic reasoning during action generation. TIC-VLA defines a delayed semantic-control interface that conditions action generation on delayed vision-language semantic states and explicit latency metadata, in addition to current observations, enabling policies to compensate for asynchronous reasoning. We further propose a latency-consistent training pipeline that injects reasoning inference delays during imitation learning and online reinforcement learning, aligning training with asynchronous deployment. To support realistic evaluation, we present DynaNav, a physics-accurate, photo-realistic simulation suite for language-guided navigation in dynamic environments. Extensive experiments in simulation and on a real robot show that TIC-VLA consistently outperforms prior VLA models while maintaining robust real-time control under multi-second reasoning latency. Project website: https://ucla-mobility.github.io/TIC-VLA/
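The latency-consistent training pipeline described in the abstract can be illustrated with a toy rollout loop: during data collection, each training sample pairs the current observation with a semantic state from a randomly delayed earlier step, so training matches asynchronous deployment. The buffer size, delay distribution, and function names below are assumptions for illustration only.

```python
from collections import deque
import random

def rollout_with_delay(observations, reason_fn, max_delay=3, seed=0):
    """Yield (current_obs, delayed_semantic_state, delay) samples.

    `reason_fn` stands in for slow vision-language inference; a random
    delay of up to `max_delay` steps is injected so the policy trains
    on semantic states that lag the current observation.
    """
    rng = random.Random(seed)
    buffer = deque(maxlen=max_delay + 1)  # recent semantic states
    samples = []
    for obs in observations:
        buffer.append(reason_fn(obs))
        delay = rng.randint(0, len(buffer) - 1)
        samples.append((obs, buffer[-1 - delay], delay))
    return samples

# Toy usage: semantics = obs * 10, so staleness is easy to inspect.
samples = rollout_with_delay(range(5), lambda o: o * 10)
```

Each sample's semantic state corresponds to the observation from `delay` steps earlier, which is exactly the mismatch the policy must learn to compensate for.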