
UniT: Unified Multimodal Chain-of-Thought Test-time Scaling

Leon Liangyu Chen, Haoyu Ma, Zhipeng Fan, Ziqi Huang, Animesh Sinha, Xiaoliang Dai, Jialiang Wang, Zecheng He, Jianwei Yang, Chunyuan Li, Junzhe Sun, Chu Wang, Serena Yeung-Levy, Felix Juefei-Xu

2026-02-18


Summary

This paper introduces UniT, a framework for improving how well AI models that handle both images and text can solve complex problems. It focuses on letting these models think step by step and refine their answers over multiple rounds, similar to how humans tackle difficult tasks.

What's the problem?

Current AI models that work with both images and text usually give one answer without a chance to check their work or improve it. Many real-world problems, like understanding a scene with lots of objects or following complicated instructions, require breaking the problem down, verifying each step, and correcting mistakes. While letting language models 'think' longer by spending extra computing power at inference time has been shown to help, this approach hasn't been successfully extended to models that combine images and text.

What's the solution?

The researchers developed UniT, a system that allows a single AI model to reason, check its reasoning, and improve its answers over multiple rounds. They did this by having the model generate its own practice problems, training it to verify and edit its own work, and letting it carry information forward from earlier steps. The key is that the model learns to work through a problem sequentially, one step at a time, which is more efficient than trying many different answers at once. A rough sketch of this loop follows below.
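
To make this concrete, here is a minimal sketch of what a sequential reason-verify-refine loop could look like at inference time. The interface names (generate, verify, refine) and the stopping rule are illustrative assumptions for this article, not UniT's actual API.

```python
# Minimal sketch of multi-round "reason -> verify -> refine" test-time scaling.
# All method names here are hypothetical stand-ins for whatever a unified
# multimodal model exposes; they are not UniT's actual interface.

def test_time_refine(model, instruction, image=None, max_rounds=4):
    """Iteratively draft, check, and correct an answer with one unified model."""
    draft = model.generate(instruction, image)        # first attempt
    history = [draft]                                 # "content memory" across rounds
    for _ in range(max_rounds):
        critique = model.verify(instruction, draft, history)  # model checks its own work
        if critique.is_satisfied:                             # stop once verification passes
            break
        draft = model.refine(instruction, draft, critique, history)  # targeted correction
        history.append(draft)
    return draft
```

Keeping a history of earlier drafts mirrors the "content memory" behavior described in the paper: later rounds can reuse what earlier rounds already got right instead of starting from scratch.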

Why it matters?

This work shows that letting AI models 'think' step by step at inference time, specifically for models that handle both images and text, is a powerful way to improve their performance. It uses inference compute more efficiently than sampling many answers in parallel, and it helps the models handle situations they haven't seen before, ultimately making them better at both understanding and generating content based on visual information.

Abstract

Unified models can handle both multimodal understanding and generation within a single architecture, yet they typically operate in a single pass without iteratively refining their outputs. Many multimodal tasks, especially those involving complex spatial compositions, multiple interacting objects, or evolving instructions, require decomposing instructions, verifying intermediate results, and making iterative corrections. While test-time scaling (TTS) has demonstrated that allocating additional inference compute for iterative reasoning substantially improves language model performance, extending this paradigm to unified multimodal models remains an open challenge. We introduce UniT, a framework for multimodal chain-of-thought test-time scaling that enables a single unified model to reason, verify, and refine across multiple rounds. UniT combines agentic data synthesis, unified model training, and flexible test-time inference to elicit cognitive behaviors including verification, subgoal decomposition, and content memory. Our key findings are: (1) unified models trained on short reasoning trajectories generalize to longer inference chains at test time; (2) sequential chain-of-thought reasoning provides a more scalable and compute-efficient TTS strategy than parallel sampling; (3) training on generation and editing trajectories improves out-of-distribution visual reasoning. These results establish multimodal test-time scaling as an effective paradigm for advancing both generation and understanding in unified models.
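
For intuition on finding (2), the sketch below contrasts the two test-time scaling strategies under the same inference budget: parallel sampling spends the budget on independent attempts and keeps the best one, while sequential chain-of-thought spends it on successive refinements of a single attempt. The model interface and scoring function are assumptions made for illustration, not code from the paper.

```python
# Two ways to spend the same inference budget (illustrative only, not UniT code).

def parallel_tts(model, task, budget):
    """Draw `budget` independent samples and keep the highest-scoring one."""
    candidates = [model.generate(task) for _ in range(budget)]
    return max(candidates, key=lambda c: model.score(task, c))

def sequential_tts(model, task, budget):
    """Spend the budget on successive verify-and-refine rounds of one answer."""
    answer = model.generate(task)
    for _ in range(budget - 1):
        feedback = model.verify(task, answer)
        answer = model.refine(task, answer, feedback)
    return answer
```

The paper's claim is that, for unified multimodal models, the sequential strategy scales better with budget because each round builds on the previous result rather than discarding it.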