(1D) Ordered Tokens Enable Efficient Test-Time Search

Zhitong Gao, Parham Rezaei, Ali Cy, Mingqiao Ye, Nataša Jovanović, Jesse Allardice, Afshin Dehghan, Amir Zamir, Roman Bachmann, Oğuzhan Fatih Kar

2026-04-20

Summary

This paper investigates how the way data is broken into smaller pieces, called tokens, affects how well we can control the output of AI models that generate content step by step, such as images from text descriptions.

What's the problem?

When AI models generate images or text, they do it one piece at a time, and it's hard to 'steer' this process toward exactly the result you want. The question is whether the way data is initially divided into tokens affects our ability to guide generation as it happens, by evaluating intermediate steps and making adjustments.

What's the solution?

The researchers focused on image generation and compared two ways of creating tokens: a traditional grid-like approach and a newer ordered method that starts with a rough overview and then adds detail. They found that the ordered, coarse-to-fine tokenization allowed better control and more efficient search for the desired image. They even showed that, given a good way to judge the images, you could generate images from text *without* training a generative model at all, just by searching through possible token sequences.
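The search idea above can be illustrated with a minimal best-of-N sketch: sample several complete token sequences, score each with a verifier, and keep the best. The sampler and verifier here are toy stand-ins, not the paper's models.

```python
import random

def best_of_n(sample_candidate, verifier, n=8):
    """Best-of-N test-time search: draw n candidate token
    sequences and return the one the verifier scores highest.
    `sample_candidate` and `verifier` are hypothetical stand-ins
    for a generative prior and an image-text scorer."""
    candidates = [sample_candidate() for _ in range(n)]
    return max(candidates, key=verifier)

# Toy demo: "token sequences" are random digit lists; the verifier
# rewards a large sum (a stand-in for an image-text match score).
random.seed(0)
sample = lambda: [random.randint(0, 9) for _ in range(4)]
best = best_of_n(sample, verifier=sum, n=16)
```

With a larger `n`, the best-scoring candidate improves, which is the test-time scaling behavior the paper studies.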

Why does it matter?

This work shows that the structure of tokens isn't just about how well a model learns, but also about how easily we can influence its output *after* it's been trained. This is important because it suggests ways to improve the flexibility and control of AI generation, and it could lead to more efficient methods for creating customized content.

Abstract

Tokenization is a key component of autoregressive (AR) generative models, converting raw data into more manageable units for modeling. Commonly, tokens describe local information, such as regions of pixels in images or word pieces in text, and AR generation predicts these tokens in a fixed order. A worthwhile question is whether token structures affect the ability to steer the generation through test-time search, where multiple candidate generations are explored and evaluated by a verifier. Using image generation as our testbed, we hypothesize that recent 1D ordered tokenizers with coarse-to-fine structure can be more amenable to search than classical 2D grid structures. This is rooted in the fact that the intermediate states in coarse-to-fine sequences carry semantic meaning that verifiers can reliably evaluate, enabling effective steering during generation. Through controlled experiments, we find that AR models trained on coarse-to-fine ordered tokens exhibit improved test-time scaling behavior compared to grid-based counterparts. Moreover, we demonstrate that, thanks to the ordered structure, pure test-time search over token sequences (i.e., without training an AR model) can perform training-free text-to-image generation when guided by an image-text verifier. Beyond this, we systematically study how classical search algorithms (best-of-N, beam search, lookahead search) interact with different token structures, as well as the role of different verifiers and AR priors. Our results highlight the impact of token structure on inference-time scalability and provide practical guidance for test-time scaling in AR models.
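As a rough illustration of why ordered prefixes help, here is a minimal beam-search sketch over token sequences, with a toy vocabulary and verifier rather than the paper's implementation. Beam search prunes partial sequences at every step, so it only works well when prefixes already carry meaning a verifier can score, as in coarse-to-fine ordered tokenizations.

```python
def beam_search(vocab, length, score_prefix, beam_width=4):
    """Beam search over token sequences: at each step, extend every
    prefix in the beam with every vocabulary token, score the partial
    sequences with the verifier, and keep the top `beam_width`.
    `score_prefix` is a hypothetical verifier on partial sequences."""
    beam = [[]]
    for _ in range(length):
        extended = [prefix + [t] for prefix in beam for t in vocab]
        extended.sort(key=score_prefix, reverse=True)
        beam = extended[:beam_width]
    return beam[0]

# Toy task: recover a target digit sequence under a verifier that
# rewards matching the target position by position.
target = [3, 1, 4]
score = lambda seq: -sum(abs(a - b) for a, b in zip(seq, target))
found = beam_search(range(10), 3, score)  # → [3, 1, 4]
```

If the verifier could only score *complete* sequences (as with unordered grid tokens), the intermediate pruning step would be uninformative and the search would degrade toward blind sampling.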