When Numbers Speak: Aligning Textual Numerals and Visual Instances in Text-to-Video Diffusion Models
Zhengyang Sun, Yu Chen, Xin Zhou, Xiaofan Li, Xiwu Chen, Dingkang Liang, Xiang Bai
2026-04-10
Summary
This paper introduces a new method called NUMINA that improves how well text-to-video AI models follow instructions about the *number* of objects to include in a video.
What's the problem?
Current text-to-video AI models are really good at creating videos from text descriptions, but they often mess up when the description specifies a precise number of things. For example, if you ask for 'three red balls,' you might get two, four, or even just one. This is because the AI struggles to accurately translate the numerical part of the prompt into the visual scene.
What's the solution?
NUMINA doesn't require any additional training of the AI model itself. Instead, it first figures out what the AI *thinks* the layout of the scene is, by looking at the internal attention patterns that best track individual objects. It then gently adjusts this internal 'layout' so that it matches the requested number of objects, and uses the adjusted layout to steer the video generation process so the correct number of things actually appears. It essentially gives the AI a little nudge in the right direction.
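The identify-then-adjust idea can be illustrated with a toy sketch: threshold an attention map into a binary layout, count the resulting blobs, and suppress the weakest ones when there are too many. This is a minimal, NumPy-only illustration; the function names, the thresholding, and the flood-fill counting are assumptions made for this sketch, not the actual NUMINA implementation (which operates on diffusion-model attention heads during sampling).

```python
import numpy as np


def components(mask):
    """Return the 4-connected components of a binary mask as pixel lists
    (simple stack-based flood fill; illustrative, not NUMINA's code)."""
    mask = mask.copy()
    h, w = mask.shape
    comps = []
    for i in range(h):
        for j in range(w):
            if mask[i, j]:
                pixels = []
                stack = [(i, j)]
                while stack:
                    y, x = stack.pop()
                    if 0 <= y < h and 0 <= x < w and mask[y, x]:
                        mask[y, x] = False  # mark visited
                        pixels.append((y, x))
                        stack += [(y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)]
                comps.append(pixels)
    return comps


def refine_layout(attn, target_count, threshold=0.5):
    """Derive a countable layout from an attention map and conservatively
    refine it: if it contains more blobs than requested, keep only the
    target_count blobs with the highest total attention mass."""
    layout = attn > threshold
    comps = components(layout)
    if len(comps) <= target_count:
        # Too few blobs: leave the layout as-is here (the real method
        # would instead guide generation to add the missing instances).
        return layout
    comps.sort(key=lambda px: sum(attn[y, x] for y, x in px), reverse=True)
    refined = np.zeros_like(layout)
    for pixels in comps[:target_count]:
        for y, x in pixels:
            refined[y, x] = True
    return refined
```

In a full pipeline, the refined layout would then modulate cross-attention during regeneration; here it simply shows how a layout with the wrong object count can be nudged toward the requested one.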
Why it matters?
This work is important because getting the number of objects right is crucial for creating realistic and useful videos. Improving 'numerical alignment' makes these AI models more reliable and opens the door to more complex and specific video creations. It shows that you can improve these models without needing to retrain them, which is a much more practical approach.
Abstract
Text-to-video diffusion models have enabled open-ended video synthesis, but often struggle with generating the correct number of objects specified in a prompt. We introduce NUMINA, a training-free identify-then-guide framework for improved numerical alignment. NUMINA identifies prompt-layout inconsistencies by selecting discriminative self- and cross-attention heads to derive a countable latent layout. It then refines this layout conservatively and modulates cross-attention to guide regeneration. On the introduced CountBench, NUMINA improves counting accuracy by up to 7.4% on Wan2.1-1.3B, and by 4.9% and 5.5% on 5B and 14B models, respectively. Furthermore, CLIP alignment is improved while maintaining temporal consistency. These results demonstrate that structural guidance complements seed search and prompt enhancement, offering a practical path toward count-accurate text-to-video diffusion. The code is available at https://github.com/H-EmbodVis/NUMINA.