Rethinking Prompt Design for Inference-time Scaling in Text-to-Visual Generation
Subin Kim, Sangwoo Mo, Mamshad Nayeem Rizve, Yiran Xu, Difan Liu, Jinwoo Shin, Tobias Hinz
2025-12-04
Summary
This paper tackles the problem of getting AI to create images or videos that *exactly* match what you ask for in a text description. It introduces a new method to improve these 'text-to-visual' generators.
What's the problem?
Currently, when you ask an AI to create an image, it often doesn't get it right on the first try. People have tried making the AI work harder, for example by running it for more steps or generating multiple variations, but this only helps so much. The issue is that the initial text description, or 'prompt', stays the same even when the AI's first attempts are off. It's like repeating the same request in the same words when the listener clearly isn't understanding you.
What's the solution?
The researchers developed a system called PRIS, which stands for Prompt Redesign for Inference-time Scaling. PRIS doesn't just generate visuals repeatedly with the same prompt; it *changes* the prompt based on what went wrong in previous attempts. It looks at the generated images or videos, figures out what the AI consistently misunderstands, and then rewrites the prompt to be clearer. To help with this, they also created a way to check how well each part of the prompt matches what's actually in the image, providing more specific feedback than just saying 'good' or 'bad'.
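The review-revise-regenerate loop described above can be sketched in a few lines of Python. Everything here is an illustrative assumption rather than the paper's actual implementation: `generate`, `verify`, and `revise` are placeholder callables standing in for the visual generator, the alignment verifier, and the prompt-rewriting model, and the "recurring failure" rule (an element failing in a majority of samples) is a hypothetical criterion chosen for the sketch:

```python
from typing import Callable, List, Optional, Tuple

def pris_loop(
    prompt: str,
    generate: Callable[[str, int], List[str]],              # prompt, n_samples -> visuals
    verify: Callable[[str, str], List[Tuple[str, bool]]],   # prompt, visual -> per-element pass/fail
    revise: Callable[[str, List[str]], str],                # prompt, recurring failures -> new prompt
    n_samples: int = 4,
    max_rounds: int = 3,
) -> Tuple[str, Optional[str]]:
    """Return (final prompt, best visual seen so far)."""
    best_visual, best_score = None, -1.0
    for _ in range(max_rounds):
        visuals = generate(prompt, n_samples)
        # Count how often each prompt element fails across the batch of samples.
        failure_counts: dict = {}
        for v in visuals:
            report = verify(prompt, v)
            score = sum(ok for _, ok in report) / len(report)
            if score > best_score:
                best_visual, best_score = v, score
            for element, ok in report:
                if not ok:
                    failure_counts[element] = failure_counts.get(element, 0) + 1
        # Elements that fail in a majority of samples count as recurring failures.
        recurring = [e for e, c in failure_counts.items() if c > n_samples // 2]
        if not recurring:
            break  # no systematic misunderstanding left to fix
        prompt = revise(prompt, recurring)
    return prompt, best_visual
```

The key design point, per the summary, is that the prompt itself is a variable in the loop: feedback flows into `revise` rather than only into picking the best sample.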
Why does it matter?
This work is important because it shows that improving AI-generated visuals isn't just about making the AI 'think' harder, but also about communicating with it more effectively. By adapting the prompt alongside the visual generation process, they achieved significant improvements, demonstrating that a smarter approach to prompting can unlock the full potential of these powerful AI tools.
Abstract
Achieving precise alignment between user intent and generated visuals remains a central challenge in text-to-visual generation, as a single attempt often fails to produce the desired output. To handle this, prior approaches mainly scale the visual generation process (e.g., increasing sampling steps or seeds), but this quickly leads to a quality plateau. This limitation arises because the prompt, crucial for guiding generation, is kept fixed. To address this, we propose Prompt Redesign for Inference-time Scaling, coined PRIS, a framework that adaptively revises the prompt during inference in response to the scaled visual generations. The core idea of PRIS is to review the generated visuals, identify recurring failure patterns across visuals, and redesign the prompt accordingly before regenerating the visuals with the revised prompt. To provide precise alignment feedback for prompt revision, we introduce a new verifier, element-level factual correction, which evaluates the alignment between prompt attributes and generated visuals at a fine-grained level, achieving more accurate and interpretable assessments than holistic measures. Extensive experiments on both text-to-image and text-to-video benchmarks demonstrate the effectiveness of our approach, including a 15% gain on VBench 2.0. These results highlight that jointly scaling prompts and visuals is key to fully leveraging scaling laws at inference-time. Visualizations are available at the website: https://subin-kim-cv.github.io/PRIS.
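The element-level factual correction verifier described in the abstract can be thought of as decomposing the prompt into atomic attributes and checking each one separately, instead of emitting a single holistic score. A minimal sketch under that reading, where the element list and the `check` callable (e.g. a VQA-style model asked whether each attribute appears) are stand-ins, not the paper's actual components:

```python
from typing import Callable, Dict, List

def element_level_score(
    elements: List[str],           # atomic prompt attributes, e.g. from an LLM decomposition
    check: Callable[[str], bool],  # hypothetical checker: is this element present in the visual?
) -> Dict[str, object]:
    """Per-element pass/fail plus an aggregate score and the list of failed elements."""
    results = {e: check(e) for e in elements}
    return {
        "per_element": results,
        "score": sum(results.values()) / len(elements),
        "failed": [e for e, ok in results.items() if not ok],
    }
```

Compared with a holistic "good/bad" judgment, the `failed` list is directly actionable: it names exactly which attributes the prompt revision step should target.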