D2D: Detector-to-Differentiable Critic for Improved Numeracy in Text-to-Image Generation

Nobline Yoo, Olga Russakovsky, Ye Zhu

2025-11-05

Summary

This paper focuses on improving how well AI image generators understand and accurately depict the *number* of objects requested in a text prompt, like asking for 'three apples' and actually getting three apples in the image.

What's the problem?

Current AI image generators are really good at making images that *match* what you ask for, but they often get the *quantity* wrong. Existing methods try to fix this by adding extra 'counting' tools, but those tools have to work with the image generation process in a specific way – they must provide feedback (gradients) that the AI can use to adjust the image. This rules out the most accurate counting tools, which count by detecting each object one at a time and therefore can't provide that kind of feedback.

What's the solution?

The researchers developed a new technique called Detector-to-Differentiable (D2D). Essentially, they figured out how to make even the most accurate, but normally incompatible, object-counting programs work *with* the image generator. They do this by converting the counting program's hard yes/no detections into a smooth signal the image generator can follow to adjust the number of objects in the image. It's like translating the counting program's language into a language the image generator can speak.
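The core idea, turning hard detections into a smooth, differentiable count, can be illustrated with a toy sketch. This is not the paper's actual implementation; the function names, the threshold/temperature parameters, and the squared-error loss are all illustrative assumptions, standing in for the paper's custom activation functions over detector logits:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def soft_count(det_logits, threshold=0.0, temperature=0.1):
    # Each detection logit becomes a soft binary indicator in (0, 1):
    # a confident detection contributes ~1, a weak one ~0, and the
    # mapping is smooth, so gradients can flow back through it.
    return sigmoid((det_logits - threshold) / temperature).sum()

def count_loss(det_logits, target_count):
    # Hypothetical critic loss: squared error between the soft count
    # and the number of objects requested in the prompt.
    return (soft_count(det_logits) - target_count) ** 2

# Two confident detections and one weak one give a soft count near 2.
logits = np.array([4.0, 3.5, -5.0])
```

A hard count (thresholding and enumerating boxes) would give the same number here, but its gradient is zero almost everywhere; the soft version is what lets the image generator receive usable feedback.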

Why it matters?

This work is important because it allows AI image generators to create images with the correct number of objects more reliably. This means you’re more likely to get exactly what you ask for, and it opens the door to using the best possible counting tools to improve image generation quality without sacrificing image detail or slowing down the process.

Abstract

Text-to-image (T2I) diffusion models have achieved strong performance in semantic alignment, yet they still struggle with generating the correct number of objects specified in prompts. Existing approaches typically incorporate auxiliary counting networks as external critics to enhance numeracy. However, since these critics must provide gradient guidance during generation, they are restricted to regression-based models that are inherently differentiable, thus excluding detector-based models with superior counting ability, whose count-via-enumeration nature is non-differentiable. To overcome this limitation, we propose Detector-to-Differentiable (D2D), a novel framework that transforms non-differentiable detection models into differentiable critics, thereby leveraging their superior counting ability to guide numeracy generation. Specifically, we design custom activation functions to convert detector logits into soft binary indicators, which are then used to optimize the noise prior at inference time with pre-trained T2I models. Our extensive experiments on SDXL-Turbo, SD-Turbo, and Pixart-DMD across four benchmarks of varying complexity (low-density, high-density, and multi-object scenarios) demonstrate consistent and substantial improvements in object counting accuracy (e.g., boosting up to 13.7% on D2D-Small, a 400-prompt, low-density benchmark), with minimal degradation in overall image quality and computational overhead.
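The abstract's inference-time procedure, optimizing the noise prior against a differentiable count signal, can be sketched as a toy loop. Everything here is an illustrative stand-in: the linear "detector" replaces the real T2I model plus detection model, and finite differences replace backpropagation through that pipeline:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy stand-in for "generate image from noise, then run the detector":
# a fixed linear map from a noise vector to 5 detection logits.
rng = np.random.default_rng(0)
W = rng.normal(size=(5, 8))

def detector_logits(noise):
    return W @ noise

def count_loss(noise, target, temperature=2.0):
    # Soft count of detections, compared to the prompt's target count.
    soft = sigmoid(detector_logits(noise) / temperature).sum()
    return (soft - target) ** 2

def optimize_noise(noise, target, steps=400, lr=0.1, eps=1e-4):
    # Gradient descent on the noise prior; central finite differences
    # stand in for backprop through the real differentiable critic.
    noise = noise.copy()
    for _ in range(steps):
        grad = np.zeros_like(noise)
        for i in range(noise.size):
            e = np.zeros_like(noise)
            e[i] = eps
            grad[i] = (count_loss(noise + e, target)
                       - count_loss(noise - e, target)) / (2 * eps)
        noise -= lr * grad
    return noise
```

The pre-trained generator's weights are never touched; only the input noise moves, which is why the method adds little computational overhead and leaves overall image quality largely intact.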