
Heeding the Inner Voice: Aligning ControlNet Training via Intermediate Features Feedback

Nina Konovalova, Maxim Nikolaev, Andrey Kuznetsov, Aibek Alanov

2025-07-04


Summary

This paper introduces InnerControl, a method that improves how text-to-image AI models follow control inputs by making sure the image stays consistent and well-aligned with those inputs through every step of the image-making process.

What's the problem?

The problem is that controllable text-to-image models like ControlNet can drift away from the given control signal partway through the generation process, which can cause parts of the image to not match the user's instructions or to appear blurry and inconsistent.

What's the solution?

The researchers added lightweight convolutional probes that read the model's intermediate features at every denoising step and estimate how well the image-in-progress matches the input control signal. The mismatch is fed back to the model as an extra training signal, enforcing spatial consistency throughout the whole process rather than only at the end.
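The paper's actual probes and losses aren't reproduced here; below is a minimal NumPy sketch of the general idea, where `conv_probe` (a simple 1x1 convolution), `alignment_loss`, and all shapes are illustrative assumptions rather than the authors' implementation:

```python
import numpy as np

def conv_probe(features, weight, bias):
    # Hypothetical lightweight probe: a 1x1 convolution mapping
    # C intermediate-feature channels to a 1-channel control map.
    # features: (C, H, W), weight: (1, C), bias: (1,)
    return np.einsum('oc,chw->ohw', weight, features) + bias[:, None, None]

def alignment_loss(features_per_step, control_map, weight, bias):
    # Feedback signal: average MSE between the probe's prediction at
    # each diffusion step and the input control map, so consistency is
    # enforced across ALL steps, not just the final image.
    losses = [np.mean((conv_probe(f, weight, bias) - control_map) ** 2)
              for f in features_per_step]
    return float(np.mean(losses))
```

During training, this loss would be added to the usual diffusion objective, so gradients flow back through the intermediate features and nudge them toward the control condition at every step.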

Why it matters?

This matters because it helps generate better, clearer, and more accurate images from text inputs, making AI tools more reliable for artists, designers, and anyone using text-to-image technology.

Abstract

InnerControl enhances text-to-image diffusion models by enforcing spatial consistency across all diffusion steps using lightweight convolutional probes, improving control fidelity and generation quality.