V-Co: A Closer Look at Visual Representation Alignment via Co-Denoising
Han Lin, Xichen Pan, Zun Wang, Yue Zhang, Chu Wang, Jaemin Cho, Mohit Bansal
2026-03-18
Summary
This paper investigates how to improve image generation with diffusion models, a type of AI that creates images by gradually removing noise, specifically models that work directly with an image's pixels rather than a compressed (latent) version. It focuses on adding extra 'visual guidance' to these models to help them create more realistic and detailed images.
What's the problem?
While diffusion models that work directly with pixels can generate high-quality images without needing a separately pretrained compression model (an autoencoder), they sometimes struggle to capture the overall meaning or structure of what they're creating. Previous attempts to fix this by adding visual cues have bundled many design changes together, making it hard to tell which changes actually helped and why.
What's the solution?
The researchers created a system called V-Co to carefully test different ways of adding visual guidance to pixel-space diffusion models. They built a controlled setup where each design choice could be isolated and evaluated on its own. They found that four things are key: separate processing paths (a dual-stream design) for the image and the visual guidance, a well-defined 'unconditional' prediction so that guidance during generation works properly, a hybrid loss that provides stronger semantic supervision, and a rescaling step that keeps the information from the image and the guidance balanced. Combining these elements resulted in a simple and effective recipe for improving image generation.
Why it matters?
This work provides a clear recipe for building better image generation models. By identifying which components of visual guidance matter most, it helps researchers and developers create more powerful and efficient AI systems that generate high-quality images with a better understanding of visual content, and it does so using fewer training epochs than comparable models.
Abstract
Pixel-space diffusion has recently re-emerged as a strong alternative to latent diffusion, enabling high-quality generation without pretrained autoencoders. However, standard pixel-space diffusion models receive relatively weak semantic supervision and are not explicitly designed to capture high-level visual structure. Recent representation-alignment methods (e.g., REPA) suggest that pretrained visual features can substantially improve diffusion training, and visual co-denoising has emerged as a promising direction for incorporating such features into the generative process. However, existing co-denoising approaches often entangle multiple design choices, making it unclear which of them are truly essential. To address this, we present V-Co, a systematic study of visual co-denoising in a unified JiT-based framework. This controlled setting allows us to isolate the ingredients that make visual co-denoising effective. Our study reveals four key ingredients for effective visual co-denoising. First, preserving feature-specific computation while enabling flexible cross-stream interaction motivates a fully dual-stream architecture. Second, effective classifier-free guidance (CFG) requires a structurally defined unconditional prediction. Third, stronger semantic supervision is best provided by a perceptual-drifting hybrid loss. Fourth, stable co-denoising further requires proper cross-stream calibration, which we realize through RMS-based feature rescaling. Together, these findings yield a simple recipe for visual co-denoising. Experiments on ImageNet-256 show that, at comparable model sizes, V-Co outperforms the underlying pixel-space diffusion baseline and strong prior pixel-diffusion methods while using fewer training epochs, offering practical guidance for future representation-aligned generative models.
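To make two of these ingredients more concrete, below is a minimal PyTorch-style sketch of (a) RMS-based feature rescaling for cross-stream calibration and (b) the standard classifier-free guidance combination. This is an illustrative sketch under stated assumptions, not the paper's implementation: the function names, the per-token normalization axis, and the use of the image stream as the RMS reference are assumptions, and V-Co's 'structurally defined' unconditional prediction is not reproduced here, only the generic CFG formula it would plug into.

```python
# Illustrative sketch only; exact details in V-Co may differ.
import torch


def rms_rescale(guidance_feats: torch.Tensor,
                image_feats: torch.Tensor,
                eps: float = 1e-6) -> torch.Tensor:
    """Rescale the visual-guidance stream so its per-token RMS matches the
    image stream's, keeping the two streams on a comparable scale before they
    interact (one plausible reading of "RMS-based feature rescaling")."""
    rms_g = guidance_feats.pow(2).mean(dim=-1, keepdim=True).sqrt()
    rms_x = image_feats.pow(2).mean(dim=-1, keepdim=True).sqrt()
    return guidance_feats * (rms_x / (rms_g + eps))


def cfg_combine(pred_cond: torch.Tensor,
                pred_uncond: torch.Tensor,
                guidance_scale: float) -> torch.Tensor:
    """Standard classifier-free guidance: extrapolate from the unconditional
    prediction toward the conditional one. The paper's contribution concerns
    how pred_uncond is defined, which is not shown here."""
    return pred_uncond + guidance_scale * (pred_cond - pred_uncond)
```

The rescaling step matters because features from a pretrained visual encoder and features from the noisy image stream can sit at very different magnitudes; matching their RMS before cross-stream interaction is one simple way to keep training stable.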