Weak-to-Strong Diffusion with Reflection
Lichen Bai, Masashi Sugiyama, Zeke Xie
2025-02-07
Summary
This paper introduces Weak-to-Strong Diffusion (W2SD), a new method for improving AI models that generate images, videos, and other content. W2SD uses the difference between a weaker and a stronger AI model to make the stronger model even better.
What's the problem?
Current AI models that generate content, like images or videos, often have a gap between what they produce and real-world data. This happens because of limitations in training data, model design, and other factors. As a result, the generated content might not look as realistic or match user instructions as well as desired.
What's the solution?
The researchers developed W2SD, which estimates the difference between how a weaker and a stronger AI model behave and uses that difference to improve the stronger model's outputs. W2SD uses a process called 'reflection' that alternates between denoising (generating content) and inversion (reversing that generation), guided by the weak-to-strong difference. Each reflection nudges the intermediate result closer to the real data distribution. W2SD is flexible and can be used with different types of AI models and for various kinds of content.
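To make the reflection idea concrete, here is a minimal toy sketch of the denoise-then-invert loop. Everything in it is an assumption for illustration: the function names (`strong_denoise`, `weak_invert`, `w2sd_reflection`), the one-dimensional "latent", and the linear toy models pulling toward hypothetical means `mu_strong` and `mu_weak` are all made up; the actual W2SD operates on diffusion-model latents with real denoising and inversion steps.

```python
def strong_denoise(x, a=0.5, mu_strong=1.0):
    # Hypothetical strong model: one denoising step toward its learned mean.
    return x + a * (mu_strong - x)

def weak_invert(y, a=0.5, mu_weak=0.0):
    # Hypothetical weak model: exact inverse of the weak denoising step,
    # where weak_denoise(x) = (1 - a) * x + a * mu_weak.
    return (y - a * mu_weak) / (1 - a)

def w2sd_reflection(x_t, num_reflections=1):
    """Toy W2SD reflective refinement: denoise with the strong model,
    then invert with the weak model. In this linear toy, each round
    displaces the latent by a * (mu_strong - mu_weak) / (1 - a),
    i.e. along the weak-to-strong difference, before a final
    denoising step with the strong model."""
    for _ in range(num_reflections):
        x_t = weak_invert(strong_denoise(x_t))
    return strong_denoise(x_t)
```

In this toy setup, starting from `x_t = 0.0`, a plain strong denoising step gives 0.5, while one reflection round gives 1.0: the denoise/invert alternation pushes the latent further in the direction the strong model is better at than the weak one, which is the intuition behind the paper's reflective operation.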
Why it matters?
This research matters because it offers a way to make AI-generated content look more realistic and better match what users want. It can improve things like image quality, how well the AI follows instructions, and overall user satisfaction. W2SD could lead to better AI tools for creating art, videos, and other media, potentially changing how we produce and interact with digital content in fields like entertainment, advertising, and education.
Abstract
The goal of diffusion generative models is to align the learned distribution with the real data distribution through gradient score matching. However, inherent limitations in training data quality, modeling strategies, and architectural design lead to an inevitable gap between generated outputs and real data. To reduce this gap, we propose Weak-to-Strong Diffusion (W2SD), a novel framework that utilizes the estimated difference between existing weak and strong models (i.e., the weak-to-strong difference) to approximate the gap between an ideal model and a strong model. By employing a reflective operation that alternates between denoising and inversion with the weak-to-strong difference, we theoretically show that W2SD steers latent variables along sampling trajectories toward regions of the real data distribution. W2SD is highly flexible and broadly applicable, enabling diverse improvements through the strategic selection of weak-to-strong model pairs (e.g., DreamShaper vs. SD1.5, good experts vs. bad experts in MoE). Extensive experiments demonstrate that W2SD significantly improves human preference, aesthetic quality, and prompt adherence, achieving SOTA performance across various modalities (e.g., image, video), architectures (e.g., UNet-based, DiT-based, MoE), and benchmarks. For example, Juggernaut-XL with W2SD achieves an HPSv2 winning rate of up to 90% over the original results. Moreover, the performance gains achieved by W2SD markedly outweigh its additional computational overhead, while the cumulative improvements from different weak-to-strong differences further solidify its practical utility and deployability.