Diff-2-in-1: Bridging Generation and Dense Perception with Diffusion Models
Shuhong Zheng, Zhipeng Bao, Ruoyu Zhao, Martial Hebert, Yu-Xiong Wang
2024-11-08

Summary
This paper presents Diff-2-in-1, a new framework that unifies multi-modal data generation and dense visual perception within a single diffusion model.
What's the problem?
Most existing systems use diffusion models for only one side of the problem: they either generate images or borrow the models for perception as off-the-shelf data augmenters or feature extractors. Keeping the two tasks separate limits how cohesively a model can both understand and create visual content, which matters for applications that need generation and perception together.
What's the solution?
Diff-2-in-1 introduces a unified approach in which a single diffusion model both generates multi-modal data (such as images paired with dense perception annotations) and performs visual perception. It exploits the diffusion-denoising process itself: the denoising network creates realistic synthetic data that mirror the original training distribution, and that data in turn strengthens the perception side. A self-improving learning mechanism ties the two together, helping the model get better at both tasks over time. The researchers' experiments show that the framework consistently outperforms previous methods, producing high-quality visual content while improving perception accuracy across different backbones.
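To make this create-and-exploit loop concrete, here is a minimal PyTorch-style sketch of one hypothetical training round: the denoising network synthesizes image/label pairs, a perception head is trained on real plus synthetic pairs, and an exponential-moving-average copy of the head is updated so that later rounds can benefit from the improved model. The generate_pair and extract_features calls, the loss weighting, and the EMA step are assumptions for illustration, not the authors' implementation.

# Minimal sketch of one create-and-exploit round as described above.
# `diffusion.generate_pair` and `diffusion.extract_features` are hypothetical
# stand-ins for the paper's multi-modal generation and feature extraction;
# the loss weighting and EMA update are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PerceptionHead(nn.Module):
    """Decodes diffusion features into a dense map (e.g., depth or normals)."""
    def __init__(self, feat_dim: int = 256, out_ch: int = 1):
        super().__init__()
        self.decoder = nn.Sequential(
            nn.Conv2d(feat_dim, 128, 3, padding=1), nn.ReLU(),
            nn.Conv2d(128, out_ch, 1),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.decoder(feats)

def self_improving_round(diffusion, head, ema_head, optimizer,
                         real_imgs, real_labels, ema_decay=0.999):
    # 1) Creation: the denoising network synthesizes image/label pairs that
    #    mirror the training distribution (hypothetical generate_pair call).
    with torch.no_grad():
        synth_imgs, synth_labels = diffusion.generate_pair(real_imgs.shape[0])

    # 2) Exploitation: train the perception head on real and synthetic pairs.
    optimizer.zero_grad()
    loss_real = F.l1_loss(head(diffusion.extract_features(real_imgs)), real_labels)
    loss_synth = F.l1_loss(head(diffusion.extract_features(synth_imgs)), synth_labels)
    loss = loss_real + 0.5 * loss_synth  # synthetic-data weight is an assumption
    loss.backward()
    optimizer.step()

    # 3) Self-improvement (simplified): keep an EMA copy of the head that the
    #    full framework could use to steer how later synthetic data is labeled
    #    or weighted; here it is only updated as a placeholder for that step.
    with torch.no_grad():
        for p_ema, p in zip(ema_head.parameters(), head.parameters()):
            p_ema.mul_(ema_decay).add_(p, alpha=1.0 - ema_decay)
    return loss.item()

# Typical setup (illustrative): head = PerceptionHead(); ema_head = copy.deepcopy(head)

Iterating this round is the intuition behind the self-improving mechanism: the generated data and the perception model refine each other over successive rounds.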
Why it matters?
This research is significant because it bridges the gap between creating and understanding visual content, making diffusion models more versatile. By letting a single model handle both tasks, Diff-2-in-1 can enhance applications in computer vision, virtual reality, and interactive media, ultimately leading to richer user experiences.
Abstract
Beyond high-fidelity image synthesis, diffusion models have recently exhibited promising results in dense visual perception tasks. However, most existing work treats diffusion models as a standalone component for perception tasks, employing them either solely for off-the-shelf data augmentation or as mere feature extractors. In contrast to these isolated and thus sub-optimal efforts, we introduce a unified, versatile, diffusion-based framework, Diff-2-in-1, that can simultaneously handle both multi-modal data generation and dense visual perception, through a unique exploitation of the diffusion-denoising process. Within this framework, we further enhance discriminative visual perception via multi-modal generation, by utilizing the denoising network to create multi-modal data that mirror the distribution of the original training set. Importantly, Diff-2-in-1 optimizes the utilization of the created diverse and faithful data by leveraging a novel self-improving learning mechanism. Comprehensive experimental evaluations validate the effectiveness of our framework, showcasing consistent performance improvements across various discriminative backbones and high-quality multi-modal data generation characterized by both realism and usefulness.