
One Diffusion to Generate Them All

Duong H. Le, Tuan Pham, Sangho Lee, Christopher Clark, Aniruddha Kembhavi, Stephan Mandt, Ranjay Krishna, Jiasen Lu

2024-11-26


Summary

This paper presents OneDiffusion, a single large-scale diffusion model that can both generate and understand images from many different kinds of input, which makes one model useful for a wide range of tasks.

What's the problem?

Creating high-quality images and understanding them in different contexts is hard with traditional models, which are usually built for one specific task. Covering many tasks therefore means maintaining a separate architecture for each one, which is inefficient and complex.

What's the solution?

OneDiffusion addresses this problem with a single, versatile model that handles many tasks, such as generating images from text, deblurring and upscaling images, and estimating depth from an image. It does this by treating every task as a sequence of frames (for example, an image together with its depth map or pose map), where each frame receives its own noise level during training. Because of this, any clean frame can act as the conditioning input at inference time, while the noisy frames are the ones being generated. The model is trained on One-Gen, a dataset that combines many types of image data so the same model performs well across tasks.
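To make the "one noise level per frame" idea concrete, here is a minimal PyTorch sketch. The names (ToyDenoiser, training_step), the tiny convolutional backbone, and the simple noise-prediction loss are illustrative assumptions, not the authors' implementation; the point is only that each frame in a sequence is noised with its own independently sampled scale and the model sees all frames jointly.

```python
import torch
import torch.nn as nn

class ToyDenoiser(nn.Module):
    """Tiny stand-in for the paper's backbone (the real model is a large transformer)."""
    def __init__(self, channels=4):
        super().__init__()
        self.net = nn.Conv3d(channels + 1, channels, kernel_size=3, padding=1)

    def forward(self, frames, noise_scales):
        # frames: (batch, channels, num_frames, H, W)
        # noise_scales: (batch, num_frames), appended as an extra channel so the
        # model knows how noisy each individual frame is.
        b, c, f, h, w = frames.shape
        scale_map = noise_scales[:, None, :, None, None].expand(b, 1, f, h, w)
        return self.net(torch.cat([frames, scale_map], dim=1))


def training_step(model, clean_frames):
    """One training step of the per-frame noise idea: every frame in the sequence
    (e.g. an image, its depth map, its pose map) gets its own noise level."""
    b, c, f, h, w = clean_frames.shape
    noise_scales = torch.rand(b, f)                 # one noise level per frame
    noise = torch.randn_like(clean_frames)
    scale = noise_scales[:, None, :, None, None]
    noisy_frames = (1 - scale) * clean_frames + scale * noise
    # The model jointly sees all frames and predicts the added noise
    # (a common diffusion training target; the paper's exact loss may differ).
    pred = model(noisy_frames, noise_scales)
    return ((pred - noise) ** 2).mean()


model = ToyDenoiser()
frames = torch.randn(2, 4, 3, 32, 32)  # batch of 2 sequences, 3 frames each
print(training_step(model, frames).item())
```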

Why it matters?

This research is significant because it simplifies the process of image generation and understanding by using one model for many tasks instead of needing different models for each one. This can lead to more efficient AI systems that are easier to use and adapt to new challenges in fields like computer vision, gaming, and virtual reality.

Abstract

We introduce OneDiffusion, a versatile, large-scale diffusion model that seamlessly supports bidirectional image synthesis and understanding across diverse tasks. It enables conditional generation from inputs such as text, depth, pose, layout, and semantic maps, while also handling tasks like image deblurring, upscaling, and reverse processes such as depth estimation and segmentation. Additionally, OneDiffusion allows for multi-view generation, camera pose estimation, and instant personalization using sequential image inputs. Our model takes a straightforward yet effective approach by treating all tasks as frame sequences with varying noise scales during training, allowing any frame to act as a conditioning image at inference time. Our unified training framework removes the need for specialized architectures, supports scalable multi-task training, and adapts smoothly to any resolution, enhancing both generalization and scalability. Experimental results demonstrate competitive performance across tasks in both generation and prediction, such as text-to-image, multi-view generation, ID preservation, depth estimation, and camera pose estimation, despite a relatively small training dataset. Our code and checkpoint are freely available at https://github.com/lehduong/OneDiffusion.
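As a rough illustration of the inference-time behavior the abstract describes, the sketch below builds a frame sequence in which the conditioning frame is kept clean (noise scale 0) while the frames to be generated start as pure noise (scale 1). The helper make_inference_inputs is hypothetical and not part of the released code; a real sampler would then iteratively denoise only the scale-1 frames while keeping the clean frame fixed.

```python
import torch

def make_inference_inputs(condition_frame, num_target_frames=1):
    """Hypothetical sketch of 'any frame can condition' at inference time.

    condition_frame: (channels, H, W) clean input, e.g. an RGB latent when
    estimating depth, or a depth map when generating an image from depth.
    Returns a frame sequence plus per-frame noise scales: the condition frame
    keeps scale 0 (clean), target frames start from pure noise (scale 1).
    """
    c, h, w = condition_frame.shape
    targets = torch.randn(num_target_frames, c, h, w)          # frames to generate
    frames = torch.cat([condition_frame[None], targets], dim=0)
    noise_scales = torch.tensor([0.0] + [1.0] * num_target_frames)
    return frames, noise_scales

# Example: condition on one clean frame, generate one target frame.
cond = torch.zeros(4, 32, 32)  # placeholder latent for the conditioning image
frames, scales = make_inference_inputs(cond)
print(frames.shape, scales)    # torch.Size([2, 4, 32, 32]) tensor([0., 1.])
```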