Fine-Tuning Image-Conditional Diffusion Models is Easier than You Think
Gonzalo Martin Garcia, Karim Abou Zeid, Christian Schmidt, Daan de Geus, Alexander Hermans, Bastian Leibe
2024-09-18

Summary
This paper presents a method for fine-tuning image-conditional diffusion models, showing that it's easier and faster than previously thought, especially for estimating depth from a single image.
What's the problem?
Existing diffusion models for depth estimation are slow because they require many denoising steps to produce accurate results. This inefficiency limits their practical use, making them less appealing for real-time applications where speed matters.
What's the solution?
The researchers discovered that the inefficiency was due to an unnoticed flaw in the inference pipeline of these models. By fixing this flaw, they obtained a single-step model that runs more than 200 times faster than the best previously reported configuration while maintaining comparable accuracy. They also introduced an end-to-end fine-tuning approach with task-specific losses that allows the model to outperform other diffusion-based depth estimation models without extensive additional training.
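The summary leaves the flaw unnamed, but the paper traces it to how the inference pipeline schedules denoising timesteps: with few steps, the model is handed pure noise while being told it sits at a much lower noise level. The snippet below is a minimal sketch of this class of bug using the Hugging Face diffusers DDIMScheduler; the settings and printed values are illustrative, and the authors' actual pipeline may differ in detail.

```python
from diffusers import DDIMScheduler

# With the default "leading" timestep spacing, single-step inference asks
# the model to denoise at t ~ 0 (almost no noise) even though the pipeline
# feeds it pure Gaussian noise: a silent train/test mismatch.
flawed = DDIMScheduler(num_train_timesteps=1000, timestep_spacing="leading")
flawed.set_timesteps(num_inference_steps=1)
print(flawed.timesteps)  # tensor([0]): wrong noise level for a pure-noise input

# "trailing" spacing starts the schedule at the final training timestep, so
# a single denoising step sees the noise level the model was trained on.
fixed = DDIMScheduler(num_train_timesteps=1000, timestep_spacing="trailing")
fixed.set_timesteps(num_inference_steps=1)
print(fixed.timesteps)  # tensor([999]): matches pure noise at t = T - 1
```

Once the timesteps are corrected, a single denoising pass is enough, which is where the reported 200× speedup over multi-step inference comes from.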
Why it matters?
This research is significant because it demonstrates that fine-tuning image-conditional diffusion models can be much more efficient than previously believed. By improving speed and performance, this work opens up new possibilities for using these models in real-world applications like robotics, autonomous vehicles, and augmented reality, where quick and accurate depth estimation is crucial.
Abstract
Recent work showed that large diffusion models can be reused as highly precise monocular depth estimators by casting depth estimation as an image-conditional image generation task. While the proposed model achieved state-of-the-art results, high computational demands due to multi-step inference limited its use in many scenarios. In this paper, we show that the perceived inefficiency was caused by a flaw in the inference pipeline that has so far gone unnoticed. The fixed model performs comparably to the best previously reported configuration while being more than 200× faster. To optimize for downstream task performance, we perform end-to-end fine-tuning on top of the single-step model with task-specific losses and get a deterministic model that outperforms all other diffusion-based depth and normal estimation models on common zero-shot benchmarks. We surprisingly find that this fine-tuning protocol also works directly on Stable Diffusion and achieves comparable performance to current state-of-the-art diffusion-based depth and normal estimation models, calling into question some of the conclusions drawn from prior works.
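To make the fine-tuning recipe concrete, here is a minimal sketch of end-to-end training with a task-specific loss. Everything in it is illustrative rather than the authors' code: OneStepDepthModel is a hypothetical stand-in for the real single-step pipeline (VAE encoder, one UNet pass, VAE decoder), and the scale-and-shift-invariant L1 loss is one common choice for affine-invariant depth; the paper's exact loss may differ.

```python
import torch
import torch.nn as nn

class OneStepDepthModel(nn.Module):
    """Hypothetical stand-in for the single-step diffusion depth estimator;
    the real model wraps a latent VAE and a UNet run at the final timestep."""
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(3, 1, kernel_size=3, padding=1)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # One deterministic forward pass, no iterative sampling loop.
        return self.net(image)

def ssi_l1(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Scale-and-shift-invariant L1 loss (an assumed task-specific loss for
    affine-invariant depth): align pred to target with a per-image
    least-squares scale s and shift t, then penalize the residual."""
    p, d = pred.flatten(1), target.flatten(1)
    pm, dm = p.mean(1, keepdim=True), d.mean(1, keepdim=True)
    s = ((p - pm) * (d - dm)).mean(1, keepdim=True) / ((p - pm).pow(2).mean(1, keepdim=True) + eps)
    t = dm - s * pm
    return (s * p + t - d).abs().mean()

model = OneStepDepthModel()
opt = torch.optim.AdamW(model.parameters(), lr=3e-5)
image = torch.rand(2, 3, 64, 64)     # dummy batch of RGB images
depth_gt = torch.rand(2, 1, 64, 64)  # dummy ground-truth depth maps

pred = model(image)
loss = ssi_l1(pred, depth_gt)
loss.backward()  # gradients flow through the entire single-step pipeline,
opt.step()       # which is what "end-to-end fine-tuning" means here
```

Because the fine-tuned model is deterministic (a single forward pass, no sampling), the same loop applies whether the starting point is the fixed single-step model or, as the abstract notes, plain Stable Diffusion.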