MIDI: Multi-Instance Diffusion for Single Image to 3D Scene Generation
Zehuan Huang, Yuan-Chen Guo, Xingqiao An, Yunhan Yang, Yangguang Li, Zi-Xin Zou, Ding Liang, Xihui Liu, Yan-Pei Cao, Lu Sheng
2024-12-05

Summary
This paper introduces MIDI, a new method for creating detailed 3D scenes from a single image using a technique called multi-instance diffusion.
What's the problem?
Generating a 3D scene from a single image is challenging because most existing methods either rely on reconstruction or retrieval, or generate objects one at a time in slow, multi-stage pipelines. These approaches often struggle to capture accurate spatial relationships between the objects in a scene, which makes it hard to produce coherent, realistic 3D environments.
What's the solution?
MIDI addresses these challenges by generating multiple 3D objects simultaneously from a single image. It extends a pre-trained image-to-3D object generator with a multi-instance attention mechanism that captures how the objects in a scene interact and keeps their spatial relationships consistent, so a cohesive 3D scene emerges directly from the diffusion process rather than from a complicated multi-step pipeline (a minimal sketch of this cross-instance attention appears below). Taking partial object images and the global scene context as input, the model completes occluded objects during generation, and it is trained on a small amount of scene-level data together with single-object data so that it keeps the pre-trained model's broad generalization ability.
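The core idea can be pictured as follows: inside the diffusion model's attention layers, tokens belonging to one object instance can attend to the tokens of every other instance, so each object is denoised with awareness of the rest of the scene. The module below is an illustrative PyTorch sketch of such cross-instance attention, not the actual MIDI implementation; the class name, tensor shapes, and use of `nn.MultiheadAttention` are all assumptions made for the example.

```python
# Illustrative sketch of cross-instance attention, assuming per-instance latent
# tokens of shape (num_instances, num_tokens, dim). Names are hypothetical and
# not taken from the MIDI codebase.
import torch
import torch.nn as nn


class MultiInstanceAttention(nn.Module):
    """Attention in which keys/values span all instances in the scene,
    so each object's tokens can attend to every other object's tokens."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_instances, num_tokens, dim) -- one token sequence per 3D instance
        n, t, d = x.shape
        # Queries stay per-instance; keys/values are shared across all instances,
        # which is what lets objects be generated with mutually consistent layout.
        kv = x.reshape(1, n * t, d).expand(n, -1, -1)
        out, _ = self.attn(query=x, key=kv, value=kv)
        return out
```

In a standard image-to-3D diffusion model each object would only attend to its own tokens; widening the key/value set to the whole scene is the minimal change that lets the generator reason about inter-object placement.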
Why it matters?
This research is important because it simplifies the process of creating realistic 3D environments, making it more accessible for applications like video games, virtual reality, and architectural design. By improving how AI can generate 3D scenes from single images, MIDI opens up new possibilities for creative projects and enhances the realism of digital content.
Abstract
This paper introduces MIDI, a novel paradigm for compositional 3D scene generation from a single image. Unlike existing methods that rely on reconstruction or retrieval techniques or recent approaches that employ multi-stage object-by-object generation, MIDI extends pre-trained image-to-3D object generation models to multi-instance diffusion models, enabling the simultaneous generation of multiple 3D instances with accurate spatial relationships and high generalizability. At its core, MIDI incorporates a novel multi-instance attention mechanism that effectively captures inter-object interactions and spatial coherence directly within the generation process, without the need for complex multi-step processes. The method utilizes partial object images and global scene context as inputs, directly modeling object completion during 3D generation. During training, we effectively supervise the interactions between 3D instances using a limited amount of scene-level data, while incorporating single-object data for regularization, thereby maintaining the pre-trained generalization ability. MIDI demonstrates state-of-the-art performance in image-to-scene generation, validated through evaluations on synthetic data, real-world scene data, and stylized scene images generated by text-to-image diffusion models.
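To make the training recipe in the last sentences of the abstract concrete, here is a hedged sketch of mixing scene-level supervision with single-object regularization. The sampling ratio, loss weights, noise schedule, and the `joint` conditioning flag are illustrative assumptions, not details taken from the paper or its code.

```python
# Hypothetical training step mixing scene-level batches (which supervise
# inter-instance interactions) with single-object batches (which regularize
# toward the pre-trained image-to-3D model). All names are illustrative.
import random
import torch
import torch.nn.functional as F


def add_noise(x, noise, t, num_steps: int = 1000):
    # Stand-in linear schedule for the diffusion forward process (illustrative only).
    alpha = 1.0 - t.float().view(-1, *([1] * (x.dim() - 1))) / num_steps
    return alpha.sqrt() * x + (1.0 - alpha).sqrt() * noise


def training_step(model, scene_batches, object_batches, optimizer,
                  scene_prob: float = 0.5, reg_weight: float = 1.0):
    """One optimization step drawing either a scene-level or a single-object sample."""
    optimizer.zero_grad()
    if random.random() < scene_prob:
        # Scene-level sample: all instances of one scene are denoised jointly,
        # so the cross-instance attention is supervised by real object layouts.
        images, latents = next(scene_batches)      # latents: (num_instances, ...)
        noise = torch.randn_like(latents)
        t = torch.randint(0, 1000, (latents.shape[0],))
        pred = model(add_noise(latents, noise, t), t, cond=images, joint=True)
        loss = F.mse_loss(pred, noise)
    else:
        # Single-object sample: ordinary per-object diffusion loss, acting as a
        # regularizer that preserves the pre-trained model's generalization.
        image, latent = next(object_batches)
        noise = torch.randn_like(latent)
        t = torch.randint(0, 1000, (latent.shape[0],))
        pred = model(add_noise(latent, noise, t), t, cond=image, joint=False)
        loss = reg_weight * F.mse_loss(pred, noise)
    loss.backward()
    optimizer.step()
    return loss.item()
```

The point of the sketch is the data mix, not the specific losses: scene batches teach the model how instances relate, while single-object batches keep it from overfitting to the limited scene-level data.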