DiT360: High-Fidelity Panoramic Image Generation via Hybrid Training
Haoran Feng, Dizhe Zhang, Xiangtai Li, Bo Du, Lu Qi
2025-10-14
Summary
This paper introduces DiT360, a new system for creating realistic 360-degree images from different types of input, like regular photos or incomplete images.
What's the problem?
Generating high-quality 360-degree images is difficult because there isn't a lot of readily available, real-world data to train these systems. Existing methods often focus on improving the model itself, but this research argues the biggest issue is the lack of good training data, leading to problems with realistic details and keeping the image consistent all the way around.
What's the solution?
DiT360 tackles this with a combination of techniques. It learns from both standard perspective images and 360-degree images, essentially transferring knowledge between the two. It improves the images at two stages: first, by using perspective images to guide the creation of the 360 view and refine the overall look, and second, by adding special 'rules' during image processing to ensure a smooth transition where the left and right edges of the panorama meet, handle rotations of the viewpoint correctly, and keep distortion under control. These 'rules' act like extra checks to make sure the final image is high quality.
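The "smooth transition at the edges" idea can be illustrated with circular padding: an equirectangular panorama wraps around in longitude, so the pixels just past the right edge are the pixels at the left edge. A minimal NumPy sketch of that wrap-around padding (the function name and array shapes are our illustration, not the paper's code):

```python
import numpy as np

def circular_pad(pano, pad):
    """Pad an equirectangular image (H, W, C) by wrapping in longitude.

    The panorama is periodic along its width, so we copy the rightmost
    columns onto the left side and the leftmost columns onto the right
    side. Convolving over the padded image then sees no seam at the
    left/right boundary.
    """
    left = pano[:, -pad:]   # rightmost columns wrap to the left side
    right = pano[:, :pad]   # leftmost columns wrap to the right side
    return np.concatenate([left, pano, right], axis=1)

# Tiny example: a 2 x 8 single-channel "panorama".
pano = np.arange(2 * 8).reshape(2, 8, 1)
padded = circular_pad(pano, 2)  # width grows from 8 to 12
```

In a generation model this padding would typically be applied before each convolution or attention window so that content near the seam is produced with full horizontal context.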
Why it matters?
This work is important because it shows that improving the *data* used to train these systems can be just as, if not more, important than just making the model more complex. By achieving better consistency and realism in generated 360-degree images, it opens up possibilities for more immersive virtual reality experiences, better image editing tools, and more realistic content creation.
Abstract
In this work, we propose DiT360, a DiT-based framework that performs hybrid training on perspective and panoramic data for panoramic image generation. We attribute the difficulty of maintaining geometric fidelity and photorealism mainly to the lack of large-scale, high-quality, real-world panoramic data; this data-centric view differs from prior methods that focus on model design. DiT360 consists of several key modules for inter-domain transformation and intra-domain augmentation, applied at both the pre-VAE image level and the post-VAE token level. At the image level, we incorporate cross-domain knowledge through perspective image guidance and panoramic refinement, which enhance perceptual quality while regularizing diversity and photorealism. At the token level, hybrid supervision is applied across multiple modules, including circular padding for boundary continuity, a yaw loss for rotational robustness, and a cube loss for distortion awareness. Extensive experiments on text-to-panorama, inpainting, and outpainting tasks demonstrate that our method achieves better boundary consistency and image fidelity across eleven quantitative metrics. Our code is available at https://github.com/Insta360-Research-Team/DiT360.
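The yaw loss named in the abstract presumably builds on the fact that rotating a panoramic camera about its vertical (yaw) axis is exactly a circular horizontal shift of the equirectangular image, so a generator can be penalized when its output is not consistent under such shifts. A hedged sketch of that consistency check (the function names, the generator signature, and the L1 penalty are our assumptions, not the paper's implementation):

```python
import numpy as np

def yaw_rotate(pano, shift):
    """Rotate an equirectangular panorama (H, W, C) about the yaw axis.

    A camera yaw rotation maps to a circular shift along the width axis,
    so np.roll is exact up to pixel quantization of the angle.
    """
    return np.roll(pano, shift, axis=1)

def yaw_consistency_loss(generate, x, shift):
    """Mean absolute difference between 'rotate then generate' and
    'generate then rotate'. It is zero when the (hypothetical)
    generator is yaw-equivariant on this input.
    """
    a = generate(yaw_rotate(x, shift))
    b = yaw_rotate(generate(x), shift)
    return float(np.abs(a - b).mean())

# A perfectly yaw-equivariant "generator" (the identity) incurs no loss.
x = np.random.default_rng(0).random((4, 16, 3))
loss = yaw_consistency_loss(lambda p: p, x, shift=5)
```

During training, a term like this would be averaged over randomly sampled yaw shifts, encouraging the model to treat every longitude of the panorama equivalently.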