SceneGen: Single-Image 3D Scene Generation in One Feedforward Pass

Yanxu Meng, Haoning Wu, Ya Zhang, Weidi Xie

2025-08-22

Summary

This research introduces SceneGen, a system that automatically generates multiple 3D objects from a single image, effectively building a virtual scene from a picture.

What's the problem?

Creating 3D models is hard, and usually requires either designing them by hand or finding existing ones that fit your needs. Existing methods for automatically generating 3D content often require lengthy per-scene optimization or rely on retrieving pre-made assets from a database. The challenge here is to generate multiple, realistic 3D objects directly from a single image without searching for existing models or going through a slow optimization process.

What's the solution?

The researchers developed SceneGen, a system that takes a regular image together with masks identifying which parts are objects. It then builds 3D models of those objects, complete with geometry and texture, all in a single feedforward pass. A key part of their approach is a feature aggregation module that combines local and global information from visual and geometric encoders; together with a position head, this lets the model predict where each object sits in the scene. Notably, even though SceneGen was trained only on single images, it performs even better when given multiple images of the same scene. The code for SceneGen is available online so others can use and build upon it.
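To make the idea concrete, here is a minimal toy sketch of that per-object pipeline: pool "local" features under each object mask, pool "global" features over the whole scene, concatenate the two, and run the result through a position head. Everything here is a hypothetical stand-in, not the paper's actual implementation: the random feature map replaces the visual/geometric encoders, concatenation stands in for the aggregation module, and the position head is a plain linear map.

```python
import numpy as np

rng = np.random.default_rng(0)

H, W, C = 32, 32, 16   # toy feature-map size and channel count
D = 2 * C              # aggregated feature size (local + global)

# Stand-in for the visual/geometric encoders: a random per-pixel feature map.
feat_map = rng.standard_normal((H, W, C))

# Two hypothetical object masks (boolean per-pixel membership).
masks = [np.zeros((H, W), bool), np.zeros((H, W), bool)]
masks[0][4:12, 4:12] = True
masks[1][18:28, 10:20] = True

def aggregate(feat_map, mask):
    """Combine a local (masked) object feature with a global scene feature."""
    local_feat = feat_map[mask].mean(axis=0)            # (C,) per-object
    global_feat = feat_map.reshape(-1, C).mean(axis=0)  # (C,) whole scene
    return np.concatenate([local_feat, global_feat])    # (2C,)

# Toy "position head": a linear map from aggregated features to a 3D offset.
W_pos = rng.standard_normal((D, 3)) * 0.1

# One relative 3D position per object, all in a single forward pass.
positions = np.stack([aggregate(feat_map, m) @ W_pos for m in masks])
print(positions.shape)  # (2, 3): one 3D position per masked object
```

In the real system the objects' geometry and texture are generated alongside these positions, but the sketch shows why the design needs no optimization loop: every object's placement comes from one pass over precomputed features.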

Why it matters?

This work is important because it offers a faster and more flexible way to create 3D content. This could be really useful for things like virtual reality, augmented reality, and even helping robots understand the world around them. By automatically generating 3D scenes from images, it removes a major bottleneck in creating immersive experiences and intelligent systems.

Abstract

3D content generation has recently attracted significant research interest due to its applications in VR/AR and embodied AI. In this work, we address the challenging task of synthesizing multiple 3D assets within a single scene image. Concretely, our contributions are fourfold: (i) we present SceneGen, a novel framework that takes a scene image and corresponding object masks as input, simultaneously producing multiple 3D assets with geometry and texture. Notably, SceneGen operates with no need for optimization or asset retrieval; (ii) we introduce a novel feature aggregation module that integrates local and global scene information from visual and geometric encoders within the feature extraction module. Coupled with a position head, this enables the generation of 3D assets and their relative spatial positions in a single feedforward pass; (iii) we demonstrate SceneGen's direct extensibility to multi-image input scenarios. Despite being trained solely on single-image inputs, our architectural design enables improved generation performance with multi-image inputs; and (iv) extensive quantitative and qualitative evaluations confirm the efficiency and robust generation abilities of our approach. We believe this paradigm offers a novel solution for high-quality 3D content generation, potentially advancing its practical applications in downstream tasks. The code and model will be publicly available at: https://mengmouxu.github.io/SceneGen.