FlowScene: Style-Consistent Indoor Scene Generation with Multimodal Graph Rectified Flow
Zhifei Yang, Guangyao Zhai, Keyang Lu, YuYang Yin, Chao Zhang, Zhen Xiao, Jieyi Long, Nassir Navab, Yikai Wang
2026-03-23
Summary
This paper introduces FlowScene, a method for generating realistic and controllable 3D indoor scenes from instructions that combine language and structural (graph) information.
What's the problem?
Automatic 3D scene creation currently falls into two camps, each with drawbacks. Methods driven by language descriptions can compose plausible scenes, but they offer little control over individual objects and struggle to keep a consistent style across the whole scene. Methods built on object-relationship graphs offer finer control, but they often produce scenes that lack realism and detailed textures.
What's the solution?
FlowScene tackles this with three generative branches working together: one for the scene layout, one for object shapes, and one for object textures. It takes the scene description as a 'multimodal graph', essentially a map of objects and the relationships between them. A technique called 'rectified flow' then lets the three branches continuously exchange object information while the scene is being built. The result is fine-grained control over the shape, texture, and arrangement of each object, while the scene as a whole stays visually consistent and stylish.
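The 'multimodal graph' idea can be pictured as objects annotated with per-object shape and texture codes, linked by relation edges. A minimal sketch in Python follows; the class and field names here are illustrative assumptions, not the paper's actual data structures:

```python
from dataclasses import dataclass, field

@dataclass
class ObjectNode:
    label: str                                          # e.g. "sofa"
    shape_latent: list = field(default_factory=list)    # geometry code (placeholder)
    texture_latent: list = field(default_factory=list)  # appearance code (placeholder)

@dataclass
class Relation:
    subject: int    # index of the subject node
    predicate: str  # e.g. "in front of", "left of"
    obj: int        # index of the object node

@dataclass
class SceneGraph:
    nodes: list
    edges: list

    def neighbors(self, i):
        """Indices of nodes related to node i, in either direction."""
        out = set()
        for e in self.edges:
            if e.subject == i:
                out.add(e.obj)
            if e.obj == i:
                out.add(e.subject)
        return sorted(out)

graph = SceneGraph(
    nodes=[ObjectNode("sofa"), ObjectNode("coffee table"), ObjectNode("lamp")],
    edges=[Relation(1, "in front of", 0), Relation(2, "left of", 0)],
)
print(graph.neighbors(0))  # → [1, 2]  (both table and lamp relate to the sofa)
```

A graph like this is what lets a generator reason about one object's shape or texture in the context of its neighbors, rather than in isolation.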
Why it matters?
This research is important because it moves us closer to being able to automatically generate high-quality 3D scenes for applications like video games, movies, or even designing virtual environments. By combining control and realism, FlowScene offers a significant improvement over existing methods and creates scenes that are more believable and aligned with what people expect.
Abstract
Scene generation has extensive industrial applications, demanding both high realism and precise control over geometry and appearance. Language-driven retrieval methods compose plausible scenes from a large object database, but overlook object-level control and often fail to enforce scene-level style coherence. Graph-based formulations offer higher controllability over objects and promote holistic consistency by explicitly modeling relations, yet existing methods struggle to produce high-fidelity textured results, thereby limiting their practical utility. We present FlowScene, a tri-branch scene generative model conditioned on multimodal graphs that collaboratively generates scene layouts, object shapes, and object textures. At its core lies a tightly coupled rectified flow model that exchanges object information during generation, enabling collaborative reasoning across the graph. This enables fine-grained control of objects' shapes, textures, and relations while enforcing scene-level style coherence across structure and appearance. Extensive experiments show that FlowScene outperforms both language-conditioned and graph-conditioned baselines in terms of generation realism, style consistency, and alignment with human preferences.
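For readers unfamiliar with rectified flow: it generates samples by learning a velocity field and integrating the ODE dx/dt = v(x, t) from noise (t = 0) to data (t = 1) along near-straight paths. The toy sketch below uses an oracle straight-line velocity for a single known noise/data pair (not a trained model, and not the paper's tri-branch architecture), so Euler integration lands exactly on the target:

```python
import numpy as np

def euler_sample(x0, velocity, steps=10):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with Euler steps."""
    x, dt = x0.copy(), 1.0 / steps
    for k in range(steps):
        t = k * dt
        x = x + dt * velocity(x, t)  # one Euler step
    return x

x0 = np.array([0.0, 0.0])   # "noise" start
x1 = np.array([2.0, -1.0])  # "data" target
v = lambda x, t: x1 - x0    # straight-line (rectified) velocity for this pair

print(euler_sample(x0, v))  # → [ 2. -1.]
```

In a trained rectified flow, v would be a neural network fit to such straight-line velocities over many noise/data pairs; FlowScene's contribution is running coupled flows of this kind for layout, shape, and texture that exchange information mid-generation.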