
Geometry-Guided Reinforcement Learning for Multi-view Consistent 3D Scene Editing

Jiyuan Wang, Chunyu Lin, Lei Sun, Zhi Cao, Yuyang Yin, Lang Nie, Zhenlong Yuan, Xiangxiang Chu, Yunchao Wei, Kang Liao, Guosheng Lin

2026-03-11


Summary

This paper introduces a method for editing 3D scenes through 2D image editing, with a focus on keeping the edited result consistent from every viewpoint.

What's the problem?

Editing a 3D object by changing its 2D views makes it hard to keep all the views coherent: edits can look distorted or broken as you rotate around the object. On top of that, there is very little training data where 3D objects have been edited in a way that is known to be correct, so it is difficult to teach a computer to do this well with traditional supervised methods.

What's the solution?

The researchers used reinforcement learning, a technique that teaches a computer through trial and error with rewards. Their system, called RL3DEdit, edits 2D images and then receives feedback on how consistent the edited views are as a 3D scene. That feedback comes from a powerful 3D foundation model, VGGT, whose confidence maps and pose-estimation errors reveal when the edited views fail to fit together. By rewarding edits that lead to consistent 3D views, the system learns to make better edits.
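As a rough illustration of what such a reward could look like, here is a minimal sketch. The function name, inputs, and the simple weighted sum are assumptions for illustration, not the paper's actual formulation: the idea is just that high VGGT confidence on the edited views raises the reward, while pose-estimation error lowers it.

```python
import numpy as np

def consistency_reward(confidence_maps, pose_errors, alpha=1.0, beta=1.0):
    # Hypothetical reward: rises with the average VGGT confidence on the
    # edited views, falls with the average pose-estimation error across views.
    conf_score = np.mean([np.mean(c) for c in confidence_maps])
    pose_penalty = np.mean(pose_errors)
    return alpha * conf_score - beta * pose_penalty

# Toy inputs: three edited views with per-pixel confidence maps and
# per-view pose errors (all values fabricated for illustration).
maps = [np.full((4, 4), 0.9), np.full((4, 4), 0.8), np.full((4, 4), 0.85)]
errors = [0.1, 0.2, 0.15]
reward = consistency_reward(maps, errors)  # higher = more 3D-consistent
```

An edit that distorts the scene would show up as low confidence or large pose errors, so its reward drops and the RL training pushes the editor away from it.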

Why it matters?

This work is important because it provides a way to edit 3D objects more effectively, even without a lot of example data. It opens the door to more realistic and user-friendly 3D editing tools, and the researchers are even sharing their code and model to help others build on this research.

Abstract

Leveraging the priors of 2D diffusion models for 3D editing has emerged as a promising paradigm. However, maintaining multi-view consistency in edited results remains challenging, and the extreme scarcity of 3D-consistent editing paired data renders supervised fine-tuning (SFT), the most effective training strategy for editing tasks, infeasible. In this paper, we observe that, while generating multi-view consistent 3D content is highly challenging, verifying 3D consistency is tractable, naturally positioning reinforcement learning (RL) as a feasible solution. Motivated by this, we propose RL3DEdit, a single-pass framework driven by RL optimization with novel rewards derived from the 3D foundation model, VGGT. Specifically, we leverage VGGT's robust priors learned from massive real-world data, feed the edited images, and utilize the output confidence maps and pose estimation errors as reward signals, effectively anchoring the 2D editing priors onto a 3D-consistent manifold via RL. Extensive experiments demonstrate that RL3DEdit achieves stable multi-view consistency and outperforms state-of-the-art methods in editing quality with high efficiency. To promote the development of 3D editing, we will release the code and model.
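The RL mechanism the abstract describes can be illustrated with a toy policy-gradient loop. This is not the paper's training procedure: a one-parameter Gaussian "editor" stands in for the diffusion model, and a stand-in reward peaked at a target value stands in for the VGGT-derived consistency reward. It shows only the core loop of sampling candidates, scoring them with a verifier-style reward, and nudging the policy toward high-reward samples.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: the "edit" is a single scalar, and consistency is
# perfect at target = 2.0 (the real reward would come from VGGT).
target = 2.0
mu, sigma, lr = 0.0, 0.5, 0.05

for step in range(500):
    samples = rng.normal(mu, sigma, size=16)        # candidate "edits"
    rewards = -(samples - target) ** 2              # stand-in consistency reward
    advantages = rewards - rewards.mean()           # mean baseline cuts variance
    grad_log_prob = (samples - mu) / sigma ** 2     # grad of log N(x; mu, sigma) wrt mu
    mu += lr * np.mean(advantages * grad_log_prob)  # REINFORCE-style update
```

Note that the loop only ever *scores* samples with the reward; it never needs ground-truth edited examples, which mirrors why RL is feasible here while supervised fine-tuning is not: verifying 3D consistency is tractable even when paired training data is unavailable.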