Towards Scalable and Consistent 3D Editing
Ruihao Xia, Yang Tang, Pan Zhou
2025-10-10
Summary
This paper focuses on making it easier to edit 3D models, a crucial capability for creating things like video game assets, virtual reality experiences, and augmented reality content.
What's the problem?
Currently, editing 3D models is hard. Unlike editing a 2D image, you have to make sure the changes look right from *all* angles and that the model doesn't get warped or broken. Existing methods are either slow, introduce noticeable distortions, or require users to painstakingly outline exactly which parts of the 3D model they want to change, which is time-consuming and error-prone.
What's the solution?
The researchers tackled this problem on two fronts. First, they created a large dataset called 3DEditVerse, containing over 116,000 paired examples of 3D models before and after an edit, designed to be high-quality and consistent across viewpoints. Second, they developed a new model called 3DEditFormer. It uses two attention-based techniques, dual-guidance attention and time-adaptive gating, to understand the structure of the 3D model and change only the parts you want to edit, leaving the rest intact, without needing manually drawn 3D masks.
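To give a feel for the time-adaptive gating idea, here is a minimal, illustrative sketch in plain Python. It is not the paper's implementation: the function names (`time_adaptive_gate`, `blend_features`), the logistic schedule, and its parameters are all assumptions chosen for illustration. The sketch shows one plausible way a timestep-dependent gate could blend features from a structure-preserving stream with features from an edit stream.

```python
import math

def sigmoid(x: float) -> float:
    # Plain logistic function; adequate for illustration.
    return 1.0 / (1.0 + math.exp(-x))

def time_adaptive_gate(t: float, w: float = 6.0, b: float = -3.0) -> float:
    """Map a normalized timestep t in [0, 1] to a blend weight in (0, 1).

    Hypothetical schedule: early steps (t near 0) weight the preserved
    structural features heavily; late steps weight the edit features.
    The slope w and bias b are illustrative, not from the paper.
    """
    return sigmoid(w * t + b)

def blend_features(struct_feat, edit_feat, t: float):
    """Convex combination of two equal-length feature vectors.

    struct_feat / edit_feat: lists of floats standing in for the
    outputs of two guidance attention streams.
    """
    g = time_adaptive_gate(t)
    return [(1.0 - g) * s + g * e for s, e in zip(struct_feat, edit_feat)]
```

For example, at t = 0 the gate is sigmoid(-3) ≈ 0.05, so the blended output stays close to the structural features; at t = 1 it is sigmoid(3) ≈ 0.95, so the edit features dominate. The design point this sketch illustrates is that a smooth, timestep-conditioned scalar (rather than a hard binary mask) lets the model disentangle editable regions from preserved structure without any auxiliary 3D mask input.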
Why it matters?
This work is important because it makes 3D editing much more practical and accessible. By automating much of the process and improving the quality of edits, it could significantly speed up the creation of 3D content for a wide range of applications, from entertainment to design and beyond. The new dataset and model set a new benchmark for how well 3D editing can be done.
Abstract
3D editing, the task of locally modifying the geometry or appearance of a 3D asset, has wide applications in immersive content creation, digital entertainment, and AR/VR. However, unlike 2D editing, it remains challenging due to the need for cross-view consistency, structural fidelity, and fine-grained controllability. Existing approaches are often slow, prone to geometric distortions, or dependent on manual and accurate 3D masks that are error-prone and impractical. To address these challenges, we advance both the data and model fronts. On the data side, we introduce 3DEditVerse, the largest paired 3D editing benchmark to date, comprising 116,309 high-quality training pairs and 1,500 curated test pairs. Built through complementary pipelines of pose-driven geometric edits and foundation model-guided appearance edits, 3DEditVerse ensures edit locality, multi-view consistency, and semantic alignment. On the model side, we propose 3DEditFormer, a 3D-structure-preserving conditional transformer. By enhancing image-to-3D generation with dual-guidance attention and time-adaptive gating, 3DEditFormer disentangles editable regions from preserved structure, enabling precise and consistent edits without requiring auxiliary 3D masks. Extensive experiments demonstrate that our framework outperforms state-of-the-art baselines both quantitatively and qualitatively, establishing a new standard for practical and scalable 3D editing. Dataset and code will be released. Project: https://www.lv-lab.org/3DEditFormer/