Tinker: Diffusion's Gift to 3D--Multi-View Consistent Editing From Sparse Inputs without Per-Scene Optimization

Canyu Zhao, Xiaoman Li, Tianjian Feng, Zhiyue Zhao, Hao Chen, Chunhua Shen

2025-08-21

Summary

This paper introduces Tinker, a new way to edit 3D scenes that works well even from just one or two pictures and doesn't require heavy computation for each new scene. It repurposes powerful image-generating AI models, called diffusion models, to understand 3D, and the authors have also created a large collection of data to help train such systems.

What's the problem?

Traditionally, editing 3D objects from pictures is hard because you need many pictures and heavy processing to make sure the edits look right from every angle. The process is time-consuming, and the adjustments must be redone for each individual 3D scene.

What's the solution?

Tinker solves this by using pre-trained AI models that already know a lot about images, tapping into their hidden understanding of 3D. It has two main parts: one that makes precise edits based on a reference image and keeps them consistent across all viewpoints, and another that can build a full 3D scene or generate new views from just a few input images, using AI that's good with videos to fill in the gaps.
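The two-stage flow described above can be sketched in code. This is a minimal, purely illustrative mock-up: every function and data structure here is a hypothetical stand-in (the paper does not publish this API), and the actual system runs diffusion models where the stubs below just copy data around.

```python
# Hypothetical sketch of Tinker's two-stage pipeline. All names are
# illustrative stand-ins, not the authors' actual API; the real stages
# run diffusion models rather than the toy logic stubbed here.

from dataclasses import dataclass
from typing import List

@dataclass
class View:
    camera_pose: int   # stand-in for a real camera pose
    image: str         # stand-in for pixel data

def referring_multiview_editor(views: List[View], edit: str) -> List[View]:
    """Stage 1 (stub): apply a reference-driven edit consistently to
    every input view. The real editor is a diffusion model."""
    return [View(v.camera_pose, f"{v.image}+{edit}") for v in views]

def any_view_to_video_synthesizer(edited: List[View],
                                  target_poses: List[int]) -> List[View]:
    """Stage 2 (stub): complete the scene, producing one view per
    requested camera pose. The real synthesizer uses spatial-temporal
    priors from video diffusion to generate the missing views."""
    known = {v.camera_pose: v for v in edited}
    out = []
    for pose in target_poses:
        if pose in known:
            out.append(known[pose])
        else:
            # A video diffusion model would generate this novel view;
            # here we simply copy the nearest edited view.
            nearest = min(edited, key=lambda v: abs(v.camera_pose - pose))
            out.append(View(pose, nearest.image))
    return out

# One-shot regime: a single input view, edited once, then expanded
# into a consistent set of novel views (no per-scene training step).
inputs = [View(camera_pose=0, image="scene")]
edited = referring_multiview_editor(inputs, edit="make it snowy")
full_scene = any_view_to_video_synthesizer(edited, target_poses=list(range(5)))
print(len(full_scene))  # 5 views from one edited input
```

The key design point the sketch mirrors is that consistency is enforced in stage 1 (all edited views agree) before stage 2 densifies the scene, so no per-scene optimization loop is needed.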

Why it matters?

Tinker makes creating and editing 3D content much easier and faster for everyone, especially for people who don't have super powerful computers or lots of time for complex setups. It's a big step towards making 3D editing as simple as editing a regular photo, opening up new possibilities for games, virtual reality, and other creative applications.

Abstract

We introduce Tinker, a versatile framework for high-fidelity 3D editing that operates in both one-shot and few-shot regimes without any per-scene finetuning. Unlike prior techniques that demand extensive per-scene optimization to ensure multi-view consistency or to produce dozens of consistent edited input views, Tinker delivers robust, multi-view consistent edits from as few as one or two images. This capability stems from repurposing pretrained diffusion models, which unlocks their latent 3D awareness. To drive research in this space, we curate the first large-scale multi-view editing dataset and data pipeline, spanning diverse scenes and styles. Building on this dataset, we develop our framework capable of generating multi-view consistent edited views without per-scene training, which consists of two novel components: (1) Referring multi-view editor: Enables precise, reference-driven edits that remain coherent across all viewpoints. (2) Any-view-to-video synthesizer: Leverages spatial-temporal priors from video diffusion to perform high-quality scene completion and novel-view generation even from sparse inputs. Through extensive experiments, Tinker significantly reduces the barrier to generalizable 3D content creation, achieving state-of-the-art performance on editing, novel-view synthesis, and rendering enhancement tasks. We believe that Tinker represents a key step towards truly scalable, zero-shot 3D editing. Project webpage: https://aim-uofa.github.io/Tinker