AnyRecon: Arbitrary-View 3D Reconstruction with Video Diffusion Model
Yutian Chen, Shi Guo, Renbiao Jin, Tianshuo Yang, Xin Cai, Yawen Luo, Mingxin Yang, Mulin Yu, Linning Xu, Tianfan Xue
2026-04-22
Summary
This paper introduces a new method, AnyRecon, for creating 3D models from a small number of photos or views of an object or scene. It focuses on improving the quality and scalability of 3D reconstruction when you don't have a lot of input data.
What's the problem?
Reconstructing 3D models from just a few pictures is really hard, especially if those pictures are taken from random angles. Existing methods using artificial intelligence often only look at one or two pictures at a time, which can lead to inconsistencies in the final model and doesn't work well for large or complex scenes. They struggle to 'remember' the overall shape when dealing with many different viewpoints.
What's the solution?
AnyRecon tackles this by building a kind of 'memory' of the scene from all the available images. It skips the temporal compression that video models usually apply, so it can keep frame-level detail even when the viewpoints are very different. It also couples the process of generating new views with the process of actually building the 3D model, using an explicit 3D 'map' to guide both. Finally, it uses a few efficiency tricks (fast distilled diffusion and sparse attention) so the approach stays practical even with many images.
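To make the "3D map guides which pictures to consult" idea concrete, here is a minimal sketch of geometry-driven view retrieval. It simply ranks capture views by how close their camera centers are to the target viewpoint; the paper's actual retrieval uses the reconstructed 3D geometric memory, so the criterion and the function name here are assumptions for illustration.

```python
import math

def retrieve_capture_views(target_center, capture_centers, k=2):
    """Return indices of the k capture views whose camera centers are
    nearest (Euclidean distance) to the target view's camera center.
    Hypothetical stand-in for AnyRecon's geometry-driven retrieval."""
    ranked = sorted(
        (math.dist(target_center, c), i)
        for i, c in enumerate(capture_centers)
    )
    return [i for _, i in ranked[:k]]

# Toy example: three capture cameras along a line, target near the origin.
centers = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
print(retrieve_capture_views((0.2, 0.0, 0.0), centers, k=2))  # [0, 2]
```

The point of retrieving a small, geometry-relevant subset is that the generator conditions on the most informative capture views instead of all of them, which is what lets the method scale to long trajectories.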
Why it matters?
This work is important because it allows us to create detailed 3D models from everyday photos taken with a phone or camera, without needing special equipment or a lot of effort. This has applications in areas like virtual reality, augmented reality, and creating 3D content for games or movies, and it makes 3D reconstruction more accessible for larger and more complex environments.
Abstract
Sparse-view 3D reconstruction is essential for modeling scenes from casual captures, but remains challenging for non-generative reconstruction methods. Existing diffusion-based approaches mitigate this issue by synthesizing novel views, but they often condition on only one or two capture frames, which restricts geometric consistency and limits scalability to large or diverse scenes. We propose AnyRecon, a scalable framework for reconstruction from arbitrary and unordered sparse inputs that preserves explicit geometric control while supporting flexible conditioning cardinality. To support long-range conditioning, our method constructs a persistent global scene memory via a prepended capture-view cache, and removes temporal compression to maintain frame-level correspondence under large viewpoint changes. Beyond a better generative model, we also find that the interplay between generation and reconstruction is crucial for large-scale 3D scenes. Thus, we introduce a geometry-aware conditioning strategy that couples generation and reconstruction through an explicit 3D geometric memory and geometry-driven capture-view retrieval. To ensure efficiency, we combine 4-step diffusion distillation with context-window sparse attention to reduce the quadratic attention cost. Extensive experiments demonstrate robust and scalable reconstruction across irregular inputs, large viewpoint gaps, and long trajectories.
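The combination of a prepended capture-view cache with context-window sparse attention can be illustrated with a toy attention mask. In this sketch the token sequence is laid out as [cache frames | generated frames]: every query can attend to the cache, while generated frames only attend to generated frames within a local window. The frame-level layout and the `window` parameter are assumptions for illustration, not the paper's exact token scheme.

```python
def sparse_attention_mask(num_cache, num_gen, window):
    """Boolean attention mask over a sequence of
    [num_cache capture-view cache frames | num_gen generated frames].
    mask[q][kv] is True if query frame q may attend to key/value frame kv.
    Illustrative sketch of context-window sparse attention with a
    globally visible cache."""
    n = num_cache + num_gen
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for kv in range(n):
            if kv < num_cache:
                # The capture-view cache is visible to every query,
                # giving long-range conditioning at O(n * num_cache) cost.
                mask[q][kv] = True
            elif q >= num_cache and abs(q - kv) <= window:
                # Generated frames attend only within a local window,
                # avoiding the full quadratic frame-to-frame attention.
                mask[q][kv] = True
    return mask

m = sparse_attention_mask(num_cache=2, num_gen=4, window=1)
print(m[5][0])  # True: every frame sees the cache
print(m[2][5])  # False: outside the local window
```

With a fixed cache size and window, the number of attended pairs grows linearly in the number of generated frames rather than quadratically, which is the efficiency argument the abstract makes.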