Sharp Monocular View Synthesis in Less Than a Second
Lars Mescheder, Wei Dong, Shiwei Li, Xuyang Bai, Marcel Santos, Peiyun Hu, Bruno Lecouat, Mingmin Zhen, Amaël Delaunoy, Tian Fang, Yanghai Tsin, Stephan R. Richter, Vladlen Koltun
2025-12-15
Summary
This paper introduces SHARP, a method for synthesizing photorealistic novel views of a scene from just a single photograph.
What's the problem?
Traditionally, generating different viewpoints of a 3D scene required multiple images or complex 3D modeling. Synthesizing convincing new views from a single photo has long been a challenge: prior methods often produce blurry or unrealistic images and take a long time to process.
What's the solution?
SHARP uses a neural network to quickly estimate the 3D structure of the scene from the single input image. It represents this structure using '3D Gaussians': simple geometric primitives that together describe the shape and appearance of the scene. The network regresses the parameters of these Gaussians in a single feedforward pass, taking under a second on a standard graphics card. Once it has this 3D representation, new views of the scene can be rendered in real time.
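To make the idea concrete, here is a minimal sketch of the kind of per-Gaussian parameter set such a representation contains. The function name, field names, and the toy "one Gaussian per pixel" layout are illustrative assumptions, not SHARP's actual interface; a real network would predict these values rather than fill them with placeholders.

```python
import numpy as np

def regress_gaussians(image: np.ndarray) -> dict:
    """Toy stand-in for the feedforward pass: one Gaussian per pixel.

    A real model regresses these values from image features; here we
    just fill them with placeholders to show the parameterization.
    """
    h, w, _ = image.shape
    n = h * w
    rng = np.random.default_rng(0)
    return {
        "means": rng.normal(size=(n, 3)),              # 3D positions (metric scale)
        "scales": np.full((n, 3), 0.01),               # per-axis extent of each Gaussian
        "rotations": np.tile([1.0, 0.0, 0.0, 0.0], (n, 1)),  # unit quaternions
        "opacities": np.full((n, 1), 0.5),             # alpha used when compositing
        "colors": image.reshape(n, 3) / 255.0,         # RGB taken from the input pixels
    }

image = np.zeros((4, 4, 3), dtype=np.uint8)
gaussians = regress_gaussians(image)
print(gaussians["means"].shape)  # (16, 3)
```

Because every parameter comes from one forward pass, synthesis cost is a single network evaluation; rendering then amounts to splatting these Gaussians for each requested camera pose.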
Why does it matter?
SHARP is a significant step forward because it is much faster and produces higher-quality results than previous methods. It can create realistic images from new viewpoints with a level of detail previously unattainable from a single image, and it generalizes zero-shot across different types of images without needing to be trained for each one. This has potential applications in areas like virtual reality, augmented reality, and easier 3D content creation.
Abstract
We present SHARP, an approach to photorealistic view synthesis from a single image. Given a single photograph, SHARP regresses the parameters of a 3D Gaussian representation of the depicted scene. This is done in less than a second on a standard GPU via a single feedforward pass through a neural network. The 3D Gaussian representation produced by SHARP can then be rendered in real time, yielding high-resolution photorealistic images for nearby views. The representation is metric, with absolute scale, supporting metric camera movements. Experimental results demonstrate that SHARP delivers robust zero-shot generalization across datasets. It sets a new state of the art on multiple datasets, reducing LPIPS by 25-34% and DISTS by 21-43% versus the best prior model, while lowering the synthesis time by three orders of magnitude. Code and weights are provided at https://github.com/apple/ml-sharp.