Pippo: High-Resolution Multi-View Humans from a Single Image

Yash Kant, Ethan Weber, Jin Kyu Kim, Rawal Khirodkar, Su Zhaoen, Julieta Martinez, Igor Gilitschenski, Shunsuke Saito, Timur Bagautdinov

2025-02-12

Summary

This paper introduces Pippo, a new AI model that can take a single casual photo of a person and turn it into a high-quality turnaround video showing that person from many different angles, as if they were spinning in front of the camera.

What's the problem?

Creating realistic multi-view videos of people from just one photo is really hard. Most current methods need extra information, such as a fitted 3D body model or the camera parameters of the input photo. This limits how useful these tools can be in real-world situations, where we often have nothing more than a simple snapshot to work with.

What's the solution?

The researchers created Pippo, a special kind of AI called a multi-view diffusion transformer. They pre-trained it on three billion human images without captions and then fine-tuned it on high-quality studio captures. Pippo first learns to generate many views at once at low resolution, then switches to fewer views at high resolution, using pixel-aligned camera cues to make sure all the views line up in 3D space. They also devised an attention-biasing trick that lets Pippo generate far more views at inference time than it ever saw during training, making the final turnaround smoother. A rough sketch of the joint multi-view idea follows.
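To make the core idea concrete, here is a minimal, hypothetical PyTorch sketch of joint multi-view denoising: tokens from every view are flattened into one sequence so self-attention can compare views against each other, which is what keeps the generations consistent. The class and variable names (e.g., MultiViewBlock) are illustrative assumptions, not the paper's actual code.

```python
import torch
import torch.nn as nn

class MultiViewBlock(nn.Module):
    """One transformer block that attends across ALL views at once.

    Illustrative sketch: tokens from every view are flattened into a
    single sequence, so self-attention can match content between views
    and keep the generated views mutually consistent.
    """
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_views * tokens_per_view, dim)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))

# Joint denoising: 8 views, 256 tokens each, flattened into one sequence.
batch, views, tokens, dim = 2, 8, 256, 512
noisy = torch.randn(batch, views * tokens, dim)
block = MultiViewBlock(dim)
denoised = block(noisy)  # every view "sees" every other view
print(denoised.shape)    # torch.Size([2, 2048, 512])
```

Flattening all views into one attention sequence is also what makes generating extra views at inference possible in principle: the sequence simply grows, which is why the paper needs an attention-biasing correction when the token count far exceeds what was seen in training.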

Why it matters?

This matters because it could change how we create and use 3D content of people. Imagine being able to turn any photo into a realistic turnaround video: it could be used in video games, in virtual reality, or even to help with online clothes shopping. It's a big step towards making 3D content creation easier and more accessible, which could open up new possibilities in entertainment, education, and many other fields.

Abstract

We present Pippo, a generative model capable of producing 1K resolution dense turnaround videos of a person from a single casually clicked photo. Pippo is a multi-view diffusion transformer and does not require any additional inputs - e.g., a fitted parametric model or camera parameters of the input image. We pre-train Pippo on 3B human images without captions, and conduct multi-view mid-training and post-training on studio captured humans. During mid-training, to quickly absorb the studio dataset, we denoise several (up to 48) views at low-resolution, and encode target cameras coarsely using a shallow MLP. During post-training, we denoise fewer views at high-resolution and use pixel-aligned controls (e.g., spatial anchor and Plücker rays) to enable 3D consistent generations. At inference, we propose an attention biasing technique that allows Pippo to simultaneously generate greater than 5 times as many views as seen during training. Finally, we also introduce an improved metric to evaluate 3D consistency of multi-view generations, and show that Pippo outperforms existing works on multi-view human generation from a single image.
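For readers curious about the pixel-aligned controls mentioned in the abstract, Plücker rays are a standard way to encode a camera per pixel: each pixel's viewing ray with origin o and direction d is written as the 6-vector (d, o × d), which can be fed to the model as extra input channels. The following is a minimal sketch of that textbook construction for a pinhole camera, not the paper's implementation; the function name and arguments are illustrative.

```python
import torch

def plucker_rays(K: torch.Tensor, c2w: torch.Tensor, H: int, W: int) -> torch.Tensor:
    """Per-pixel Plücker ray map for a pinhole camera (illustrative sketch).

    K   : (3, 3) camera intrinsics
    c2w : (4, 4) camera-to-world pose
    Returns a (H, W, 6) tensor of (direction, origin x direction) per pixel.
    """
    # Pixel grid sampled at pixel centers.
    v, u = torch.meshgrid(
        torch.arange(H, dtype=torch.float32) + 0.5,
        torch.arange(W, dtype=torch.float32) + 0.5,
        indexing="ij",
    )
    # Unproject pixels to camera-space ray directions.
    dirs = torch.stack(
        [(u - K[0, 2]) / K[0, 0], (v - K[1, 2]) / K[1, 1], torch.ones_like(u)],
        dim=-1,
    )
    # Rotate into world space and normalize.
    dirs = dirs @ c2w[:3, :3].T
    dirs = dirs / dirs.norm(dim=-1, keepdim=True)
    origin = c2w[:3, 3].expand_as(dirs)          # same origin for every pixel
    moment = torch.cross(origin, dirs, dim=-1)   # o x d
    return torch.cat([dirs, moment], dim=-1)     # (H, W, 6)

K = torch.tensor([[500.0, 0, 128], [0, 500.0, 128], [0, 0, 1]])
rays = plucker_rays(K, torch.eye(4), H=256, W=256)
print(rays.shape)  # torch.Size([256, 256, 6])
```

Because the (d, o × d) pair is unchanged by sliding the origin along the ray, this encoding identifies the ray itself rather than a particular point on it, which is what makes it a convenient pixel-aligned conditioning signal for each target view.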