
Sapiens2

Rawal Khirodkar, He Wen, Julieta Martinez, Yuan Dong, Su Zhaoen, Shunsuke Saito

2026-04-28


Summary

This paper introduces Sapiens2, a new family of computer vision models built to understand images of people. The models come in sizes from 0.4 to 5 billion parameters and can handle very detailed, high-resolution pictures, up to 4K.

What's the problem?

Existing computer vision models often struggle to accurately understand images of people across a variety of situations. They may miss fine details, or fail to generalize to new tasks without a lot of task-specific training. The goal was to create a model that could both capture fine detail and adapt easily to different tasks.

What's the solution?

The researchers tackled this problem in a few key ways. First, they trained the model with a combination of techniques: having it reconstruct images from masked-out parts (masked image reconstruction) while also learning to produce matching features for different views of the same image (a self-distilled contrastive objective); a rough sketch of this combined objective appears below. Second, they pretrained on a curated dataset of one billion high-quality images of people and improved the quality and quantity of the labels used for downstream tasks. Finally, they updated the model's internal architecture to allow longer, more stable training, and used a technique called 'windowed attention' to let it process very large images efficiently.
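The paper's actual training code lives in the repository linked at the end; as a rough, hypothetical PyTorch sketch of what combining these two objectives can look like, assume a `student` encoder, a pixel `decoder`, and an exponential-moving-average `teacher` (all placeholder modules invented here for illustration):

```python
import torch
import torch.nn.functional as F

def patchify(images, p=16):
    """Split (B, C, H, W) images into (B, N, p*p*C) flattened patches."""
    b, c, h, w = images.shape
    patches = images.unfold(2, p, p).unfold(3, p, p)
    return patches.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * p * p)

def combined_pretraining_loss(student, decoder, teacher, images,
                              view_a, view_b, mask):
    """Hypothetical combined objective. `student`, `decoder`, and `teacher`
    are placeholder modules; `mask` is a boolean (B, N) tensor marking
    which patches are hidden from the encoder."""
    # Masked image reconstruction (MAE-style): encode only the visible
    # patches, then predict the raw pixels of the masked ones.
    latents = student(images, patch_mask=mask)
    predicted = decoder(latents, patch_mask=mask)   # (M, p*p*C) predictions
    targets = patchify(images)[mask]                # ground-truth masked patches
    recon_loss = F.mse_loss(predicted, targets)

    # Self-distilled contrastive term (DINO-style): the student matches an
    # EMA "teacher" across two augmented views of the same image. A real
    # implementation would use a projection head producing prototype logits.
    student_logits = student(view_a).mean(dim=1)      # (B, K) pooled scores
    with torch.no_grad():
        teacher_logits = teacher(view_b).mean(dim=1)  # no gradients to teacher
    distill_loss = F.cross_entropy(
        student_logits / 0.1,                         # student temperature
        F.softmax(teacher_logits / 0.04, dim=-1),     # sharpened soft targets
    )

    return recon_loss + distill_loss
```

The intuition behind the combination: the reconstruction term forces the features to keep low-level detail (useful for dense prediction), while the distillation term pushes them toward higher-level semantics (useful in zero-shot or few-label settings).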

Why it matters?

Sapiens2 represents a significant step forward in human-centric vision. It achieves better results than previous models on tasks like estimating a person's pose, segmenting body parts, and predicting surface normals. It also extends to new tasks such as pointmap and albedo estimation. In short, it helps computers 'see' and understand people in images more accurately, which matters for applications like robotics, virtual reality, and image editing.
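To make the "one backbone, many tasks" idea concrete, here is a minimal, hypothetical PyTorch sketch of how lightweight task heads might sit on a shared encoder. The class, head names, and channel counts are illustrative assumptions, not the actual Sapiens2 API (see the linked repository for the real code):

```python
import torch.nn as nn

class HumanVisionModel(nn.Module):
    """Illustrative multi-task setup: one shared encoder, one light head per task."""

    def __init__(self, backbone, feat_dim=1024):
        super().__init__()
        self.backbone = backbone  # shared pretrained transformer encoder
        self.heads = nn.ModuleDict({
            "pose":     nn.Conv2d(feat_dim, 17, kernel_size=1),  # e.g. 17 keypoint heatmaps
            "segment":  nn.Conv2d(feat_dim, 28, kernel_size=1),  # e.g. 28 body-part classes
            "normals":  nn.Conv2d(feat_dim, 3, kernel_size=1),   # per-pixel surface normals
            "pointmap": nn.Conv2d(feat_dim, 3, kernel_size=1),   # per-pixel 3D points
            "albedo":   nn.Conv2d(feat_dim, 3, kernel_size=1),   # base color
        })

    def forward(self, images, task):
        # Assumes the backbone returns a (B, feat_dim, H', W') feature map.
        features = self.backbone(images)
        return self.heads[task](features)
```

The design point this illustrates is that once the pretrained features are strong enough, each new task only needs a small decoder on top rather than a whole new model.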

Abstract

We present Sapiens2, a model family of high-resolution transformers for human-centric vision focused on generalization, versatility, and high-fidelity outputs. Our model sizes range from 0.4 to 5 billion parameters, with native 1K resolution and hierarchical variants that support 4K. Sapiens2 substantially improves over its predecessor in both pretraining and post-training. First, to learn features that capture low-level details (for dense prediction) and high-level semantics (for zero-shot or few-label settings), we combine masked image reconstruction with self-distilled contrastive objectives. Our evaluations show that this unified pretraining objective is better suited for a wider range of downstream tasks. Second, along the data axis, we pretrain on a curated dataset of 1 billion high-quality human images and improve the quality and quantity of task annotations. Third, architecturally, we incorporate advances from frontier models that enable longer training schedules with improved stability. Our 4K models adopt windowed attention to reason over longer spatial context and are pretrained with 2K output resolution. Sapiens2 sets a new state-of-the-art and improves over the first generation on pose (+4 mAP), body-part segmentation (+24.3 mIoU), normal estimation (45.6% lower angular error) and extends to new tasks such as pointmap and albedo estimation. Code: https://github.com/facebookresearch/sapiens2
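The abstract notes that the 4K models adopt windowed attention to reason over longer spatial context. As a rough illustration of the general technique (not the paper's implementation), the toy function below restricts self-attention to local windows, so the cost grows with the window size rather than the full image area:

```python
import torch
import torch.nn.functional as F

def windowed_self_attention(x, window=16):
    """Toy single-head windowed attention over a (B, H, W, D) token grid.

    Tokens attend only within non-overlapping window x window blocks, so the
    cost scales with H*W*window**2 instead of (H*W)**2, which is what makes
    very high resolutions (e.g. 4K) tractable.
    """
    b, h, w, d = x.shape
    assert h % window == 0 and w % window == 0
    # Partition the token grid into (window x window) blocks.
    blocks = (x.view(b, h // window, window, w // window, window, d)
               .permute(0, 1, 3, 2, 4, 5)
               .reshape(-1, window * window, d))
    # Scaled dot-product attention inside each block. A real block would
    # first project to queries/keys/values; this sketch skips that.
    attn = F.softmax(blocks @ blocks.transpose(1, 2) / d ** 0.5, dim=-1)
    out = attn @ blocks
    # Undo the window partition back to the original grid layout.
    return (out.view(b, h // window, w // window, window, window, d)
               .permute(0, 1, 3, 2, 4, 5)
               .reshape(b, h, w, d))

# Example: a 64x64 token grid (e.g. a 1024px image with 16px patches).
tokens = torch.randn(1, 64, 64, 256)
print(windowed_self_attention(tokens).shape)  # torch.Size([1, 64, 64, 256])
```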