Sapiens: Foundation for Human Vision Models
Rawal Khirodkar, Timur Bagautdinov, Julieta Martinez, Su Zhaoen, Austin James, Peter Selednik, Stuart Anderson, Shunsuke Saito
2024-08-23

Summary
This paper presents Sapiens, a family of models for four fundamental human-centric vision tasks: 2D pose estimation, body-part segmentation, depth estimation, and surface normal prediction.
What's the problem?
Building models for human-centric vision tasks is difficult because they typically require large amounts of labeled data and struggle to generalize to new situations. In particular, traditional models often perform poorly on in-the-wild images that differ from the data they were trained on.
What's the solution?
The authors developed Sapiens, a family of models pretrained on over 300 million in-the-wild images of humans. Each model can be adapted to an individual task through simple fine-tuning. The models natively support high-resolution (1K) inference, and the authors found that, for the same compute budget, self-supervised pretraining on a curated human dataset significantly improves performance even when labeled data is scarce or entirely synthetic.
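The pretrain-then-fine-tune recipe described above can be illustrated with a toy sketch: a frozen "pretrained" encoder provides features, and only a lightweight task head is fit for the downstream task. This is not Sapiens' actual code (the real encoder is a vision transformer trained on ~300M images); the random linear encoder and least-squares head below are hypothetical stand-ins for the concept.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a pretrained encoder: a fixed (frozen) projection.
# A real encoder is a deep nonlinear network; linear here keeps the
# example minimal and deterministic.
D_IN, D_FEAT = 32, 64
W_enc = rng.normal(size=(D_IN, D_FEAT)) / np.sqrt(D_IN)

def encode(x: np.ndarray) -> np.ndarray:
    """Map raw inputs to features; encoder weights stay frozen."""
    return x @ W_enc

# Hypothetical downstream "task": regress 3 target values per sample.
N, D_OUT = 200, 3
X = rng.normal(size=(N, D_IN))
W_true = rng.normal(size=(D_IN, D_OUT))
Y = X @ W_true

# "Fine-tune" only the task head on top of frozen features, solved in
# closed form via least squares rather than gradient descent.
F = encode(X)
W_head, *_ = np.linalg.lstsq(F, Y, rcond=None)

mse = float(np.mean((F @ W_head - Y) ** 2))
print(f"train MSE of task head on frozen features: {mse:.2e}")
```

Because the frozen encoder preserves the input information (full row rank), the cheap task head alone fits this toy target almost exactly, which is the intuition behind reusing one pretrained backbone for many tasks.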
Why it matters?
This research is important because it enhances the ability of AI systems to understand human behavior and interactions in various contexts. By improving how these models work, we can apply them in areas like healthcare, sports analysis, and robotics, leading to better tools for understanding and assisting human activities.
Abstract
We present Sapiens, a family of models for four fundamental human-centric vision tasks - 2D pose estimation, body-part segmentation, depth estimation, and surface normal prediction. Our models natively support 1K high-resolution inference and are extremely easy to adapt for individual tasks by simply fine-tuning models pretrained on over 300 million in-the-wild human images. We observe that, given the same computational budget, self-supervised pretraining on a curated dataset of human images significantly boosts the performance for a diverse set of human-centric tasks. The resulting models exhibit remarkable generalization to in-the-wild data, even when labeled data is scarce or entirely synthetic. Our simple model design also brings scalability - model performance across tasks improves as we scale the number of parameters from 0.3 to 2 billion. Sapiens consistently surpasses existing baselines across various human-centric benchmarks. We achieve significant improvements over the prior state-of-the-art on Humans-5K (pose) by 7.6 mAP, Humans-2K (part-seg) by 17.1 mIoU, Hi4D (depth) by 22.4% relative RMSE, and THuman2 (normal) by 53.5% relative angular error.