AiOS (All-in-One-Stage) is a novel approach to 3D whole-body human mesh recovery that aims to address limitations of existing two-stage methods. Developed by researchers from institutions including SenseTime Research, City University of Hong Kong, and Nanyang Technological University, AiOS performs human pose and shape estimation in a single stage, without requiring a separate human detection step.
The key innovation of AiOS is its all-in-one-stage design that processes the full image frame end-to-end. This is in contrast to previous top-down approaches that first detect and crop individual humans before estimating pose and shape. By operating on the full image, AiOS preserves important contextual information and inter-person relationships that can be lost when cropping.
AiOS is built on the DETR (DEtection TRansformer) architecture and frames multi-person whole-body mesh recovery as a progressive set prediction problem. It uses a series of transformer decoder stages to localize humans and estimate their pose and shape parameters in a coarse-to-fine manner.
The first stage uses "human tokens" to identify coarse human locations and encode global features for each person. Subsequent stages refine these initial estimates, using "joint tokens" to extract more fine-grained local features around body parts. This progressive refinement allows AiOS to handle challenging cases like occlusions.
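The coarse-to-fine decoding idea can be illustrated with a toy sketch. The function below is hypothetical and much simpler than the actual AiOS decoder: it scores every feature-map location (standing in for "human tokens"), keeps the top-k peaks as coarse person locations, and then refines each estimate over small local windows (standing in for "joint tokens"):

```python
import numpy as np

def progressive_decode(features, num_humans=2, num_stages=3):
    """Toy coarse-to-fine set prediction (illustrative, not the AiOS code).

    features: (H, W, C) image feature map.
    Stage 1 picks the top-k saliency peaks as coarse human locations;
    each later stage refines a location using only a 3x3 local window.
    """
    H, W, _ = features.shape
    saliency = features.mean(axis=-1)                      # coarse score map
    top = np.argsort(saliency.ravel())[::-1][:num_humans]  # strongest peaks
    centers = np.array([np.unravel_index(i, (H, W)) for i in top], dtype=float)

    for _ in range(num_stages - 1):                        # progressive refinement
        for k, (y, x) in enumerate(centers):
            y0, y1 = max(int(y) - 1, 0), min(int(y) + 2, H)
            x0, x1 = max(int(x) - 1, 0), min(int(x) + 2, W)
            patch = saliency[y0:y1, x0:x1]
            w = np.exp(patch - patch.max())                # softmax weights
            w /= w.sum()
            ys, xs = np.mgrid[y0:y1, x0:x1]
            centers[k] = [(w * ys).sum(), (w * xs).sum()]  # weighted centroid
    return centers
```

The point of the sketch is the control flow: detection and refinement happen inside one forward pass over the full frame, rather than as two separate models.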
By estimating pose and shape for the full body, hands, and face in a unified framework, AiOS is able to capture expressive whole-body poses. It outputs parameters for the SMPL-X parametric human body model, providing a detailed 3D mesh representation of each person.
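For context, SMPL-X describes each person with a small set of parameter groups. The container class below is illustrative (it is not part of AiOS or the official SMPL-X library), but the group names and dimensionalities follow the standard SMPL-X layout; eye poses and camera translation are omitted for brevity:

```python
from dataclasses import dataclass, field
import numpy as np

def _zeros(n):
    return field(default_factory=lambda: np.zeros(n))

@dataclass
class SMPLXParams:
    """Per-person SMPL-X parameter groups (illustrative container)."""
    global_orient: np.ndarray = _zeros(3)        # root rotation, axis-angle
    body_pose: np.ndarray = _zeros(21 * 3)       # 21 body joints, axis-angle
    left_hand_pose: np.ndarray = _zeros(15 * 3)  # 15 joints per hand
    right_hand_pose: np.ndarray = _zeros(15 * 3)
    jaw_pose: np.ndarray = _zeros(3)
    betas: np.ndarray = _zeros(10)               # body shape coefficients
    expression: np.ndarray = _zeros(10)          # facial expression coefficients

    def total_dims(self) -> int:
        return sum(v.size for v in vars(self).values())
```

Regressing this compact vector (179 dimensions here) is what lets a single model drive a full 3D mesh covering body, hands, and face.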
The researchers evaluated AiOS on several benchmark datasets for 3D human pose and shape estimation. Compared to previous state-of-the-art methods, AiOS achieved significant improvements, including a 9% reduction in normalized mesh vertex error (NMVE) on the AGORA dataset and a 30% reduction in per-vertex error (PVE) on EHF.
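The two metrics can be sketched as follows. This is a simplified version: PVE here aligns meshes on their centroids, whereas benchmarks typically align on the pelvis joint; NMVE follows AGORA's convention of dividing mesh error by detection F1 so that missed or spurious detections also penalize the score:

```python
import numpy as np

def per_vertex_error(pred, gt, align_root=True):
    """Mean per-vertex Euclidean error (PVE), in the meshes' units.

    pred, gt: (V, 3) vertex arrays for the same mesh topology.
    Alignment here uses the centroid as a simplification; standard
    protocols align on the pelvis joint instead.
    """
    if align_root:
        pred = pred - pred.mean(axis=0)
        gt = gt - gt.mean(axis=0)
    return np.linalg.norm(pred - gt, axis=1).mean()

def normalized_mesh_vertex_error(mve, f1):
    """AGORA-style NMVE: mesh vertex error divided by detection F1."""
    return mve / f1
```

Because NMVE couples mesh accuracy with detection quality, improving it requires a method to both find people reliably in the full frame and reconstruct them accurately, which is exactly the trade-off the single-stage design targets.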
Key features of AiOS include:
- Single-stage, end-to-end architecture for multi-person pose and shape estimation
- Operates on full image frames without requiring separate human detection
- Progressive refinement using transformer decoder stages
- Unified estimation of body, hand, and face pose/shape
- Outputs SMPL-X body model parameters
- State-of-the-art performance on multiple 3D human pose datasets
- Effective for challenging scenarios like occlusions and crowded scenes
- Built on DETR transformer architecture