FastHMR: Accelerating Human Mesh Recovery via Token and Layer Merging with Diffusion Decoding

Soroush Mehraban, Andrea Iaboni, Babak Taati

2025-10-14

Summary

This paper presents a way to make 3D human pose estimation from images faster and more efficient without losing accuracy.

What's the problem?

Current methods for estimating the 3D pose of a person from an image rely on large models called transformers, which are accurate but demand a lot of computing power and time. These models also spend effort processing parts of the image that carry little useful information, such as the background, which slows them down even further.

What's the solution?

The researchers came up with two techniques to simplify these models. First, they selectively merge transformer layers whose removal does not significantly worsen the pose estimation error. Second, they merge image regions that contain little useful information about the person's pose, such as the background, so the model processes fewer tokens. To keep the pose estimation accurate after simplifying the model, they also added a diffusion-based decoder that takes into account how poses change over time and draws on knowledge of typical human movements learned from large motion capture datasets.
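The layer-merging idea can be sketched as a simple greedy loop: drop a layer only if the measured pose error stays within a budget. This is an illustrative sketch, not the paper's implementation; `evaluate_mpjpe` and the error budget are hypothetical stand-ins for the paper's MPJPE-constrained selection.

```python
def error_constrained_layer_merge(layers, evaluate_mpjpe, max_error_increase=1.0):
    """Greedily remove layers whose absence barely raises the error (e.g. MPJPE in mm).

    layers: a list of model layers (any representation the evaluator accepts).
    evaluate_mpjpe: callable that scores a candidate list of layers on a
        validation set (hypothetical; stands in for the paper's MPJPE check).
    max_error_increase: allowed error increase over the full-model baseline.
    """
    baseline = evaluate_mpjpe(layers)
    kept = list(layers)
    i = 0
    while i < len(kept):
        candidate = kept[:i] + kept[i + 1:]
        if evaluate_mpjpe(candidate) - baseline <= max_error_increase:
            kept = candidate  # this layer contributes little; remove it
        else:
            i += 1  # removing this layer hurts too much; keep it
    return kept


# Toy demonstration: pretend each layer's "importance" is a number and
# error grows by that amount when the layer is removed.
if __name__ == "__main__":
    layers = [5.0, 0.1, 3.0, 0.05]
    evaluate = lambda ls: 100.0 - sum(ls)  # toy error model
    print(error_constrained_layer_merge(layers, evaluate))  # keeps the important layers
```

The greedy formulation is one plausible reading of "merge layers with minimal impact on MPJPE"; the actual ECLM procedure merges rather than deletes layers and may search differently.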

Why it matters?

This work is important because it brings 3D human pose estimation closer to real-time use, which matters for applications like virtual reality, augmented reality, and motion capture. By making the process up to 2.3x faster without sacrificing accuracy, it opens up possibilities for running these technologies in more settings and on less powerful hardware.

Abstract

Recent transformer-based models for 3D Human Mesh Recovery (HMR) have achieved strong performance but often suffer from high computational cost and complexity due to deep transformer architectures and redundant tokens. In this paper, we introduce two HMR-specific merging strategies: Error-Constrained Layer Merging (ECLM) and Mask-guided Token Merging (Mask-ToMe). ECLM selectively merges transformer layers that have minimal impact on the Mean Per Joint Position Error (MPJPE), while Mask-ToMe focuses on merging background tokens that contribute little to the final prediction. To further address the potential performance drop caused by merging, we propose a diffusion-based decoder that incorporates temporal context and leverages pose priors learned from large-scale motion capture datasets. Experiments across multiple benchmarks demonstrate that our method achieves up to 2.3x speed-up while slightly improving performance over the baseline.
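The Mask-ToMe idea of collapsing uninformative background tokens can be illustrated with a minimal sketch: given per-token features and a person mask, keep the foreground tokens and merge all background tokens into a single averaged token. The function name and the averaging rule are assumptions for illustration, not the paper's exact merging scheme.

```python
import numpy as np

def mask_guided_token_merge(tokens, person_mask):
    """Keep foreground tokens; merge background tokens into one averaged token.

    tokens: (N, D) array of token features.
    person_mask: (N,) boolean array, True where the token lies on the person.
    Returns an array with all foreground tokens plus (at most) one merged
    background token. Averaging is an illustrative choice, not the paper's rule.
    """
    fg = tokens[person_mask]
    bg = tokens[~person_mask]
    if bg.shape[0] == 0:
        return fg
    merged_bg = bg.mean(axis=0, keepdims=True)
    return np.concatenate([fg, merged_bg], axis=0)


if __name__ == "__main__":
    tokens = np.arange(8, dtype=float).reshape(4, 2)
    mask = np.array([True, False, True, False])
    print(mask_guided_token_merge(tokens, mask))  # 2 foreground rows + 1 merged row
```

In practice a person-segmentation mask would supply `person_mask`, and the reduced token count is what yields the reported speed-up.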