UniPose: A Unified Multimodal Framework for Human Pose Comprehension, Generation and Editing
Yiheng Li, Ruibing Hou, Hong Chang, Shiguang Shan, Xilin Chen
2024-11-28

Summary
This paper introduces UniPose, a framework that uses Large Language Models (LLMs) to comprehend, generate, and edit human poses across multiple modalities, including images, text, and 3D SMPL poses.
What's the problem?
Most existing methods for working with human poses support only a single control modality (such as images alone or text alone) and handle each pose task in isolation. This limits their usefulness in real-world applications, where different kinds of information must be combined to create or modify poses.
What's the solution?
UniPose unifies these modalities by using a pose tokenizer that converts 3D SMPL poses into discrete pose tokens, so poses and text share a single vocabulary inside the LLM (see the sketch below). It also pairs a general visual encoder with a pose-specific one to sharpen fine-grained pose perception. This unified design lets the framework transfer knowledge across pose-related tasks and adapt to unseen ones, making it easier to work with human poses across different contexts.
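
To make the tokenization step concrete, below is a minimal sketch of a VQ-style pose tokenizer in PyTorch. The class name, codebook size, and all dimensions are illustrative assumptions, not the paper's actual configuration.

    # Minimal sketch of a VQ-style pose tokenizer; the architecture,
    # codebook size, and dimensions are assumptions for illustration,
    # not UniPose's published configuration.
    import torch
    import torch.nn as nn

    class PoseTokenizer(nn.Module):
        def __init__(self, pose_dim=144, num_tokens=8,
                     codebook_size=512, latent_dim=256):
            super().__init__()
            # Encoder maps a continuous SMPL pose vector to a short
            # sequence of latent vectors.
            self.encoder = nn.Sequential(
                nn.Linear(pose_dim, 512), nn.ReLU(),
                nn.Linear(512, num_tokens * latent_dim),
            )
            # Learned codebook: each row is one discrete pose token.
            self.codebook = nn.Embedding(codebook_size, latent_dim)
            self.num_tokens = num_tokens
            self.latent_dim = latent_dim

        def forward(self, pose):
            # pose: (batch, pose_dim), e.g. flattened SMPL rotations.
            z = self.encoder(pose)
            z = z.view(-1, self.num_tokens, self.latent_dim)
            # Nearest-neighbor quantization against the codebook.
            codes = self.codebook.weight.expand(z.size(0), -1, -1)
            dists = torch.cdist(z, codes)
            return dists.argmin(dim=-1)  # (batch, num_tokens) token IDs

In a pipeline like this, the resulting IDs would be offset past the text vocabulary (for example, appended as new tokens <pose_0> through <pose_511>), so the LLM can read and emit poses exactly as it does words.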
Why it matters?
This research is a step toward general-purpose understanding and manipulation of human poses. By integrating different data types into one framework, UniPose can help improve applications in animation, virtual reality, gaming, and other fields where understanding human movement is crucial.
Abstract
Human pose plays a crucial role in the digital age. While recent works have achieved impressive progress in understanding and generating human poses, they often support only a single modality of control signals and operate in isolation, limiting their application in real-world scenarios. This paper presents UniPose, a framework employing Large Language Models (LLMs) to comprehend, generate, and edit human poses across various modalities, including images, text, and 3D SMPL poses. Specifically, we apply a pose tokenizer to convert 3D poses into discrete pose tokens, enabling seamless integration into the LLM within a unified vocabulary. To further enhance fine-grained pose perception, we equip UniPose with a mixture of visual encoders, among them a pose-specific visual encoder. Benefiting from a unified learning strategy, UniPose effectively transfers knowledge across different pose-relevant tasks, adapts to unseen tasks, and exhibits extended capabilities. This work serves as the first attempt at building a general-purpose framework for pose comprehension, generation, and editing. Extensive experiments highlight UniPose's competitive and even superior performance across various pose-relevant tasks.
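
The abstract's "mixture of visual encoders" can be pictured roughly as two encoders running in parallel, with their features fused and projected into the LLM's embedding space. The sketch below is a hypothetical illustration; the encoder choices, feature dimensions, and concatenation-based fusion rule are assumptions, not the paper's exact design.

    # Hypothetical fusion of a general visual encoder with a
    # pose-specific one; encoder choices, dimensions, and the
    # concatenation-based fusion are assumptions for illustration.
    import torch
    import torch.nn as nn

    class MixedVisualEncoder(nn.Module):
        def __init__(self, general_encoder, pose_encoder,
                     general_dim=768, pose_dim=512, llm_dim=4096):
            super().__init__()
            self.general_encoder = general_encoder  # e.g. a CLIP-style ViT
            self.pose_encoder = pose_encoder        # e.g. a pose backbone
            # Project fused features into the LLM embedding space.
            self.proj = nn.Linear(general_dim + pose_dim, llm_dim)

        def forward(self, image):
            # Each encoder returns per-patch features of shape
            # (batch, patches, dim).
            f_general = self.general_encoder(image)
            f_pose = self.pose_encoder(image)
            fused = torch.cat([f_general, f_pose], dim=-1)
            return self.proj(fused)  # (batch, patches, llm_dim)

The design intuition is that a general-purpose encoder captures scene-level context while a pose-specific encoder preserves fine-grained body-keypoint detail, and projecting their combined features yields visual tokens the LLM can attend to alongside text and pose tokens.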