
GeoMotionGPT: Geometry-Aligned Motion Understanding with Large Language Models

Zhankai Ye, Bofan Li, Yukai Jin, Shuoqiu Li, Wei Wang, Yanfu Zhang, Shangqian Gao, Xin Liu

2026-01-14

Summary

This paper explores how to better use Large Language Models (LLMs) to understand and reason about human motion data, such as motion-capture sequences of people moving. It focuses on improving how motion is represented so that LLMs can reason about it more effectively.

What's the problem?

Current systems that use LLMs for motion understanding treat the step that breaks motion into discrete parts (quantization) and the step that gives those parts meaning (semantic embedding) as separate processes, linked only by token IDs. Because of this, the LLM never learns the geometric relationships *within* the motion itself – how different movements relate to one another – which limits its ability to understand complex motions and reason about them accurately.

What's the solution?

The researchers propose a framework that grounds both the motion representation and the LLM's understanding of it in the same underlying geometric structure. They enforce that both the motion 'codebook' (the set of learned motion parts) and the LLM's embedding space are organized in an 'orthogonal' way, meaning the parts are mutually independent directions. Concretely, they use a decoder-only quantizer trained with Gumbel-Softmax so the discretization step is differentiable, plus a 'sparse projection' that maps motion codes into the LLM's embedding space while preserving this orthogonality. Finally, a two-stage regularization schedule keeps the geometry aligned during training without preventing the LLM from learning the actual meaning of the motions.
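The 'orthogonal' organization described above amounts to keeping the code vectors mutually perpendicular unit vectors. A minimal numpy sketch of a soft orthonormality penalty of the kind such a scheme might use (this is an illustrative loss, not the paper's exact regularizer or schedule):

```python
import numpy as np

def orthonormal_penalty(C):
    """Soft orthonormality regularizer ||C C^T - I||_F^2 over codebook rows.

    Driving this penalty toward zero pushes the K code vectors to be
    mutually orthogonal unit vectors, so their pairwise relationships
    form an independent, well-structured basis.
    """
    K = C.shape[0]
    gram = C @ C.T                       # pairwise inner products of codes
    return float(np.sum((gram - np.eye(K)) ** 2))

# An orthonormal basis incurs zero penalty; a random codebook does not.
identity_codes = np.eye(3)
rng = np.random.default_rng(0)
random_codes = rng.standard_normal((3, 3))
print(orthonormal_penalty(identity_codes))        # 0.0
print(orthonormal_penalty(random_codes) > 0.0)    # True
```

Because the constraint is a differentiable penalty rather than a hard projection, it can be weighted and scheduled during training, which matches the paper's idea of "soft constraints" that do not block semantic adaptation.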

Why it matters?

This work matters because it substantially improves LLM performance on motion understanding tasks, reporting a 20% improvement over state-of-the-art methods on the HumanML3D benchmark. By creating a unified geometric basis, the LLM can better capture the nuances of human movement, opening up possibilities for more realistic and intelligent motion-based applications such as animation, robotics, and human-computer interaction.

Abstract

Discrete motion tokenization has recently enabled Large Language Models (LLMs) to serve as versatile backbones for motion understanding and motion-language reasoning. However, existing pipelines typically decouple motion quantization from semantic embedding learning, linking them solely via token IDs. This approach fails to effectively align the intrinsic geometry of the motion space with the embedding space, thereby hindering the LLM's capacity for nuanced motion reasoning. We argue that alignment is most effective when both modalities share a unified geometric basis. Therefore, instead of forcing the LLM to reconstruct the complex geometry among motion tokens from scratch, we present a novel framework that explicitly enforces orthogonality on both the motion codebook and the LLM embedding space, ensuring that their relational structures naturally mirror each other. Specifically, we employ a decoder-only quantizer with Gumbel-Softmax for differentiable training and balanced codebook usage. To bridge the modalities, we use a sparse projection that maps motion codes into the LLM embedding space while preserving orthogonality. Finally, a two-stage orthonormal regularization schedule enforces soft constraints during tokenizer training and LLM fine-tuning to maintain geometric alignment without hindering semantic adaptation. Extensive experiments on HumanML3D demonstrate that our framework achieves a 20% performance improvement over current state-of-the-art methods, validating that a unified geometric basis effectively empowers the LLM for nuanced motion reasoning.
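The Gumbel-Softmax trick mentioned in the abstract replaces a hard, non-differentiable codebook lookup with a "soft one-hot" weighting over all codes, so gradients can flow through the quantizer and code usage stays balanced. A toy numpy sketch of the standard trick (the paper's actual decoder-only quantizer is not reproduced here; the codebook sizes and temperature are illustrative):

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, rng=None):
    """Differentiable 'soft one-hot' sample over K codebook entries.

    logits: (K,) similarity scores between a feature and the K codes.
    tau: temperature; lower tau -> closer to a hard one-hot choice.
    """
    rng = rng or np.random.default_rng(0)
    # Gumbel(0, 1) noise via the inverse-CDF trick.
    gumbel = -np.log(-np.log(rng.uniform(size=logits.shape) + 1e-10) + 1e-10)
    y = (logits + gumbel) / tau
    y = np.exp(y - y.max())              # numerically stable softmax
    return y / y.sum()

# Toy codebook of K=4 codes in d=3 dimensions; quantize one feature vector.
rng = np.random.default_rng(42)
codebook = rng.standard_normal((4, 3))
feature = rng.standard_normal(3)
logits = codebook @ feature              # similarity to each code
weights = gumbel_softmax(logits, tau=0.5, rng=rng)
quantized = weights @ codebook           # soft, differentiable code lookup
print(weights, weights.sum())            # a probability distribution over codes
```

As tau is annealed toward zero the weights approach a hard one-hot selection, recovering discrete tokens at inference time while keeping training end-to-end differentiable.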