HY-Embodied-0.5: Embodied Foundation Models for Real-World Agents

Tencent Robotics X, HY Vision Team, Xumin Yu, Zuyan Liu, Ziyi Wang, He Zhang, Yongming Rao, Fangfu Liu, Yani Zhang, Ruowen Zhao, Oran Wang, Yves Liang, Haitao Lin, Minghui Wang, Yubo Dong, Kevin Cheng, Bolin Ni, Rui Huang, Han Hu, Zhengyou Zhang, Linus, Shunyu Yao

2026-04-10

Summary

This paper introduces HY-Embodied-0.5, a new set of AI models designed to help robots understand and interact with the real world more effectively. These models are built to be better at processing visual information over time and space, and at reasoning about how to achieve goals in a physical environment.

What's the problem?

Existing AI models that understand both images and language aren't well suited to robots. A robot must not only 'see' and 'understand' a scene but also work out where things are, how they change over time, and what actions to take next. Current models struggle with these specific demands of being *embodied*, that is, having a physical presence and needing to act in the world.

What's the solution?

The researchers created two versions of HY-Embodied-0.5: a smaller, faster model (2B activated parameters) for running on robots directly, and a larger, more powerful model (32B activated parameters) for complex problem-solving. Both use a Mixture-of-Transformers (MoT) architecture, which gives each modality (such as vision and language) its own dedicated model weights so the models can capture fine-grained visual detail. To improve reasoning, they also developed an iterative, self-evolving post-training method in which the model repeatedly improves on its own outputs. Finally, they used on-policy distillation to transfer the large model's capabilities to the smaller one, boosting the compact model's performance. They tested these models on 22 benchmarks spanning perception, spatial reasoning, and embodied understanding, and even used them to control a real robot.
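The "modality-specific" idea behind a Mixture-of-Transformers layer can be sketched in a few lines: all tokens share one attention step (so vision and text can inform each other), but each modality's tokens pass through their own feed-forward weights. This is a minimal illustrative sketch, not the paper's actual architecture; the class name, expert layout, and shapes are assumptions.

```python
import torch
import torch.nn as nn

class MoTLayer(nn.Module):
    """One Mixture-of-Transformers-style layer (illustrative sketch).

    Attention is shared across the whole multimodal sequence, while the
    feed-forward computation is modality-specific: vision tokens and text
    tokens are routed through separate expert weights.
    """

    def __init__(self, dim: int, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.experts = nn.ModuleDict({
            "vision": nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)),
            "text": nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)),
        })

    def forward(self, x: torch.Tensor, is_vision: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim); is_vision: (seq_len,) bool mask per token.
        # Shared self-attention: vision and text tokens attend to each other.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out
        # Modality-specific feed-forward: each token goes through the expert
        # weights for its own modality.
        h = self.norm2(x)
        out = torch.empty_like(h)
        out[:, is_vision] = self.experts["vision"](h[:, is_vision])
        out[:, ~is_vision] = self.experts["text"](h[:, ~is_vision])
        return x + out
```

The design choice here is the trade-off the abstract points at: attention stays shared so the modalities remain fused, while the per-modality experts let visual tokens keep a finer-grained representation than a single shared transformer would.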

Why it matters?

This work is important because it brings us closer to robots that can truly understand and navigate the world around them. By creating AI models specifically for robots, and demonstrating strong performance on various tasks and even in real-world robot control, this research paves the way for more capable and helpful robots in the future. The open-sourcing of the code and models allows other researchers to build upon this work.

Abstract

We introduce HY-Embodied-0.5, a family of foundation models specifically designed for real-world embodied agents. To bridge the gap between general Vision-Language Models (VLMs) and the demands of embodied agents, our models are developed to enhance the core capabilities required by embodied intelligence: spatial and temporal visual perception, alongside advanced embodied reasoning for prediction, interaction, and planning. The HY-Embodied-0.5 suite comprises two primary variants: an efficient model with 2B activated parameters designed for edge deployment, and a powerful model with 32B activated parameters targeted for complex reasoning. To support the fine-grained visual perception essential for embodied tasks, we adopt a Mixture-of-Transformers (MoT) architecture to enable modality-specific computing. By incorporating latent tokens, this design effectively enhances the perceptual representation of the models. To improve reasoning capabilities, we introduce an iterative, self-evolving post-training paradigm. Furthermore, we employ on-policy distillation to transfer the advanced capabilities of the large model to the smaller variant, thereby maximizing the performance potential of the compact model. Extensive evaluations across 22 benchmarks, spanning visual perception, spatial reasoning, and embodied understanding, demonstrate the effectiveness of our approach. Our MoT-2B model outperforms similarly sized state-of-the-art models on 16 benchmarks, while the 32B variant achieves performance comparable to frontier models such as Gemini 3.0 Pro. In downstream robot control experiments, we leverage our robust VLM foundation to train an effective Vision-Language-Action (VLA) model, achieving compelling results in real-world physical evaluations. Code and models are open-sourced at https://github.com/Tencent-Hunyuan/HY-Embodied.
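The on-policy distillation mentioned in the abstract can be sketched as a simple training step: the small student model generates its own samples, the large teacher scores those same sequences, and the student is trained to match the teacher's token distribution. This is a hedged sketch under assumptions; the function name, toy model interface, and choice of KL direction are illustrative, not the paper's actual recipe.

```python
import torch
import torch.nn.functional as F

def on_policy_distill_step(student, teacher, prompts, optimizer):
    """One illustrative on-policy distillation update.

    `student` and `teacher` are assumed to map token ids (B, T) to logits
    (B, T, V), and `student.generate(prompts)` is assumed to sample
    continuations from the student itself (that is what makes it on-policy).
    """
    # 1) Sample sequences from the *student*, so training data matches the
    #    distribution the student actually produces at inference time.
    with torch.no_grad():
        seqs = student.generate(prompts)      # (B, T) token ids
        teacher_logits = teacher(seqs)        # (B, T, V)

    # 2) Train the student to match the teacher's next-token distribution
    #    on its own samples (KL between the two distributions).
    student_logits = student(seqs)
    loss = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.log_softmax(teacher_logits, dim=-1),
        log_target=True,
        reduction="batchmean",
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The contrast with ordinary (off-policy) distillation is that the teacher is queried on sequences the student generated, rather than on a fixed dataset, so the student gets feedback exactly where its own behavior would otherwise drift.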