GR-3 Technical Report
Chilam Cheang, Sijin Chen, Zhongren Cui, Yingdong Hu, Liqun Huang, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Xiao Ma, Hao Niu, Wenxuan Ou, Wanli Peng, Zeyu Ren, Haixin Shi, Jiawen Tian, Hongtao Wu, Xin Xiao, Yuyang Xiao, Jiafeng Xu, Yichu Yang
2025-07-22
Summary
This report introduces GR-3, a large vision-language-action (VLA) model that gives robots the ability to understand their surroundings and follow complex instructions to perform tasks.
What's the problem?
The problem is that many robot-learning systems generalize poorly to new objects, environments, and instructions involving abstract concepts, and they require large amounts of robot data to learn each new task.
What's the solution?
The authors trained GR-3 with a multi-faceted recipe: co-training on web-scale vision-language data, efficient fine-tuning on human trajectory data collected via virtual reality devices, and imitation learning on robot trajectory data. This recipe makes GR-3 good at generalizing to new situations and at completing long-horizon, dexterous tasks, including those requiring bi-manual manipulation or moving around a space. They also designed ByteMini, a versatile bi-manual mobile robot, to work with GR-3.
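To make the co-training idea concrete, below is a minimal, hypothetical sketch of how batches from the three data sources might be interleaved during training. The source names, sampling weights, and loss stubs are illustrative assumptions for exposition, not GR-3's actual implementation.

```python
# Hypothetical sketch of multi-source co-training for a VLA model.
# Source names, weights, and loss stubs are assumptions, not GR-3's code.
import random

# The three data sources from the training recipe, with assumed weights.
SOURCES = {
    "web_vision_language": 0.5,    # web-scale image-text data
    "human_vr_trajectories": 0.2,  # human demos collected via VR devices
    "robot_trajectories": 0.3,     # robot trajectory (imitation) data
}

def sample_source() -> str:
    """Pick the data source for the next batch, proportional to its weight."""
    names, weights = zip(*SOURCES.items())
    return random.choices(names, weights=weights, k=1)[0]

def compute_loss(source: str, batch: dict) -> float:
    """Route the batch to a source-appropriate objective (stubbed here)."""
    if source == "web_vision_language":
        return 0.0  # placeholder: vision-language loss on image-text pairs
    return 0.0      # placeholder: action-imitation loss on trajectories

# Toy co-training loop: interleaving sources lets the policy retain
# vision-language knowledge while learning to predict robot actions.
for step in range(10):
    source = sample_source()
    loss = compute_loss(source, batch={})
    print(f"step {step}: source={source}, loss={loss:.3f}")
```

The design intuition is that vision-language batches preserve semantic generalization while trajectory batches teach manipulation, so a single policy benefits from all three streams.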
Why it matters?
This matters because GR-3 is a step toward more capable and reliable robots that can assist humans in everyday life, following instructions and interacting with the physical world in flexible ways.
Abstract
GR-3, a large-scale vision-language-action model, demonstrates exceptional generalization to novel objects, environments, and instructions, can be efficiently fine-tuned with minimal human trajectory data, and outperforms state-of-the-art methods on long-horizon and dexterous tasks.