RoboChallenge: Large-scale Real-robot Evaluation of Embodied Policies
Adina Yakefu, Bin Xie, Chongyang Xu, Enwen Zhang, Erjin Zhou, Fan Jia, Haitao Yang, Haoqiang Fan, Haowei Zhang, Hongyang Peng, Jing Tan, Junwen Huang, Kai Liu, Kaixin Liu, Kefan Gu, Qinglun Zhang, Ruitao Zhang, Saike Huang, Shen Cheng, Shuaicheng Liu, Tiancai Wang, Tiezhen Wang
2025-11-05
Summary
This paper details the creation of RoboChallenge, a system designed for thoroughly testing robotic control algorithms on real robots, particularly learning-based approaches such as vision-language-action (VLA) models.
What's the problem?
Testing robotic algorithms is crucial, but it becomes incredibly difficult when you need to evaluate many different algorithms across a wide range of tasks. Simply put, it's hard to test a lot of 'brains' in a lot of 'bodies' reliably and consistently, and ensuring that tests can be repeated and produce the same results (reproducibility) adds yet another layer of complexity as the testing process scales up.
What's the solution?
The authors built RoboChallenge, an online platform that allows researchers to submit and test their robotic control algorithms. They then used this platform to evaluate several state-of-the-art VLA models using a specific set of tasks called Table30, creating a benchmark for comparing different approaches. Essentially, they created a standardized 'playground' for robotic algorithms and ran some initial tests to show how it works.
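To make the benchmarking idea concrete, here is a minimal sketch of how a platform like this might score submitted policies: each policy is run for several trials per task, and per-task success rates are averaged into an overall benchmark score. All names and structures below are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of benchmark-style result aggregation.
# Task names, data shapes, and the scoring rule are assumptions
# for illustration; RoboChallenge's real pipeline may differ.
from dataclasses import dataclass


@dataclass
class TrialResult:
    task: str      # e.g. one of the 30 tasks in a Table30-like suite
    success: bool  # did the policy complete the task on the real robot?


def aggregate(results: list[TrialResult]) -> tuple[dict[str, float], float]:
    """Compute per-task success rates and their unweighted mean."""
    per_task: dict[str, tuple[int, int]] = {}
    for r in results:
        wins, total = per_task.get(r.task, (0, 0))
        per_task[r.task] = (wins + int(r.success), total + 1)
    rates = {task: wins / total for task, (wins, total) in per_task.items()}
    overall = sum(rates.values()) / len(rates) if rates else 0.0
    return rates, overall


# Usage: two hypothetical tasks with mixed outcomes.
trials = [
    TrialResult("fold_towel", True),
    TrialResult("fold_towel", False),
    TrialResult("stack_cups", True),
]
rates, overall = aggregate(trials)
# rates["fold_towel"] == 0.5, rates["stack_cups"] == 1.0, overall == 0.75
```

Averaging per-task rates (rather than pooling all trials) keeps tasks with many trials from dominating the score; a real leaderboard might choose either convention.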
Why it matters?
This work is important because it provides a standardized and scalable way to test and compare robotic control algorithms. Having a common platform like RoboChallenge helps accelerate progress in the field by making it easier for researchers to share their work, reproduce results, and build upon each other's ideas. It’s like creating a common measuring stick for robot 'intelligence'.
Abstract
Testing on real machines is indispensable for robotic control algorithms. In the context of learning-based algorithms, especially VLA models, the demand for large-scale evaluation, i.e., testing a large number of models on a large number of tasks, is becoming increasingly urgent. However, doing this right is highly non-trivial, especially when scalability and reproducibility are taken into account. In this report, we describe our methodology for constructing RoboChallenge, an online evaluation system for testing robotic control algorithms, and our survey of recent state-of-the-art VLA models using our initial benchmark, Table30.