RoboCurate: Harnessing Diversity with Action-Verified Neural Trajectory for Robot Learning

Seungku Kim, Suhyeok Jang, Byungjun Yoon, Dongyoung Kim, John Won, Jinwoo Shin

2026-02-24

Summary

This paper introduces a new system called RoboCurate that improves how robots learn from videos created by computers, specifically focusing on making sure those videos are realistic and helpful for training.

What's the problem?

When robots learn from AI-generated videos, the videos aren't always accurate: the actions shown may not be physically possible or may not match what really happens, which makes it hard for the robot to learn effectively. Existing methods use vision-language models to check video quality, but these can only tell whether a video *looks* wrong; they cannot evaluate whether the *actions* in the video are actually executable by a robot.

What's the solution?

RoboCurate solves this by effectively 'testing' the actions in the generated videos. It takes the actions annotated in a video and replays them in a physics simulator, then compares what happens in the simulation with what is shown in the video, checking whether the movements are consistent. If they aren't, the video is filtered out. The system also increases the variety of the training data by editing the appearance of objects and scenes while keeping the actions unchanged.
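The replay-and-compare step above can be sketched as a simple filtering loop. This is a minimal illustration, not the paper's implementation: the function names (`motion_consistency`, `filter_trajectories`, `replay_fn`), the 2D trajectory representation, and the threshold value are all assumptions made for clarity.

```python
# Hypothetical sketch of RoboCurate-style action verification.
# Trajectories are lists of (x, y) object positions per frame; a real
# system would extract these from video and simulator rollouts.

def motion_consistency(video_traj, sim_traj):
    """Mean per-frame distance between the motion seen in the generated
    video and the motion produced by replaying the annotated actions in
    a simulator (lower = more consistent)."""
    assert len(video_traj) == len(sim_traj)
    total = 0.0
    for (vx, vy), (sx, sy) in zip(video_traj, sim_traj):
        total += ((vx - sx) ** 2 + (vy - sy) ** 2) ** 0.5
    return total / len(video_traj)


def filter_trajectories(candidates, replay_fn, threshold=0.05):
    """Keep only clips whose annotated actions, when replayed in the
    simulator, reproduce the motion shown in the video."""
    kept = []
    for clip in candidates:
        sim_traj = replay_fn(clip["actions"])  # physics-simulator replay
        score = motion_consistency(clip["video_traj"], sim_traj)
        if score <= threshold:  # inconsistent clips are filtered out
            kept.append(clip)
    return kept
```

The key design point is that the check is on the *actions*, not on how the video looks: a visually plausible clip whose annotated actions fail to reproduce the shown motion in simulation is still discarded.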

Why it matters?

This is important because it allows robots to learn much more effectively from computer-generated videos. The results show a significant improvement in the robot's ability to perform tasks, up to a 179.9% relative increase in success rate in a challenging real-world setting, compared to learning only from real-world videos or from unfiltered synthetic data. This means robots can be trained faster and more reliably, even for difficult tasks.

Abstract

Synthetic data generated by video generative models has shown promise for robot learning as a scalable pipeline, but it often suffers from inconsistent action quality due to imperfectly generated videos. Recently, vision-language models (VLMs) have been leveraged to validate video quality, but they have limitations in distinguishing physically accurate videos and, even then, cannot directly evaluate the generated actions themselves. To tackle this issue, we introduce RoboCurate, a novel synthetic robot data generation framework that evaluates and filters the quality of annotated actions by comparing them with simulation replay. Specifically, RoboCurate replays the predicted actions in a simulator and assesses action quality by measuring the consistency of motion between the simulator rollout and the generated video. In addition, we unlock observation diversity beyond the available dataset via image-to-image editing and apply action-preserving video-to-video transfer to further augment appearance. We observe RoboCurate's generated data yield substantial relative improvements in success rates compared to using real data only, achieving +70.1% on GR-1 Tabletop (300 demos), +16.1% on DexMimicGen in the pre-training setup, and +179.9% in the challenging real-world ALLEX humanoid dexterous manipulation setting.
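The abstract's "action-preserving" augmentation can be illustrated with a small sketch: each clip's frames are re-styled by an editing model while the action labels are copied through untouched. The `edit_fns` argument stands in for image-to-image or video-to-video editing models; all names here are illustrative assumptions, not the paper's API.

```python
# Illustrative sketch of action-preserving appearance augmentation.
# Each edit function restyles frames; the action labels never change,
# so the augmented clips stay valid for policy training.

def augment_appearance(clip, edit_fns):
    """Produce appearance-varied copies of a clip with identical actions."""
    variants = []
    for edit in edit_fns:
        variants.append({
            "frames": [edit(frame) for frame in clip["frames"]],
            "actions": clip["actions"],  # actions preserved verbatim
        })
    return variants
```

This separation, varying only observations while holding actions fixed, is what lets the augmented data add visual diversity without reintroducing the action-quality problem the filtering step removes.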