Open-World Skill Discovery from Unsegmented Demonstrations
Jingwen Deng, Zihao Wang, Shaofei Cai, Anji Liu, Yitao Liang
2025-03-17
Summary
This paper introduces a method that automatically discovers skills in long, unedited gameplay videos (such as Minecraft tutorials) by detecting where one skill ends and the next begins.
What's the problem?
AI agents struggle to learn skills from raw videos because the videos are long and unsegmented. Splitting them by hand does not scale, and fixed splitting rules do not work well for open-world tasks.
What's the solution?
The method, Skill Boundary Detection (SBD), uses a pretrained model to predict the actions taken in a video, then treats sharp rises in prediction error as clues for where to split the video into skill segments, much as humans notice when a task changes.
Why it matters?
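This lets agents learn skills from freely available online videos without human labels. In Minecraft, the resulting segments improved performance on both short-term skills and long-horizon tasks, and the same idea could help train robots, game characters, or virtual assistants.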
Abstract
Learning skills in open-world environments is essential for developing agents capable of handling a variety of tasks by combining basic skills. Online demonstration videos are typically long but unsegmented, making them difficult to segment and label with skill identifiers. Unlike existing methods that rely on sequence sampling or human labeling, we have developed a self-supervised learning-based approach to segment these long videos into a series of semantic-aware and skill-consistent segments. Drawing inspiration from human cognitive event segmentation theory, we introduce Skill Boundary Detection (SBD), an annotation-free temporal video segmentation algorithm. SBD detects skill boundaries in a video by leveraging prediction errors from a pretrained unconditional action-prediction model. This approach is based on the assumption that a significant increase in prediction error indicates a shift in the skill being executed. We evaluated our method in Minecraft, a rich open-world simulator with extensive gameplay videos available online. Our SBD-generated segments improved the average performance of conditioned policies by 63.7% and 52.1% on short-term atomic skill tasks, and their corresponding hierarchical agents by 11.3% and 20.8% on long-horizon tasks. Our method can leverage diverse YouTube videos to train instruction-following agents. The project page can be found at https://craftjarvis.github.io/SkillDiscovery.
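The core idea in the abstract, that a significant increase in prediction error marks a skill boundary, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration rather than the authors' implementation: the per-frame errors are taken as already computed by some action-prediction model, and the function names `detect_boundaries` and `split_segments`, the `gap` threshold, and the running-mean comparison are all hypothetical choices.

```python
from typing import List, Sequence, Tuple


def detect_boundaries(
    errors: Sequence[float], gap: float = 1.0, min_len: int = 2
) -> List[int]:
    """Return frame indices where a new segment is assumed to start.

    Hypothetical rule: a frame opens a new segment when its prediction
    error exceeds the mean error of the current segment by more than
    `gap`, provided the current segment already has `min_len` frames.
    """
    boundaries: List[int] = []
    seg_start = 0   # index of the first frame in the current segment
    seg_sum = 0.0   # sum of errors accumulated in the current segment
    for t, err in enumerate(errors):
        seg_len = t - seg_start
        if seg_len >= min_len and err - seg_sum / seg_len > gap:
            boundaries.append(t)
            seg_start = t
            seg_sum = 0.0
        seg_sum += err
    return boundaries


def split_segments(n_frames: int, boundaries: Sequence[int]) -> List[Tuple[int, int]]:
    """Turn boundary indices into half-open (start, end) frame ranges."""
    starts = [0] + list(boundaries)
    ends = list(boundaries) + [n_frames]
    return list(zip(starts, ends))


# Toy per-frame errors with two sharp spikes (made-up numbers).
errors = [0.1, 0.1, 0.2, 0.1, 2.0, 0.2, 0.1, 0.1, 1.9, 0.2]
boundaries = detect_boundaries(errors)
segments = split_segments(len(errors), boundaries)
print(boundaries)  # [4, 8]
print(segments)    # [(0, 4), (4, 8), (8, 10)]
```

Comparing each frame against the running mean of its own segment, rather than against a global threshold, is one simple way to make "significant increase" relative to the skill currently being executed. In this sketch the spike frame is counted in the new segment's running mean, so a real system would likely need a more careful baseline.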