Task Adaptation of a Vision-Language-Action Model: 1st Place Solution for the 2025 BEHAVIOR Challenge
Ilia Larchenko, Gleb Zarin, Akash Karnatak
2025-12-15
Summary
This paper describes a 'vision-action policy', a computer program that won first place in a robotics competition called the 2025 BEHAVIOR Challenge. The policy performs everyday household tasks in a realistic computer simulation, such as manipulating objects with two hands and navigating a virtual home.
What's the problem?
Creating a robot that can reliably perform complex, multi-step tasks like those found in a home is incredibly difficult. Existing programs often struggle with long sequences of actions, require a lot of training data, and can produce jerky or illogical movements. The BEHAVIOR Challenge specifically tests a robot's ability to handle these kinds of tasks in a diverse and demanding environment.
What's the solution?
The researchers built on an existing model framework called Pi0.5 and added several key improvements. Their main contribution is 'correlated noise' for flow matching, a technique that makes training more efficient and helps the policy produce smoother, more natural-looking actions. They also added a learnable attention mechanism that mixes information across layers, helping the model focus on the most important parts of the scene, and a 'System 2' stage tracker that resolves ambiguity about what to do next. Finally, they made the policy faster and more accurate at run time using action compression and correction rules tailored to the challenge.
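The core idea of 'correlated noise' can be illustrated with a minimal sketch. The paper's exact noise covariance is not given here, so the AR(1) construction below (with correlation parameter `rho` and the function name `correlated_noise`) is an assumption chosen for illustration: consecutive timesteps of the noise chunk are correlated, while each timestep individually remains standard normal, as flow matching expects.

```python
import numpy as np

def correlated_noise(horizon, dim, rho=0.9, rng=None):
    """Sample temporally correlated Gaussian noise for an action chunk.

    Illustrative AR(1) sketch (an assumption; the paper's exact
    covariance is not specified here): adjacent timesteps have
    correlation `rho`, yet every marginal stays N(0, 1), so the
    noise is a drop-in replacement for i.i.d. flow-matching noise.
    """
    rng = np.random.default_rng() if rng is None else rng
    eps = rng.standard_normal((horizon, dim))
    noise = np.empty_like(eps)
    noise[0] = eps[0]
    for t in range(1, horizon):
        # Mix the previous step with fresh noise; the sqrt term keeps
        # the marginal variance at exactly 1 for every timestep.
        noise[t] = rho * noise[t - 1] + np.sqrt(1.0 - rho**2) * eps[t]
    return noise

# Flow-matching training would then interpolate between this noise and
# a ground-truth action chunk, e.g. x_t = (1 - t) * noise + t * actions,
# with target velocity (actions - noise), exactly as with i.i.d. noise.
```

Because real robot trajectories are themselves temporally smooth, starting the flow from smooth noise plausibly shortens the path the model must learn, which is one intuition for the reported training-efficiency gain.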
Why does it matter?
This work represents a significant step forward in the field of robotic control. By achieving a top score in the BEHAVIOR Challenge, the researchers demonstrated a program capable of handling a wide range of complex tasks with a high degree of success. This could eventually lead to more capable and helpful robots in our homes and workplaces, assisting with chores and other everyday activities.
Abstract
We present a vision-action policy that won 1st place in the 2025 BEHAVIOR Challenge, a large-scale benchmark featuring 50 diverse long-horizon household tasks in photo-realistic simulation, requiring bimanual manipulation, navigation, and context-aware decision making. Building on the Pi0.5 architecture, we introduce several innovations. Our primary contribution is correlated noise for flow matching, which improves training efficiency and enables correlation-aware inpainting for smooth action sequences. We also apply learnable mixed-layer attention and System 2 stage tracking for ambiguity resolution. Training employs multi-sample flow matching to reduce variance, while inference uses action compression and challenge-specific correction rules. Our approach achieves a 26% q-score across all 50 tasks on both the public and private leaderboards.
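The abstract's 'correlation-aware inpainting' can be sketched under the same AR(1) assumption used above for the correlated noise. The function name `inpaint_ar1_noise` and the AR(1) conditional are illustrative assumptions, not the paper's exact method: the idea shown is that fixing the noise at steps overlapping the previously executed chunk and drawing the remaining steps from the conditional distribution keeps consecutive action chunks statistically consistent, hence smooth at the seam.

```python
import numpy as np

def inpaint_ar1_noise(prefix, horizon, rho=0.9, rng=None):
    """Conditionally extend a fixed noise prefix (hypothetical sketch).

    `prefix` holds noise values already committed, e.g. the steps that
    overlap the previous action chunk. For an AR(1) process the
    conditional given the prefix is simply the recursion continued from
    the last fixed step, so the full sequence keeps the same temporal
    correlation and the decoded chunks join without a jump.
    """
    rng = np.random.default_rng() if rng is None else rng
    k, dim = prefix.shape
    noise = np.empty((horizon, dim))
    noise[:k] = prefix          # keep the committed steps untouched
    for t in range(k, horizon):
        eps = rng.standard_normal(dim)
        noise[t] = rho * noise[t - 1] + np.sqrt(1.0 - rho**2) * eps
    return noise
```

With i.i.d. noise this trick is unavailable, since fixing a prefix tells the model nothing about the remaining steps; correlation is what makes the conditional informative.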