SOP: A Scalable Online Post-Training System for Vision-Language-Action Models

Mingjie Pan, Siyuan Feng, Qinglin Zhang, Xinchen Li, Jianheng Song, Chendi Qu, Yi Wang, Chuankang Li, Ziyu Xiong, Zhi Chen, Yi Liu, Jianlan Luo

2026-01-07

Summary

This paper introduces a system called SOP, which stands for Scalable Online Post-training, designed to improve how robots learn from real-world experience. It focuses on making robots better at performing a variety of tasks after they have already been pre-trained on large amounts of data.

What's the problem?

While robots can be pre-trained on massive datasets to give them a general understanding of the world, they often struggle when it comes to performing specific tasks with expert-level precision in real-life situations. Existing methods for improving robot performance after initial training are often limited because they require a lot of manual setup, only work with one robot at a time, or are designed for a single task. This makes it hard to adapt robots quickly and efficiently to new environments and tasks using real-world data.

What's the solution?

The SOP system solves this by connecting a fleet of robots directly to a central cloud learner. The robots continuously attempt tasks and stream information about their attempts, both successes and failures, back to the learner. If a robot needs help, a human can step in with corrections, and those intervention signals are sent back as well. The learner uses this data to update the robots' shared policy, then asynchronously sends the improved policy back to every robot in the fleet. This closed loop runs constantly, allowing the robots to learn and improve in real time, across multiple tasks, without losing their general abilities. The authors tested SOP on tasks like folding clothes, assembling boxes, and restocking shelves, using both interactive imitation learning (HG-DAgger) and reinforcement learning (RECAP).
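To make the closed loop concrete, here is a minimal sketch of the architecture described above: robot "actors" stream experience (including flags for human interventions) to a central learner, which periodically pushes an updated policy back to the whole fleet. This is an illustrative toy, not the paper's implementation; all names (`Policy`, `run_robot`, `run_learner`, `train_fleet`) and the threading/queue design are assumptions, and the "learning" step just bumps a version number where a real system would take a gradient step on the batch.

```python
import queue
import threading

class Policy:
    """Stand-in for a VLA policy; a version number replaces real weights."""
    def __init__(self, version=0):
        self.version = version

    def act(self, observation):
        # A real policy would map images + language to robot actions.
        return f"action(v{self.version}, {observation})"

def run_robot(robot_id, experience_q, policy_box, steps):
    """One robot: act with the latest policy, stream experience upstream."""
    for t in range(steps):
        policy = policy_box["policy"]      # asynchronously pull newest policy
        obs = f"obs-{robot_id}-{t}"
        action = policy.act(obs)
        intervened = (t % 3 == 0)          # pretend a human corrects some steps
        experience_q.put((robot_id, obs, action, intervened))

def run_learner(experience_q, policy_box, updates, batch_size):
    """Central learner: consume experience batches, publish updated policies."""
    for _ in range(updates):
        batch = [experience_q.get() for _ in range(batch_size)]
        # (a real learner would run a training step on `batch` here,
        #  weighting human-intervention samples as corrective supervision)
        policy_box["policy"] = Policy(policy_box["policy"].version + 1)

def train_fleet(num_robots=3, steps=6, updates=3, batch_size=6):
    """Run robots and learner concurrently; return the final policy version."""
    experience_q = queue.Queue()
    policy_box = {"policy": Policy()}      # shared slot, swapped asynchronously
    robots = [
        threading.Thread(target=run_robot,
                         args=(i, experience_q, policy_box, steps))
        for i in range(num_robots)
    ]
    learner = threading.Thread(target=run_learner,
                               args=(experience_q, policy_box, updates, batch_size))
    for r in robots:
        r.start()
    learner.start()
    for r in robots:
        r.join()
    learner.join()
    return policy_box["policy"].version

if __name__ == "__main__":
    print(train_fleet())  # three learner updates over 18 streamed experiences
```

The key design point mirrored here is the decoupling: robots never wait for training to finish, and the learner never waits for any single robot, which is what lets experience collection scale with fleet size.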

Why it matters?

This research matters because it shows a practical way to train robots to be genuinely useful in the real world. By letting robots learn continuously from experience, pooling data across a fleet, and scaling the learning process with the number of robots deployed, SOP makes robots more adaptable, reliable, and capable across a wide range of tasks. It suggests that tightly coupling online learning with fleet-scale deployment is key to making real-world robot learning practical and effective.

Abstract

Vision-language-action (VLA) models achieve strong generalization through large-scale pre-training, but real-world deployment requires expert-level task proficiency in addition to broad generality. Existing post-training approaches for VLA models are typically offline, single-robot, or task-specific, limiting effective on-policy adaptation and scalable learning from real-world interaction. We introduce a Scalable Online Post-training (SOP) system that enables online, distributed, multi-task post-training of generalist VLA models directly in the physical world. SOP tightly couples execution and learning through a closed-loop architecture in which a fleet of robots continuously streams on-policy experience and human intervention signals to a centralized cloud learner, and asynchronously receives updated policies. This design supports prompt on-policy correction, scales experience collection through parallel deployment, and preserves generality during adaptation. SOP is agnostic to the choice of post-training algorithm; we instantiate it with both interactive imitation learning (HG-DAgger) and reinforcement learning (RECAP). Across a range of real-world manipulation tasks including cloth folding, box assembly, and grocery restocking, we show that SOP substantially improves the performance of large pretrained VLA models while maintaining a single shared policy across tasks. Effective post-training can be achieved within hours of real-world interaction, and performance scales near-linearly with the number of robots in the fleet. These results suggest that tightly coupling online learning with fleet-scale deployment is instrumental to enabling efficient, reliable, and scalable post-training of generalist robot policies in the physical world.