Code-as-Monitor: Constraint-aware Visual Programming for Reactive and Proactive Robotic Failure Detection
Enshen Zhou, Qi Su, Cheng Chi, Zhizheng Zhang, Zhongyuan Wang, Tiejun Huang, Lu Sheng, He Wang
2024-12-06

Summary
This paper introduces Code-as-Monitor (CaM), a framework that lets robots detect failures after they occur and anticipate foreseeable ones in real time. It does this by having a vision-language model generate code that checks constraints on what the robot sees.
What's the problem?
Robots often face unexpected problems during their tasks, and existing methods struggle to both react to these problems after they occur and prevent them before they happen. This dual challenge makes it hard for robots to operate effectively in complex environments.
What's the solution?
To address this issue, the authors developed CaM, which uses a vision-language model (VLM) to monitor robot activity in real time. They formulate reactive failure detection and proactive failure prevention as a unified set of spatio-temporal constraint satisfaction problems, and the VLM generates code that evaluates those constraints. CaM also introduces constraint elements, which abstract the relevant objects or their parts into compact geometric elements; this simplifies tracking and improves monitoring accuracy and efficiency. In experiments across three simulators and a real-world setting, CaM achieved a 28.7% higher success rate than baselines and reduced execution time by 31.8% under severe disturbances.
Why does it matter?
This research is important because it enhances the reliability of robotic systems, allowing them to work more effectively in dynamic environments where unexpected challenges can arise. By improving failure detection and prevention, CaM could lead to safer and more efficient robotic applications in industries like manufacturing, healthcare, and autonomous vehicles.
Abstract
Automatic detection and prevention of open-set failures are crucial in closed-loop robotic systems. Recent studies often struggle to simultaneously identify unexpected failures reactively after they occur and prevent foreseeable ones proactively. To this end, we propose Code-as-Monitor (CaM), a novel paradigm leveraging the vision-language model (VLM) for both open-set reactive and proactive failure detection. The core of our method is to formulate both tasks as a unified set of spatio-temporal constraint satisfaction problems and use VLM-generated code to evaluate them for real-time monitoring. To enhance the accuracy and efficiency of monitoring, we further introduce constraint elements that abstract constraint-related entities or their parts into compact geometric elements. This approach offers greater generality, simplifies tracking, and facilitates constraint-aware visual programming by leveraging these elements as visual prompts. Experiments show that CaM achieves a 28.7% higher success rate and reduces execution time by 31.8% under severe disturbances compared to baselines across three simulators and a real-world setting. Moreover, CaM can be integrated with open-loop control policies to form closed-loop systems, enabling long-horizon tasks in cluttered scenes with dynamic environments.
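To make the idea of evaluating a spatio-temporal constraint over geometric elements more concrete, here is a minimal Python sketch. It is not the authors' implementation: in CaM the monitoring code is generated by a VLM, whereas this example is hand-written, and every name in it (Frame, held_constraint, monitor, the "gripper" and "object" element keys, and the thresholds) is a hypothetical choice made for illustration. It assumes each tracked constraint element has already been reduced to a single 3D point per frame.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Point = Tuple[float, float, float]


@dataclass
class Frame:
    """One observation: a timestamp plus tracked constraint elements (hypothetical layout)."""
    t: float                     # time in seconds
    elements: Dict[str, Point]   # element name -> 3D point


def held_constraint(frame: Frame, max_gap: float = 0.05) -> bool:
    """Spatial part: the object point must stay within max_gap metres of the gripper point."""
    gap = math.dist(frame.elements["gripper"], frame.elements["object"])
    return gap <= max_gap


def monitor(
    frames: List[Frame],
    constraint: Callable[[Frame], bool],
    tolerance_s: float = 0.0,
) -> Optional[float]:
    """Temporal part: return the time at which a violation has persisted for
    at least tolerance_s seconds, or None if the constraint holds throughout."""
    violation_start: Optional[float] = None
    for frame in frames:
        if constraint(frame):
            violation_start = None      # constraint satisfied again, reset the timer
            continue
        if violation_start is None:
            violation_start = frame.t   # first frame of a new violation
        if frame.t - violation_start >= tolerance_s:
            return frame.t
    return None


# Usage: the object follows the gripper for two frames, then is left behind.
origin: Point = (0.0, 0.0, 0.0)
frames = [
    Frame(0.0, {"gripper": origin, "object": (0.0, 0.0, 0.01)}),
    Frame(1.0, {"gripper": origin, "object": (0.0, 0.0, 0.02)}),
    Frame(2.0, {"gripper": origin, "object": (0.0, 0.0, 0.20)}),
    Frame(3.0, {"gripper": origin, "object": (0.0, 0.0, 0.20)}),
]
failure_time = monitor(frames, held_constraint)
```

In this toy setup the spatial check and the temporal check are kept separate, so a different constraint function could be swapped in without touching the monitoring loop. A nonzero tolerance_s is one simple way to ignore single-frame tracking glitches; whether CaM does anything similar is not stated in the abstract.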