AnyAnomaly: Zero-Shot Customizable Video Anomaly Detection with LVLM

Sunghyun Ahn, Youngwan Jo, Kijung Lee, Sein Kwon, Inpyo Hong, Sanghyun Park

2025-03-10

Summary

This paper introduces AnyAnomaly, a new AI system that can detect unusual events in videos based on text descriptions provided by users, without needing to be retrained for each new environment.

What's the problem?

Current video anomaly detection systems are trained to recognize specific patterns of normal behavior, which means they don't work well in new situations. To use these systems somewhere else, people need to retrain them or build new ones, which takes a lot of time, money, and expert knowledge.

What's the solution?

The researchers created AnyAnomaly, which uses a large vision-language model that understands both images and text. Users describe what they consider unusual in words, and AnyAnomaly asks the model whether each part of the video matches that description, without any fine-tuning. They tested AnyAnomaly on several datasets and found it worked very well, even beating other methods on some benchmarks.
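The core idea can be sketched in a few lines: treat the user's text as a yes/no question and ask a vision-language model about each frame. This is a minimal illustration, not the paper's actual implementation; the `vqa` callable here is a hypothetical stand-in for a real LVLM query, and the stub used in the usage example just returns fixed scores.

```python
def build_prompt(event_text):
    """Turn a user-defined abnormal event into a yes/no VQA prompt."""
    return (f"Does this frame contain the following event: "
            f"'{event_text}'? Answer yes or no.")

def anomaly_scores(frames, event_text, vqa, threshold=0.5):
    """Score each frame against the user-defined event.

    `vqa(frame, prompt)` is assumed to return a confidence in [0, 1]
    that the frame matches the prompt. No retraining is involved:
    changing the target anomaly only changes the text prompt.
    """
    prompt = build_prompt(event_text)
    scores = [vqa(frame, prompt) for frame in frames]
    flags = [score >= threshold for score in scores]
    return scores, flags

if __name__ == "__main__":
    # Placeholder "frames" and a stub model in place of a real LVLM.
    frames = ["frame_0", "frame_1", "frame_2"]
    stub_vqa = lambda frame, prompt: 0.9 if frame == "frame_1" else 0.1
    scores, flags = anomaly_scores(frames, "a person riding a bicycle",
                                   stub_vqa)
    print(flags)
```

Because the anomaly is defined purely by the prompt, the same pipeline handles "a person running" in one deployment and "a car on the sidewalk" in another, with no new training data.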

Why it matters?

This matters because it makes video surveillance more flexible and easier to adapt to new situations. Instead of building a new AI model for each place or each type of unusual event, people can simply describe what they're looking for in words. This could make video monitoring systems more useful and accessible for applications like security, safety, and traffic management.

Abstract

Video anomaly detection (VAD) is crucial for video analysis and surveillance in computer vision. However, existing VAD models rely on learned normal patterns, which makes them difficult to apply to diverse environments. Consequently, users must retrain models or develop separate AI models for new environments, which requires expertise in machine learning, high-performance hardware, and extensive data collection, limiting the practical usability of VAD. To address these challenges, this study proposes a customizable video anomaly detection (C-VAD) technique and the AnyAnomaly model. C-VAD considers user-defined text as an abnormal event and detects frames containing a specified event in a video. We effectively implemented AnyAnomaly using context-aware visual question answering without fine-tuning the large vision language model. To validate the effectiveness of the proposed model, we constructed C-VAD datasets and demonstrated the superiority of AnyAnomaly. Furthermore, our approach showed competitive performance on VAD benchmark datasets, achieving state-of-the-art results on the UBnormal dataset and outperforming other methods in generalization across all datasets. Our code is available online at github.com/SkiddieAhn/Paper-AnyAnomaly.