
VADER: Towards Causal Video Anomaly Understanding with Relation-Aware Large Language Models

Ying Cheng, Yu-Ho Lin, Min-Hung Chen, Fu-En Yang, Shang-Hong Lai

2025-11-11


Summary

This paper introduces VADER, a new system designed not just to *detect* unusual events in videos, but to actually *understand* why they are unusual and explain what is going on.

What's the problem?

Current video anomaly detection systems are good at spotting *that* something weird is happening and *where*, but they don't really explain *why* it's weird. They often miss the important connections between objects and how their interactions cause the anomaly. For example, a system might see someone fall, but not understand that they tripped over an object, which is the real reason for the fall.

What's the solution?

VADER combines several techniques. First, it identifies which parts of the video are anomalous. Then, it focuses on the moments *around* those anomalies to capture their context. It also analyzes how objects in the video interact with each other, building a 'relationship map'. Finally, it feeds all of this visual and relational information to a powerful language model (like the ones behind chatbots), which generates a detailed explanation of what happened and why it's unusual, and can even answer questions about the anomaly.
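To make the data flow concrete, here is a toy sketch of those four stages. This is purely illustrative: the real components (the anomaly scorer, the context sampler, the relation extractor, and the language model) are learned neural models, and all function names and values below are hypothetical stand-ins.

```python
def score_frames(frames):
    """Stand-in for the anomaly scorer: one anomaly score per frame.
    Here we simply pretend frames 4-6 contain the anomaly."""
    return [0.9 if 4 <= i <= 6 else 0.1 for i, _ in enumerate(frames)]

def sample_context(scores, threshold=0.5, window=2):
    """Stand-in for context-aware sampling: keep anomalous frames plus
    `window` frames on each side, so the cause and aftermath survive."""
    keep = set()
    for i, s in enumerate(scores):
        if s >= threshold:
            keep.update(range(max(0, i - window),
                              min(len(scores), i + window + 1)))
    return sorted(keep)

def extract_relations(frame_ids):
    """Stand-in for relation extraction: (subject, predicate, object)
    triples describing object interactions in the sampled frames."""
    return [("person", "trips_over", "box")] if frame_ids else []

def explain(relations):
    """Stand-in for the language model: turn relation triples into a
    causally grounded description of the anomaly."""
    subj, pred, obj = relations[0]
    return f"Anomaly: a {subj} {pred.replace('_', ' ')} a {obj}, causing the fall."

frames = list(range(10))          # placeholder "video" of 10 frames
scores = score_frames(frames)
context = sample_context(scores)  # anomalous frames plus surrounding context
print(explain(extract_relations(context)))
```

The key design point the sketch tries to surface is the context window: by sampling frames *around* the high-score region rather than only the anomalous frames themselves, the downstream model can see the cause (the box) before the effect (the fall).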

Why does it matter?

This work is important because it moves beyond simply identifying problems in videos to actually understanding them. This is crucial for applications like security surveillance, automated quality control, and self-driving cars, where it's not enough to know that *something* went wrong: you need to know *what* went wrong and *why* in order to prevent it from happening again.

Abstract

Video anomaly understanding (VAU) aims to provide detailed interpretation and semantic comprehension of anomalous events within videos, addressing limitations of traditional methods that focus solely on detecting and localizing anomalies. However, existing approaches often neglect the deeper causal relationships and interactions between objects, which are critical for understanding anomalous behaviors. In this paper, we propose VADER, an LLM-driven framework for Video Anomaly unDErstanding, which integrates keyframe object Relation features with visual cues to enhance anomaly comprehension from video. Specifically, VADER first applies an Anomaly Scorer to assign per-frame anomaly scores, followed by a Context-AwarE Sampling (CAES) strategy to capture the causal context of each anomalous event. A Relation Feature Extractor and a COntrastive Relation Encoder (CORE) jointly model dynamic object interactions, producing compact relational representations for downstream reasoning. These visual and relational cues are integrated with LLMs to generate detailed, causally grounded descriptions and support robust anomaly-related question answering. Experiments on multiple real-world VAU benchmarks demonstrate that VADER achieves strong results across anomaly description, explanation, and causal reasoning tasks, advancing the frontier of explainable video anomaly analysis.
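The abstract describes CORE as a contrastive relation encoder, but gives no equations. A common way to train such an encoder is an InfoNCE-style objective: pull a relation feature toward a positive pair (e.g., the same interaction observed in a nearby frame) and push it away from negatives. The sketch below shows that loss in its simplest form; the pairing scheme and the loss choice are assumptions, not details from the paper.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss: -log of the positive pair's softmax weight among
    all candidate pairs. Small when the anchor is far more similar to
    the positive than to any negative."""
    sims = [cosine(anchor, positive)] + [cosine(anchor, n) for n in negatives]
    exps = [math.exp(s / temperature) for s in sims]
    return -math.log(exps[0] / sum(exps))

# Hypothetical relation features (illustrative numbers only):
anchor    = [1.0, 0.0]    # "person trips over box" at frame t
positive  = [0.9, 0.1]    # the same interaction at frame t+1
negatives = [[0.0, 1.0], [-1.0, 0.0]]  # unrelated interactions
loss = info_nce(anchor, positive, negatives)
```

Minimizing this loss yields compact relation embeddings in which the same interaction maps to nearby points across frames, which is what makes them useful as conditioning input for the language model.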