
Skyra: AI-Generated Video Detection via Grounded Artifact Reasoning

Yifei Li, Wenzhao Zheng, Yanran Zhang, Runze Sun, Yu Zheng, Lei Chen, Jie Zhou, Jiwen Lu

2025-12-18


Summary

This paper focuses on the growing problem of fake videos created by artificial intelligence and introduces a new system, Skyra, designed to detect these videos and explain *why* it believes they are fake.

What's the problem?

AI is getting really good at making videos that look real but are completely fabricated. Current methods for detecting these fake videos usually just say 'yes' or 'no'; they don't tell you *what* specifically in the video makes it look AI-generated. This lack of explanation makes it hard to trust the detection and to understand how to improve the technology. There also wasn't a large, well-labeled dataset available to train and test these detection systems effectively.

What's the solution?

The researchers created Skyra, a multimodal AI model that not only identifies AI-generated videos but also points out the specific visual flaws – like strange distortions or unnatural movements – that give them away. To train Skyra, they built a large dataset called ViF-CoT-4K, filled with fake videos and detailed human annotations of the specific artifacts present. They then used a two-stage training process to help Skyra learn to recognize these flaws, explain its reasoning, and accurately detect fake videos. They also created a new testing benchmark, ViF-Bench, to rigorously evaluate Skyra's performance.
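To make the idea of "detailed annotations of artifacts" a bit more concrete, here is a minimal, hypothetical sketch of what one artifact-annotated training example and its fine-tuning prompt might look like. The field names, artifact types, and helper function below are illustrative assumptions, not the paper's actual ViF-CoT-4K schema.

```python
# Hypothetical sketch of a single artifact-annotated training example,
# in the spirit of ViF-CoT-4K. All field names and values are assumptions
# for illustration only, not the paper's real data format.
training_example = {
    "video_path": "samples/generated_clip_0001.mp4",
    "label": "ai_generated",                      # binary detection target
    "artifacts": [                                # grounded visual evidence
        {
            "type": "temporal_inconsistency",
            "description": "The subject's left hand flickers between frames 42 and 48.",
            "frames": [42, 48],
        },
        {
            "type": "geometric_distortion",
            "description": "A background window frame bends unnaturally as the camera pans.",
            "frames": [10, 25],
        },
    ],
    # Chain-of-thought style explanation linking the artifacts to the verdict.
    "reasoning": (
        "The flickering hand and the warping window frame are physically "
        "implausible, so the clip is most likely AI-generated."
    ),
}


def build_sft_prompt(example: dict) -> str:
    """Turn an annotated example into an instruction/target pair for fine-tuning."""
    prompt = (
        "Watch the video and decide whether it is real or AI-generated. "
        "Point out any visual artifacts that support your answer."
    )
    target = example["reasoning"] + f"\nVerdict: {example['label']}"
    return prompt + "\n\n" + target


if __name__ == "__main__":
    print(build_sft_prompt(training_example))
```

The point of pairing artifact annotations with a written explanation is that the model is supervised not just on the final real/fake verdict, but on the visual evidence behind it, which is what lets it explain its decisions at inference time.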

Why it matters?

Being able to reliably detect and understand AI-generated fake videos is crucial because these videos can be used to spread misinformation, damage reputations, or even influence elections. Skyra’s ability to *explain* its detections is a big step forward, making the technology more trustworthy and helping researchers develop even better methods for identifying and combating the spread of deepfakes.

Abstract

The misuse of AI-driven video generation technologies has raised serious social concerns, highlighting the urgent need for reliable AI-generated video detectors. However, most existing methods are limited to binary classification and lack the necessary explanations for human interpretation. In this paper, we present Skyra, a specialized multimodal large language model (MLLM) that identifies human-perceivable visual artifacts in AI-generated videos and leverages them as grounded evidence for both detection and explanation. To support this objective, we construct ViF-CoT-4K for Supervised Fine-Tuning (SFT), which represents the first large-scale AI-generated video artifact dataset with fine-grained human annotations. We then develop a two-stage training strategy that systematically enhances our model's spatio-temporal artifact perception, explanation capability, and detection accuracy. To comprehensively evaluate Skyra, we introduce ViF-Bench, a benchmark comprising 3K high-quality samples generated by over ten state-of-the-art video generators. Extensive experiments demonstrate that Skyra surpasses existing methods across multiple benchmarks, while our evaluation yields valuable insights for advancing explainable AI-generated video detection.