What Matters in Detecting AI-Generated Videos like Sora?
Chirui Chang, Zhengzhe Liu, Xiaoyang Lyu, Xiaojuan Qi
2024-07-03

Summary
This paper presents a study of how to detect AI-generated videos, focusing on the differences between real videos and those produced by models such as Sora and Stable Video Diffusion.
What's the problem?
The main problem is that as AI video generation improves, it becomes harder to tell real videos apart from AI-generated ones. Because these videos can look very realistic, they can fuel confusion and misinformation, so understanding how to detect them is crucial to prevent potential misuse.
What's the solution?
To tackle this issue, the researchers analyzed the differences between real and AI-generated videos from three key angles: appearance (how they look), motion (how things move), and geometry (the depth and structure of the scenes). They trained three classifiers based on 3D convolutional networks, one per aspect: vision foundation model features for appearance, optical flow for motion, and monocular depth for geometry. Their findings showed that even though AI-generated videos can be very convincing, each classifier still finds noticeable differences that help identify them. They also developed an Ensemble-of-Experts model that combines all three detectors for better robustness and generalization.
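The ensemble idea above can be sketched in a few lines. This is a minimal toy version, not the authors' implementation: each 3D-CNN expert is stubbed as a linear score followed by a sigmoid, and the fusion rule (simple probability averaging) is an illustrative assumption.

```python
import numpy as np

def expert_score(features: np.ndarray, weights: np.ndarray) -> float:
    """Toy stand-in for one 3D-CNN expert: linear score + sigmoid.
    In the paper, the three experts consume appearance features,
    optical flow, and monocular depth, respectively."""
    logit = float(features.flatten() @ weights.flatten())
    return 1.0 / (1.0 + np.exp(-logit))

def ensemble_of_experts(appearance, motion, depth, w_a, w_m, w_d) -> float:
    """Fuse the three expert fake-probabilities by averaging
    (one simple fusion rule; the actual scheme may differ)."""
    scores = [
        expert_score(appearance, w_a),
        expert_score(motion, w_m),
        expert_score(depth, w_d),
    ]
    return sum(scores) / len(scores)

# Illustrative random "features" and "weights" for one video clip.
rng = np.random.default_rng(0)
feats = {k: rng.standard_normal((4, 8)) for k in ("a", "m", "d")}
ws = {k: 0.1 * rng.standard_normal((4, 8)) for k in ("a", "m", "d")}
p_fake = ensemble_of_experts(feats["a"], feats["m"], feats["d"],
                             ws["a"], ws["m"], ws["d"])
print(f"ensemble fake-probability: {p_fake:.3f}")
```

The averaged probability stays in [0, 1]; a threshold (e.g. 0.5) would then decide real vs. fake.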
Why it matters?
This research is important because it helps us understand how to spot fake videos, which is essential in a world where misinformation can spread easily. By improving our ability to detect AI-generated content, we can protect ourselves from deception and ensure that people can trust what they see in videos.
Abstract
Recent advancements in diffusion-based video generation have showcased remarkable results, yet the gap between synthetic and real-world videos remains under-explored. In this study, we examine this gap from three fundamental perspectives: appearance, motion, and geometry, comparing real-world videos with those generated by a state-of-the-art AI model, Stable Video Diffusion. To achieve this, we train three classifiers using 3D convolutional networks, each targeting distinct aspects: vision foundation model features for appearance, optical flow for motion, and monocular depth for geometry. Each classifier exhibits strong performance in fake video detection, both qualitatively and quantitatively. This indicates that AI-generated videos are still easily detectable, and a significant gap between real and fake videos persists. Furthermore, utilizing Grad-CAM, we pinpoint systematic failures of AI-generated videos in appearance, motion, and geometry. Finally, we propose an Ensemble-of-Experts model that integrates appearance, optical flow, and depth information for fake video detection, resulting in enhanced robustness and generalization ability. Our model is capable of detecting videos generated by Sora with high accuracy, even without exposure to any Sora videos during training. This suggests that the gap between real and fake videos can be generalized across various video generative models. Project page: https://justin-crchang.github.io/3DCNNDetection.github.io/
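The abstract uses Grad-CAM to localize where the classifiers find evidence of fakery. The core Grad-CAM computation (gradient-weighted, ReLU-ed sum of feature maps) can be sketched as follows; this is a minimal NumPy version in which the gradients are stubbed with random arrays rather than computed by backpropagation through an actual network.

```python
import numpy as np

def grad_cam(activations: np.ndarray, gradients: np.ndarray) -> np.ndarray:
    """Grad-CAM for a 3D conv layer.
    activations: (C, T, H, W) feature maps from the chosen layer.
    gradients:   (C, T, H, W) gradient of the class score w.r.t. them.
    Returns a (T, H, W) localization map."""
    # Per-channel importance: global average pool of the gradients.
    weights = gradients.mean(axis=(1, 2, 3))            # shape (C,)
    # Weighted combination of feature maps over the channel axis.
    cam = np.tensordot(weights, activations, axes=(0, 0))  # (T, H, W)
    # ReLU keeps only regions that positively support the class.
    return np.maximum(cam, 0.0)

# Illustrative stand-ins for one clip's layer outputs and gradients.
rng = np.random.default_rng(1)
acts = np.abs(rng.standard_normal((16, 4, 7, 7)))
grads = rng.standard_normal((16, 4, 7, 7))
cam = grad_cam(acts, grads)
```

Upsampling the resulting map to the input resolution and overlaying it on the frames is what yields the failure-region visualizations the paper describes.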