ExposeAnyone: Personalized Audio-to-Expression Diffusion Models Are Robust Zero-Shot Face Forgery Detectors

Kaede Shiohara, Toshihiko Yamasaki, Vladislav Golyanik

2026-01-07

Summary

This paper introduces a new way to detect deepfakes, which are fake videos created using artificial intelligence, without needing to be specifically trained on examples of those fakes.

What's the problem?

Detecting deepfakes is hard because current methods rely on learning what *existing* deepfakes look like. If someone creates a new type of deepfake, the detectors often fail: they essentially memorize the tricks used in old fakes instead of learning to spot anything that doesn't look right about a real face. Self-supervised methods that try to avoid this by learning only from real videos haven't been very good at actually identifying fakes.

What's the solution?

The researchers developed a system called ExposeAnyone that uses a 'diffusion model'. Think of it like this: the system learns how a face moves and changes expression based on audio, so it can generate realistic facial movements from speech. It is also 'personalized' to specific people using a small set of reference videos of each person. When you show it a suspect video, it checks how well the facial movements in that video match what the personalized model would reconstruct from the audio. A large mismatch (a high reconstruction error) suggests the video might be a deepfake of that person.
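To make the scoring idea concrete, here is a minimal sketch of reconstruction-error-based detection. This is not the paper's implementation: the function names, the mean-squared-error metric, the threshold value, and the toy data are all illustrative assumptions, standing in for the diffusion reconstruction errors the paper computes.

```python
import numpy as np

def reconstruction_error(observed, reconstructed):
    """Mean squared error between the observed expression sequence and
    the sequence reconstructed by the personalized model
    (arrays of shape: frames x expression coefficients)."""
    return float(np.mean((observed - reconstructed) ** 2))

def classify(observed, reconstructed, threshold):
    """Flag the video as a suspected fake when the reconstruction error
    exceeds a calibrated threshold; returns (is_fake, score)."""
    score = reconstruction_error(observed, reconstructed)
    return score > threshold, score

# Toy data: 100 frames of 64 expression coefficients.
rng = np.random.default_rng(0)
observed = rng.normal(size=(100, 64))

# A model personalized to the real subject reconstructs their motion
# closely (small residual noise)...
recon_genuine = observed + rng.normal(scale=0.05, size=observed.shape)
# ...but reconstructs a forged face's motion poorly (large residual).
recon_forged = observed + rng.normal(scale=0.8, size=observed.shape)

is_fake_genuine, score_genuine = classify(observed, recon_genuine, threshold=0.1)
is_fake_forged, score_forged = classify(observed, recon_forged, threshold=0.1)
```

In this toy setup the genuine video scores well below the threshold and the forged one well above it; in the actual method the score comes from the diffusion model's reconstruction of audio-driven expressions, and the decision rule is more than a single fixed threshold.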

Why does it matter?

This research matters because it significantly improves deepfake detection, even for brand-new kinds of fakes such as videos generated by the Sora 2 model, which previous methods struggled with. It also stays reliable when videos are blurry or compressed, making it practical for real-world use. That is a meaningful step toward being able to trust the videos we see online.

Abstract

Detecting unknown deepfake manipulations remains one of the most challenging problems in face forgery detection. Current state-of-the-art approaches fail to generalize to unseen manipulations, as they primarily rely on supervised training with existing deepfakes or pseudo-fakes, which leads to overfitting to specific forgery patterns. In contrast, self-supervised methods offer greater potential for generalization, but existing work struggles to learn discriminative representations only from self-supervision. In this paper, we propose ExposeAnyone, a fully self-supervised approach based on a diffusion model that generates expression sequences from audio. The key idea is, once the model is personalized to specific subjects using reference sets, it can compute the identity distances between suspected videos and personalized subjects via diffusion reconstruction errors, enabling person-of-interest face forgery detection. Extensive experiments demonstrate that 1) our method outperforms the previous state-of-the-art method by 4.22 percentage points in the average AUC on DF-TIMIT, DFDCP, KoDF, and IDForge datasets, 2) our model is also capable of detecting Sora2-generated videos, where the previous approaches perform poorly, and 3) our method is highly robust to corruptions such as blur and compression, highlighting the applicability in real-world face forgery detection.