Rapidly Adapting to New Voice Spoofing: Few-Shot Detection of Synthesized Speech Under Distribution Shifts
Ashi Garg, Zexin Cai, Henry Li Xinyuan, Leibny Paola García-Perera, Kevin Duh, Sanjeev Khudanpur, Matthew Wiesner, Nicholas Andrews
2025-08-20
Summary
This paper presents a new method for detecting fake speech that can adapt quickly to new, unseen types of fake speech, even when only a few examples are available.
What's the problem?
It's hard to detect fake speech, especially when the fake speech is made using synthesis methods, voices, languages, or audio conditions that weren't represented when the detection system was originally trained. This mismatch between training and test data is called a 'distribution shift'.
What's the solution?
The researchers created a special kind of network called a self-attentive prototypical network that can adapt using just a few examples drawn from the new distribution, getting better at spotting fake speech that differs from what it was initially trained on. Learning from a handful of examples like this is called 'few-shot learning'.
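The core idea of prototypical few-shot classification can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names are assumptions, and the plain-mean prototype is the vanilla prototypical-network baseline -- the paper's self-attentive variant instead aggregates the support embeddings with attention weights.

```python
import numpy as np

def prototypes(support_emb, support_labels):
    """Compute one 'prototype' per class: the mean of that class's
    support embeddings (vanilla prototypical network)."""
    classes = np.unique(support_labels)
    protos = np.stack(
        [support_emb[support_labels == c].mean(axis=0) for c in classes]
    )
    return classes, protos

def classify(query_emb, protos, classes):
    """Assign each query embedding to the class of the nearest
    prototype (Euclidean distance)."""
    # Pairwise distances: (num_queries, num_classes)
    d = np.linalg.norm(query_emb[:, None, :] - protos[None, :, :], axis=-1)
    return classes[d.argmin(axis=1)]
```

In a few-shot spoofing-detection setting, the support set would be the handful of in-distribution samples (e.g. 10 labeled bona fide/spoofed utterances from the new domain), embedded by a frozen or lightly adapted encoder; queries are then scored by their distance to the two class prototypes.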
Why does it matter?
This research is important because it helps build more reliable fake speech detectors that stay effective as new ways of creating fake speech emerge and make audio content online harder to trust. The method significantly improves detection accuracy when faced with these new challenges.
Abstract
We address the challenge of detecting synthesized speech under distribution shifts -- arising from unseen synthesis methods, speakers, languages, or audio conditions -- relative to the training data. Few-shot learning methods are a promising way to tackle distribution shifts by rapidly adapting on the basis of a few in-distribution samples. We propose a self-attentive prototypical network to enable more robust few-shot adaptation. To evaluate our approach, we systematically compare the performance of traditional zero-shot detectors and the proposed few-shot detectors, carefully controlling training conditions to introduce distribution shifts at evaluation time. In conditions where distribution shifts hamper zero-shot performance, our proposed few-shot adaptation technique can quickly adapt using as few as 10 in-distribution samples -- achieving up to a 32% relative EER reduction on Japanese-language deepfakes and a 20% relative reduction on the ASVspoof 2021 Deepfake dataset.
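The EER (equal error rate) reported above is the operating point where the false-acceptance rate (spoofed audio accepted as genuine) equals the false-rejection rate (genuine audio flagged as spoofed). A minimal sketch of how it can be computed from detector scores (the function name and the higher-score-means-genuine convention are illustrative assumptions, not from the paper):

```python
import numpy as np

def eer(bona_scores, spoof_scores):
    """Equal error rate, assuming higher scores mean 'more likely genuine'.

    Sweeps every observed score as a threshold and returns the error rate
    at the point where false-accept and false-reject rates are closest.
    """
    thresholds = np.sort(np.concatenate([bona_scores, spoof_scores]))
    # FAR: fraction of spoofed samples scored at/above the threshold.
    fars = np.array([np.mean(spoof_scores >= t) for t in thresholds])
    # FRR: fraction of bona fide samples scored below the threshold.
    frrs = np.array([np.mean(bona_scores < t) for t in thresholds])
    i = np.argmin(np.abs(fars - frrs))
    return (fars[i] + frrs[i]) / 2
```

The reductions in the abstract are relative: a 20% relative EER reduction would, for example, take a detector from 10% EER down to 8%.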