When Good Sounds Go Adversarial: Jailbreaking Audio-Language Models with Benign Inputs

Bodam Kim, Hiskias Dingeto, Taeyoun Kwon, Dasol Choi, DongGeon Lee, Haon Park, JaeHoon Lee, Jongho Shin

2025-08-12

Summary

This paper introduces WhisperInject, a new method that uses tiny, hidden changes in harmless-sounding audio to trick advanced audio-language models into generating harmful or unsafe content. It works in two steps: first, it uses a specialized reinforcement-learning method to discover harmful responses the model is already inclined to produce; then it hides those harmful responses inside ordinary audio that sounds completely normal to human listeners.

What's the problem?

Audio-language models, which understand and respond to spoken input, can be fooled by carefully crafted sounds that humans cannot tell apart from the originals. These imperceptible changes can cause the model to break its safety rules and produce dangerous or harmful text. This is a serious risk because the attacks are hard for humans to detect and slip past traditional safety filters that only inspect text.

What's the solution?

The paper introduces a two-stage approach called WhisperInject. In the first stage, it uses a new type of reinforcement learning to find the harmful responses the model itself is likely to produce naturally. In the second stage, it uses projected gradient descent (PGD) to embed these harmful responses invisibly into normal audio, such as a simple spoken sentence about the weather. The resulting adversarial audio sounds normal to people but reliably causes the model to generate harmful outputs.
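The second stage relies on projected gradient descent, a standard adversarial-attack technique. The sketch below is a generic, minimal PGD loop over an audio waveform, not the authors' implementation: it assumes access to the gradient of some attack loss with respect to the input (here a toy quadratic loss standing in for "make the model emit the target response") and keeps every perturbed sample within a small L-infinity budget `eps` so the change stays inaudible.

```python
import numpy as np

def pgd_attack(x, loss_grad, eps=0.01, alpha=0.002, steps=100):
    """Projected gradient descent under an L-infinity budget.

    x         : clean audio waveform (1-D float array)
    loss_grad : function returning d(loss)/d(input) at (x + delta)
    eps       : maximum per-sample perturbation (imperceptibility budget)
    alpha     : step size per iteration
    """
    delta = np.zeros_like(x)
    for _ in range(steps):
        g = loss_grad(x + delta)
        delta = delta - alpha * np.sign(g)   # step downhill on the attack loss
        delta = np.clip(delta, -eps, eps)    # project back into the budget
    return x + delta

# Toy stand-in for the real objective: drive the waveform toward a
# hypothetical target signal (the true loss would come from the model).
rng = np.random.default_rng(0)
x = rng.standard_normal(16000) * 0.1                  # 1 s of 16 kHz "audio"
target = x + rng.uniform(-0.008, 0.008, x.shape)      # target within the budget
loss_grad = lambda a: 2.0 * (a - target)              # gradient of ||a - target||^2
x_adv = pgd_attack(x, loss_grad, eps=0.01)
```

Because of the final clipping step, the perturbation `x_adv - x` can never exceed `eps` at any sample, which is what keeps the attack imperceptible regardless of how aggressive the loss gradient is.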

Why it matters?

This matters because, as audio-based AI becomes common, ensuring these systems are safe and trustworthy is essential. WhisperInject exposes a hidden vulnerability that current protections miss, showing the need for more advanced defenses in audio-based AI. Improving safety here will help prevent misuse of voice-controlled assistants and other speech technologies.

Abstract

WhisperInject uses RL-PGD and PGD to create imperceptible audio perturbations that manipulate audio-language models into generating harmful content.