
Investigating Safety Vulnerabilities of Large Audio-Language Models Under Speaker Emotional Variations

Bo-Han Feng, Chien-Feng Liu, Yu-Hsuan Li Liang, Chih-Kai Yang, Szu-Wei Fu, Zhehuai Chen, Ke-Han Lu, Sung-Feng Huang, Chao-Han Huck Yang, Yu-Chiang Frank Wang, Yun-Nung Chen, Hung-yi Lee

2025-10-24


Summary

This paper examines how well large audio-language models (LALMs), which combine speech understanding with text-based language models, handle potentially harmful requests when those requests are spoken with different emotions.

What's the problem?

Current audio-language models are good at understanding what you *say*, but they haven't been thoroughly tested on *how* you say it. The researchers noticed that a speaker's emotional tone, such as sounding angry, sad, or happy, could change whether a model gives a safe or unsafe response to a potentially dangerous request. They suspected that models might be more easily tricked into doing something harmful if the request is delivered with a certain emotion.

What's the solution?

The researchers built a dataset of spoken instructions with harmful intent, each rendered with a variety of emotions and intensity levels. They then tested several state-of-the-art audio-language models on these emotionally varied requests. Different emotions led to noticeably different rates of unsafe responses, and the effect of intensity was not a simple "more emotion, more risk" pattern: moderate levels of emotional expression were often the most dangerous. A minimal sketch of this kind of evaluation loop is shown below.
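To make the evaluation protocol concrete, here is a minimal, hypothetical sketch of how one might tabulate unsafe-response rates per (emotion, intensity) cell. It is not the authors' code: the emotion and intensity sets are assumed, and `query_lalm` and `judge_unsafe` are illustrative stand-ins for the model API and safety judge actually used in the paper.

```python
# Hypothetical evaluation loop: spoken variants of each harmful instruction
# are grouped by (emotion, intensity); each variant is sent to the model
# under test and scored by a safety judge, then the unsafe-response rate
# per cell is reported.

from itertools import product

EMOTIONS = ("neutral", "angry", "sad", "happy")   # assumed emotion set
INTENSITIES = ("low", "medium", "high")           # assumed intensity levels

def query_lalm(audio_path: str) -> str:
    """Stand-in for the audio-language model under test."""
    return f"[model response to {audio_path}]"

def judge_unsafe(response: str) -> bool:
    """Stand-in for the safety judge (e.g., a rule-based or LLM-based classifier)."""
    return "harmful" in response.lower()

def unsafe_rates(variants: dict[tuple[str, str], list[str]]) -> dict[tuple[str, str], float]:
    """variants maps (emotion, intensity) -> audio files carrying the same instructions."""
    rates = {}
    for key, paths in variants.items():
        if not paths:
            continue
        unsafe = sum(judge_unsafe(query_lalm(p)) for p in paths)
        rates[key] = unsafe / len(paths)
    return rates

if __name__ == "__main__":
    # Toy input: one audio file per (emotion, intensity) cell.
    toy = {(e, i): [f"{e}_{i}_000.wav"] for e, i in product(EMOTIONS, INTENSITIES)}
    for (emotion, intensity), rate in sorted(unsafe_rates(toy).items()):
        print(f"{emotion:>8} / {intensity:<6} unsafe rate: {rate:.2f}")
```

Comparing cells of this table is what surfaces the paper's finding of non-monotonic intensity effects, for example a "medium" cell showing a higher unsafe rate than the corresponding "high" cell.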

Why it matters?

This research matters because it reveals a hidden weakness in these models. If a model's safety can be compromised simply by changing the speaker's emotional tone, it is not reliable for real-world use. The findings suggest that developers need to explicitly align these models to be robust to emotional variation, so they remain safe no matter how a request is expressed.

Abstract

Large audio-language models (LALMs) extend text-based LLMs with auditory understanding, offering new opportunities for multimodal applications. While their perception, reasoning, and task performance have been widely studied, their safety alignment under paralinguistic variation remains underexplored. This work systematically investigates the role of speaker emotion. We construct a dataset of malicious speech instructions expressed across multiple emotions and intensities, and evaluate several state-of-the-art LALMs. Our results reveal substantial safety inconsistencies: different emotions elicit varying levels of unsafe responses, and the effect of intensity is non-monotonic, with medium expressions often posing the greatest risk. These findings highlight an overlooked vulnerability in LALMs and call for alignment strategies explicitly designed to ensure robustness under emotional variation, a prerequisite for trustworthy deployment in real-world settings.