SAKE: Towards Editing Auditory Attribute Knowledge of Large Audio-Language Models
Chih-Kai Yang, Yen-Ting Piao, Tzu-Wen Hsu, Szu-Wei Fu, Zhehuai Chen, Ke-Han Lu, Sung-Feng Huang, Chao-Han Huck Yang, Yu-Chiang Frank Wang, Yun-Nung Chen, Hung-yi Lee
2025-10-24
Summary
This paper introduces a new way to test how well we can change what large audio-language models 'know' about sounds, specifically attributes like how loud or fast a sound is, rather than just facts like a name or a date.
What's the problem?
Currently, most research on updating a model's knowledge focuses on text or images. Little work has examined how to efficiently change what these models understand about *sounds*. It's harder than changing a simple fact because sound understanding involves more abstract qualities, and changing one attribute of a sound shouldn't disturb other things the model knows about it. Edits also need to hold up when multiple changes are made one after another.
What's the solution?
The researchers created a benchmark called SAKE, a set of tests for evaluating how well different methods can edit auditory attribute knowledge in large audio-language models. They applied seven editing methods to two models and measured four things: whether each edit actually took effect (reliability), whether it held up on rephrased or related queries (generality), whether unrelated audio and text knowledge was left untouched (locality), and whether the edited knowledge could be used in further reasoning (portability).
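To make the four dimensions concrete, here is a minimal, hypothetical sketch of how a single edit might be scored. The query sets, the answer format, and the `edited_model` callable are illustrative assumptions for this example, not the benchmark's actual interface or data.

```python
# Hypothetical scoring sketch for one knowledge edit along the four
# dimensions reported in the paper. All names and data here are assumptions.

from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class EditScores:
    reliability: float   # edit succeeds on the original edit query
    generality: float    # edit holds on rephrased / related queries
    locality: float      # unrelated audio/text knowledge is unchanged
    portability: float   # edited knowledge supports further reasoning


def accuracy(model: Callable[[str], str], cases: List[Tuple[str, str]]) -> float:
    """Fraction of (query, expected_answer) pairs the model answers correctly."""
    if not cases:
        return 0.0
    return sum(model(q).strip().lower() == a.strip().lower() for q, a in cases) / len(cases)


def evaluate_edit(model, edit_cases, paraphrase_cases, locality_cases, portability_cases) -> EditScores:
    return EditScores(
        reliability=accuracy(model, edit_cases),
        generality=accuracy(model, paraphrase_cases),
        # For locality, the expected answers are the model's pre-edit answers.
        locality=accuracy(model, locality_cases),
        portability=accuracy(model, portability_cases),
    )


if __name__ == "__main__":
    # Toy stand-in for an edited audio-language model: maps a query to an answer.
    edited_model = lambda q: {"Is the speech fast or slow?": "fast"}.get(q, "unknown")

    scores = evaluate_edit(
        edited_model,
        edit_cases=[("Is the speech fast or slow?", "fast")],
        paraphrase_cases=[("What is the speaking rate?", "fast")],
        locality_cases=[("Who is the speaker?", "unknown")],
        portability_cases=[("Would a listener struggle to keep up?", "yes")],
    )
    print(scores)
```

In practice each dimension would be averaged over many audio-grounded queries and edits (including sequential edits), but the basic idea is the same: separate query sets probe the edit itself, its generalization, its side effects, and its downstream use.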
Why it matters?
This work is important because it opens up a new area of research for improving these models. Being able to easily update what models 'know' about sounds is crucial for making them more useful in real-world applications like voice assistants, sound editing software, and systems that must make sense of complex audio environments. It helps us move beyond models that just understand text and images to ones that can truly understand the world around us through sound.
Abstract
Knowledge editing offers an efficient way to update model knowledge without full retraining, but prior work has concentrated almost exclusively on textual or visual modalities. We introduce SAKE, the first benchmark specifically designed for editing auditory attribute knowledge in Large Audio-Language Models (LALMs). Unlike factual updates, SAKE targets several abstract auditory attributes, capturing knowledge types that go beyond conventional textual and visual domains. We benchmark seven editing methods on two LALMs along four dimensions: reliability, generality, audio/text locality, and portability. Results highlight challenges such as preserving intra-attribute knowledge unrelated to the edit, generalizing edits to multimodal reasoning, and maintaining edits under sequential updates. SAKE provides a principled framework to study how knowledge editing extends to the auditory modality, opening new directions for maintaining and adapting LALMs in more diverse real-world scenarios.