Activation Approximations Can Incur Safety Vulnerabilities Even in Aligned LLMs: Comprehensive Analysis and Defense
Jiawen Zhang, Kejia Chen, Lipeng He, Jian Lou, Dan Li, Zunlei Feng, Mingli Song, Jian Liu, Kui Ren, Xiaohu Yang
2025-02-05
Summary
This paper examines how a method called activation approximation, used to make large language models (LLMs) faster and more efficient, can introduce safety problems even in models that have been aligned to behave safely. The researchers systematically studied these risks and proposed a defense to reduce the resulting vulnerabilities.
What's the problem?
LLMs are very powerful but require a lot of computing resources, especially in situations where devices have limited capabilities. Activation approximation is a technique used to make these models run faster, but it can unintentionally weaken their safety features, allowing them to produce harmful or unsafe responses.
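To make the idea concrete, here is a toy sketch (not taken from the paper) of one common kind of activation approximation: replacing the exact GELU activation with the widely used tanh-based approximation, as is often done when exact transcendental functions are too expensive, for example in private inference. The function names below are illustrative.

```python
import math

def gelu_exact(x):
    # Exact GELU, defined via the Gaussian CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh_approx(x):
    # Common tanh-based approximation of GELU.
    return 0.5 * x * (1.0 + math.tanh(
        math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

# The two functions agree closely on typical activation values, so utility
# barely changes -- but the small per-layer perturbations they introduce are
# exactly the kind of error the paper shows can accumulate and erode safety
# alignment.
for x in [-2.0, -0.5, 0.0, 0.5, 2.0]:
    print(f"x={x:+.1f}  exact={gelu_exact(x):+.6f}  "
          f"approx={gelu_tanh_approx(x):+.6f}")
```

Quantization and sparsification of activations are other examples of the same trade-off: each replaces the exact activation values with cheaper-to-compute or cheaper-to-communicate surrogates at the cost of a small numerical error.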
What's the solution?
The researchers conducted a detailed safety evaluation of activation approximation techniques and found that they consistently reduce safety in LLMs. Based on their findings, they developed a method called QuadA, which strengthens the safety of LLMs while still allowing them to benefit from activation approximations. QuadA is easy to implement and ensures that the models remain safe and efficient.
Why it matters?
This research is important because it highlights hidden risks in making AI models more efficient and provides a way to address these risks. By improving both the speed and safety of LLMs, this work makes them more reliable for real-world applications where safety is critical.
Abstract
Large Language Models (LLMs) have showcased remarkable capabilities across various domains. As the capabilities of LLMs evolve and their deployment scenarios expand, their deployment challenges escalate due to their sheer scale and the advanced yet complex activation designs prevalent in notable model series, such as Llama, Gemma, and Mistral. These challenges have become particularly pronounced in resource-constrained deployment scenarios, where mitigating inference efficiency bottlenecks is imperative. Among various recent efforts, activation approximation has emerged as a promising avenue for pursuing inference efficiency, sometimes considered indispensable in applications such as private inference. Despite achieving substantial speedups with minimal impact on utility, even appearing sound and practical for real-world deployment, the safety implications of activation approximations remain unclear. In this work, we fill this critical gap in LLM safety by conducting the first systematic safety evaluation of activation approximations. Our safety vetting spans seven state-of-the-art techniques across three popular categories, revealing consistent safety degradation across ten safety-aligned LLMs.