Using Mechanistic Interpretability to Craft Adversarial Attacks against Large Language Models

Thomas Winninger, Boussad Addad, Katarzyna Kapusta

2025-03-18

Summary

This paper introduces a white-box method that uses mechanistic interpretability to craft adversarial inputs (jailbreaks) against large language models, achieving high attack success rates on state-of-the-art models such as Gemma2, Llama3.2, and Qwen2.5 in minutes or even seconds.

What's the problem?

Traditional white-box attacks on LLMs rely only on gradient computation from the targeted model and ignore the internal mechanisms that determine whether an attack succeeds or fails. Conversely, interpretability studies that do analyze those internal mechanisms have had few practical applications beyond runtime interventions. The paper addresses this gap between attack practice and mechanistic understanding.

What's the solution?

The authors first identify acceptance subspaces, sets of feature vectors that do not trigger the model's refusal mechanisms. They then use gradient-based optimization to reroute embeddings from refusal subspaces into acceptance subspaces, effectively achieving jailbreaks. Because the optimization targets these subspaces directly rather than relying on raw gradients alone, it is far cheaper than existing techniques, reaching attack success rates of 80-95% within minutes or seconds where prior methods often fail or require hours of computation.

Why it matters?

This work matters for two reasons. First, it shows that understanding a model's internal mechanisms makes adversarial attacks dramatically more efficient, which is essential knowledge for anyone building defenses against jailbreaks. Second, it demonstrates a concrete, practical application of mechanistic interpretability in a setting where other methods are less efficient, opening a new direction for both attack research and defense development.

Abstract

Traditional white-box methods for creating adversarial perturbations against LLMs typically rely only on gradient computation from the targeted model, ignoring the internal mechanisms responsible for attack success or failure. Conversely, interpretability studies that analyze these internal mechanisms lack practical applications beyond runtime interventions. We bridge this gap by introducing a novel white-box approach that leverages mechanistic interpretability techniques to craft practical adversarial inputs. Specifically, we first identify acceptance subspaces (sets of feature vectors that do not trigger the model's refusal mechanisms), then use gradient-based optimization to reroute embeddings from refusal subspaces to acceptance subspaces, effectively achieving jailbreaks. This targeted approach significantly reduces computation cost, achieving attack success rates of 80-95% on state-of-the-art models including Gemma2, Llama3.2, and Qwen2.5 within minutes or even seconds, compared to existing techniques that often fail or require hours of computation. We believe this approach opens a new direction for both attack research and defense development. Furthermore, it showcases a practical application of mechanistic interpretability where other methods are less efficient, which highlights its utility. The code and generated datasets are available at https://github.com/Sckathach/subspace-rerouting.
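The two-step idea in the abstract (find the directions associated with refusal, then reroute an embedding away from them) can be sketched on toy data. This is a minimal illustrative sketch, not the paper's actual code: the synthetic activations, the difference-of-means construction of the refusal direction, and the projection-based loss are all assumptions standing in for activations captured from a real model's residual stream.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # toy hidden-state dimension

# Synthetic activations standing in for a model's hidden states on
# prompts that get refused vs. prompts that get answered.
refused = rng.normal(0.0, 1.0, (100, d)) + 3.0 * np.eye(d)[0]
accepted = rng.normal(0.0, 1.0, (100, d))

# Step 1: estimate a refusal direction as the (normalized) difference
# of mean activations between the two prompt sets.
r = refused.mean(axis=0) - accepted.mean(axis=0)
r /= np.linalg.norm(r)

# Step 2: gradient-descend on the squared projection of one refused
# embedding onto r, pushing it out of the refusal subspace and toward
# the acceptance subspace. The gradient of 0.5 * (x @ r)**2 is (x @ r) * r.
x = refused[0].copy()
lr = 0.5
for _ in range(50):
    proj = x @ r
    x -= lr * proj * r

print(abs(x @ r))  # projection onto the refusal direction is now ~0
```

In the real attack, the loss is computed through the model so that the optimization updates input token embeddings (rather than the hidden state directly), and the subspaces are estimated from actual refusal-mechanism features, but the rerouting objective has this same shape.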