Controlling Language and Diffusion Models by Transporting Activations
Pau Rodriguez, Arno Blaas, Michal Klein, Luca Zappella, Nicholas Apostoloff, Marco Cuturi, Xavier Suau
2024-11-06

Summary
This paper presents a new method called Activation Transport (AcT) that controls what large generative models, such as language and image models, produce by steering their internal activations.
What's the problem?
As large generative models are deployed more widely, concerns grow about their reliability and potential misuse. These models can produce harmful or inaccurate content, a serious problem in applications such as chatbots and image generation.
What's the solution?
The authors developed AcT, a framework that uses optimal transport theory to steer the internal activations of these models toward a target behavior. This gives fine-grained control over what the model generates with negligible computational overhead. AcT can reduce toxic content, induce arbitrary concepts, and increase the truthfulness of outputs, and it applies to both language models and text-to-image diffusion models. The core idea is sketched below.
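To make the idea concrete, here is a minimal sketch of linear activation transport for a single layer. It assumes per-unit independence and equal numbers of source and target samples; the function names (fit_linear_ot_map, transport) and the least-squares affine fit are illustrative simplifications, not the authors' released implementation.

```python
import numpy as np

def fit_linear_ot_map(src, tgt):
    """Fit a per-unit affine map a*x + b approximating the 1D optimal
    transport map between source and target activation samples.

    src, tgt: arrays of shape (n_samples, n_units), collected from the
    model on source-behavior and target-behavior prompts respectively.
    Returns per-unit slopes a and intercepts b, each of shape (n_units,).
    """
    # The 1D OT map between two equal-size empirical distributions is the
    # monotone (sorted-to-sorted) assignment, so pair the order statistics.
    xs = np.sort(src, axis=0)
    ys = np.sort(tgt, axis=0)
    # Per-unit least-squares affine fit: minimize ||a*xs + b - ys||^2.
    x_mean, y_mean = xs.mean(0), ys.mean(0)
    a = ((xs - x_mean) * (ys - y_mean)).mean(0) / (xs.var(0) + 1e-8)
    b = y_mean - a * x_mean
    return a, b

def transport(acts, a, b, lam=1.0):
    """Steer activations toward the target distribution.

    lam in [0, 1] interpolates between no steering (0) and the full
    transport map (1), giving fine-grained control over strength.
    """
    return (1.0 - lam) * acts + lam * (a * acts + b)
```

The interpolation strength lam is what provides the fine-grained control mentioned above: small values gently nudge generations toward the target concept, while lam = 1 applies the estimated transport map in full.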
Why it matters?
This work matters because it improves the safety and controllability of generative AI systems. By steering how these models produce content, it can help prevent harmful outputs and encourage more accurate ones, making the models safer for users across applications.
Abstract
The increasing capabilities of large generative models and their ever more widespread deployment have raised concerns about their reliability, safety, and potential misuse. To address these issues, recent works have proposed to control model generation by steering model activations in order to effectively induce or prevent the emergence of concepts or behaviors in the generated output. In this paper we introduce Activation Transport (AcT), a general framework to steer activations guided by optimal transport theory that generalizes many previous activation-steering works. AcT is modality-agnostic and provides fine-grained control over the model behavior with negligible computational overhead, while minimally impacting model abilities. We experimentally show the effectiveness and versatility of our approach by addressing key challenges in large language models (LLMs) and text-to-image diffusion models (T2Is). For LLMs, we show that AcT can effectively mitigate toxicity, induce arbitrary concepts, and increase their truthfulness. In T2Is, we show how AcT enables fine-grained style control and concept negation.
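As a usage illustration, such a fitted map could be applied at inference time through a forward hook on a transformer sublayer. The model (gpt2), the hooked block, the placeholder map, and the strength below are arbitrary choices for demonstration, not the paper's configuration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tok = AutoTokenizer.from_pretrained("gpt2")

# Placeholder per-unit map (identity); in practice a and b would come from
# fit_linear_ot_map run on activations collected for two prompt sets.
a = torch.ones(768)
b = torch.zeros(768)
lam = 0.8  # steering strength in [0, 1]

def steering_hook(module, inputs, output):
    # output: (batch, seq_len, 768); broadcast the per-unit affine map.
    return (1.0 - lam) * output + lam * (a * output + b)

# Hook one MLP sublayer; returning a tensor from the hook replaces its output.
handle = model.transformer.h[6].mlp.register_forward_hook(steering_hook)
ids = tok("The movie was", return_tensors="pt").input_ids
print(tok.decode(model.generate(ids, max_new_tokens=20)[0]))
handle.remove()  # detach the hook to restore default behavior
```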