Almost Surely Safe Alignment of Large Language Models at Inference-Time

Xiaotong Ji, Shyam Sundhar Ramesh, Matthieu Zimmer, Ilija Bogunovic, Jun Wang, Haitham Bou Ammar

2025-02-04

Summary

This paper introduces a new method called InferenceGuard, which helps large language models (LLMs) give safe and accurate responses without expensive retraining. It focuses on ensuring that these AI systems avoid unsafe or biased outputs while remaining useful.

What's the problem?

Even the most advanced LLMs can sometimes generate harmful or biased responses. Current methods to fix this, like retraining with reinforcement learning, are very costly and can cause the model to overfit, making it less flexible. There’s a need for a more efficient way to ensure safety during the model's use without changing its core structure.

What's the solution?

The researchers introduced InferenceGuard, a system that uses a mathematical framework called a constrained Markov decision process (CMDP) to monitor and control the safety of the model's responses in real-time. By adding a safety state that tracks constraints during response generation, InferenceGuard ensures that outputs remain safe without modifying the model’s weights. Experiments showed that it outperforms other methods in balancing safety and task performance.
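The core idea above can be sketched as a toy decoder that carries a safety state alongside generation. This is a minimal, hypothetical illustration of constraint-tracked decoding, not the paper's actual implementation: `safety_cost`, the token lists, and the budget value are made-up stand-ins (a real system would score safety with a learned critic in the model's latent space).

```python
SAFETY_BUDGET = 1.0  # maximum cumulative safety cost allowed for one response


def safety_cost(token: str) -> float:
    """Toy per-token safety cost; a real system would use a learned critic."""
    return 0.5 if token == "risky" else 0.0


def constrained_decode(candidates_per_step, budget=SAFETY_BUDGET):
    """Greedy decoding with an augmented safety state.

    The safety state tracks cumulative cost so far; any candidate token
    that would push the state past the budget is filtered out before
    selection, so the constraint holds at every step of generation.
    """
    state = 0.0  # safety state: cumulative cost accrued so far
    response = []
    for candidates in candidates_per_step:
        # Keep only tokens whose cost fits within the remaining budget.
        safe = [t for t in candidates if state + safety_cost(t) <= budget]
        if not safe:
            break  # no safe continuation exists: stop early
        token = safe[0]  # pick the first safe candidate (greedy)
        state += safety_cost(token)
        response.append(token)
    return response, state


# Example: the budget admits two "risky" tokens, then forces a safe choice.
steps = [["hello"], ["risky", "world"], ["risky"], ["risky", "done"]]
out, final_state = constrained_decode(steps)
print(out, final_state)  # ['hello', 'risky', 'risky', 'done'] 1.0
```

Because the filter is applied before every token is emitted, the cumulative cost can never exceed the budget, which mirrors (in a much simpler setting) how tracking the safety state during generation yields a guarantee without touching the model's weights.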

Why it matters?

This research is important because it provides a practical and efficient way to make AI systems safer without requiring expensive retraining. It helps prevent harmful or biased outputs while keeping the AI useful for various tasks. This advancement could make AI more reliable and trustworthy in real-world applications like customer support, education, and healthcare.

Abstract

Even highly capable large language models (LLMs) can produce biased or unsafe responses, and alignment techniques, such as RLHF, aimed at mitigating this issue, are expensive and prone to overfitting as they retrain the LLM. This paper introduces a novel inference-time alignment approach that ensures LLMs generate safe responses almost surely, i.e., with a probability approaching one. We achieve this by framing the safe generation of inference-time responses as a constrained Markov decision process within the LLM's latent space. Crucially, we augment a safety state that tracks the evolution of safety constraints and enables us to demonstrate formal safety guarantees upon solving the MDP in the latent space. Building on this foundation, we propose InferenceGuard, a practical implementation that safely aligns LLMs without modifying the model weights. Empirically, we demonstrate InferenceGuard effectively balances safety and task performance, outperforming existing inference-time alignment methods in generating safe and aligned responses.