
SafeRoute: Adaptive Model Selection for Efficient and Accurate Safety Guardrails in Large Language Models

Seanie Lee, Dong Bok Lee, Dominik Wagner, Minki Kang, Haebin Seong, Tobias Bocklet, Juho Lee, Sung Ju Hwang

2025-02-19


Summary

This paper introduces SafeRoute, a new system for making AI language models safer and more efficient to run. It's like having a smart traffic cop that decides which safety check each of an AI's responses needs to go through.

What's the problem?

Big AI language models need safety guards to prevent harmful outputs, but these guards can be very slow and use a lot of computing power. Smaller, faster guards exist, but they sometimes miss dangerous content that the bigger guards would catch.

What's the solution?

The researchers created SafeRoute, which acts like a smart filter. It quickly checks each AI response and decides whether it needs the big, thorough safety check or if the smaller, faster check is enough. This way, only the tricky or potentially dangerous responses get the full safety treatment, while the safe, easy ones can be processed quickly.
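To make the idea concrete, here is a minimal Python sketch of what such a router-based safety check might look like. This is an illustration, not the authors' code: `router`, `small_guard`, and `large_guard` are hypothetical callables standing in for SafeRoute's learned binary router, a small distilled safety classifier, and a large safety guard model.

```python
from typing import Callable

def classify_safety(
    text: str,
    router: Callable[[str], float],      # returns estimated P(example is "hard")
    small_guard: Callable[[str], bool],  # fast, cheap safety check
    large_guard: Callable[[str], bool],  # slow, accurate safety check
    hard_threshold: float = 0.5,
) -> bool:
    """Return True if `text` is flagged as harmful, routing by difficulty."""
    if router(text) >= hard_threshold:
        # "Hard" example: spend the compute on the large, accurate guard.
        return large_guard(text)
    # "Easy" example: trust the small, fast guard.
    return small_guard(text)

# Hypothetical usage with stub models:
if __name__ == "__main__":
    harmful = classify_safety(
        "How do I reset my router password?",
        router=lambda t: 0.1,         # stub: router thinks this one is easy
        small_guard=lambda t: False,  # stub: small guard says safe
        large_guard=lambda t: False,  # stub: large guard says safe
    )
    print("harmful:", harmful)
```

Because most inputs fall on the "easy" path, the expensive model only runs on the small fraction of inputs the router flags, which is where the speedup comes from.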

Why it matters?

This matters because it makes AI systems safer without slowing them down too much. It's like having the best of both worlds - the safety of strict checks and the speed of lighter ones. This could help make AI chatbots and assistants safer to use in real-world situations, like customer service or online help, without making them frustratingly slow.

Abstract

Deploying large language models (LLMs) in real-world applications requires robust safety guard models to detect and block harmful user prompts. While large safety guard models achieve strong performance, their computational cost is substantial. To mitigate this, smaller distilled models are used, but they often underperform on "hard" examples where the larger model provides accurate predictions. We observe that many inputs can be reliably handled by the smaller model, while only a small fraction require the larger model's capacity. Motivated by this, we propose SafeRoute, a binary router that distinguishes hard examples from easy ones. Our method selectively applies the larger safety guard model to the data that the router considers hard, improving efficiency while maintaining accuracy compared to solely using the larger safety guard model. Experimental results on multiple benchmark datasets demonstrate that our adaptive model selection significantly enhances the trade-off between computational cost and safety performance, outperforming relevant baselines.
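The abstract defines hard examples as those where the smaller model underperforms while the larger model predicts correctly. Under that reading, a hedged sketch of how binary training labels for the router might be derived looks like this; the function and variable names are illustrative, not taken from the paper's code.

```python
def router_label(small_pred: bool, large_pred: bool, gold: bool) -> int:
    """1 = hard (route to the large guard), 0 = easy (small guard suffices)."""
    small_correct = small_pred == gold
    large_correct = large_pred == gold
    # Only examples the small model misses but the large model catches
    # justify paying for the large model's extra compute.
    return int((not small_correct) and large_correct)

# Example: the small guard misses a harmful prompt that the large guard catches.
print(router_label(small_pred=False, large_pred=True, gold=True))  # -> 1
```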