MetaSC: Test-Time Safety Specification Optimization for Language Models
Víctor Gallego
2025-02-13

Summary
This paper introduces MetaSC, a new way to make AI language models safer and more responsible while they are actually being used, without having to retrain them or change their internal weights.
What's the problem?
Current AI language models can sometimes give unsafe or harmful responses, especially when people deliberately try to trick them with so-called jailbreak prompts. The usual ways of making AI safer, such as giving it fixed rules or having it check its own answers, don't always work well enough.
What's the solution?
The researchers created MetaSC, which uses a meta-critique mechanism: a system that continuously updates the safety instructions given to the AI while it is working. It's like having a smart coach that keeps giving the AI better advice on how to stay safe and ethical as it faces different challenges. They tested this on several AI models and found it kept the AI's responses safe and honest much more reliably than the older, fixed methods.
Why does it matter?
This matters because as AI becomes more common in our daily lives, we need to make sure it's safe and trustworthy. MetaSC could help make AI assistants that are better at avoiding harmful or dishonest responses, even when people try to trick them. This could make AI more reliable for important tasks and help people trust AI systems more.
Abstract
We propose a novel dynamic safety framework that optimizes language model (LM) safety reasoning at inference time without modifying model weights. Building on recent advances in self-critique methods, our approach leverages a meta-critique mechanism that iteratively updates safety prompts, termed specifications, to drive the critique and revision process adaptively. This test-time optimization improves performance not only against adversarial jailbreak requests but also on diverse general safety-related tasks, such as avoiding moral harm or pursuing honest responses. Our empirical evaluations across several language models demonstrate that dynamically optimized safety prompts yield significantly higher safety scores compared to fixed system prompts and static self-critique defenses. Code to be released at https://github.com/vicgalle/meta-self-critique.git.
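
For intuition, the following is a minimal sketch of the test-time loop the abstract describes. It assumes a generic llm(prompt) completion helper, and the prompt templates and function names are illustrative placeholders, not the released implementation:

    # Illustrative MetaSC-style loop (a sketch; `llm` stands in for
    # any chat/completion API and should be replaced with a real call).

    def llm(prompt: str) -> str:
        """Placeholder: send `prompt` to a language model, return its text."""
        return "..."

    # The specification is an ordinary safety prompt; it is the only
    # state optimized at test time, so model weights stay untouched.
    spec = "Be helpful, harmless, and honest."

    def respond(request: str, spec: str) -> str:
        # 1. Draft a response under the current specification.
        draft = llm(f"System: {spec}\nUser: {request}\nAssistant:")
        # 2. Self-critique the draft against the specification.
        critique = llm(f"Specification: {spec}\nResponse: {draft}\n"
                       "List any violations of the specification.")
        # 3. Revise the draft to address the critique.
        return llm(f"Response: {draft}\nCritique: {critique}\n"
                   "Rewrite the response to fix the issues above.")

    def meta_critique(spec: str, request: str, response: str) -> str:
        # 4. Meta-critique: improve the specification itself so future
        #    critiques catch this kind of failure earlier.
        return llm(f"Specification: {spec}\nRequest: {request}\n"
                   f"Response: {response}\n"
                   "Propose an improved specification.")

    for request in ["example user request"]:
        answer = respond(request, spec)
        spec = meta_critique(spec, request, answer)  # test-time update

The key design point is that only the specification string changes from one request to the next; the critique and revision steps stay fixed, and the model itself is never retrained.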