
Reasoning over Boundaries: Enhancing Specification Alignment via Test-time Deliberation

Haoran Zhang, Yafu Li, Xuyang Hu, Dongrui Liu, Zhilin Wang, Bo Li, Yu Cheng

2025-09-19


Summary

This paper investigates how well large language models (LLMs) can follow specific instructions and safety guidelines that are different for each situation they're used in, and how those instructions can change over time.

What's the problem?

LLMs are becoming more common, but each use case – whether it's a chatbot, a writing assistant, or something else – requires its own set of rules about how the model should behave and what it should avoid saying. These rules, covering both safety and desired behavior, are unique to each situation and can also change as people's needs evolve. The core issue is making sure LLMs consistently follow these dynamic, specific instructions.

What's the solution?

The researchers developed a method called Align3, which has LLMs reason about the specifications at the moment of use (test time), rather than relying only on what was learned during training. It's a lightweight process where the model reflects on the instructions, revises its thinking, and then generates a response. They also created a new testing benchmark called SpecBench, which includes a variety of scenarios, specific instructions, and questions to thoroughly evaluate how well LLMs align with these rules.
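To make the reflect-revise-generate idea concrete, here is a minimal sketch of a test-time deliberation loop in the spirit of Align3. Everything here is an illustrative assumption, not the paper's actual implementation: the `llm` callable, the prompt wording, and the round limit are all hypothetical.

```python
def deliberate(llm, prompt, safety_specs, behavioral_specs, max_rounds=3):
    """Hypothetical sketch: draft an answer, reflect on it against the
    scenario's specs, and revise until compliant (or rounds run out).
    `llm` is assumed to be any callable mapping a prompt string to text."""
    # Step 1: produce an initial draft answer.
    draft = llm(f"Answer the user.\n\nUser: {prompt}")
    for _ in range(max_rounds):
        # Step 2: reflect — check the draft against both spec categories.
        critique = llm(
            "Check this draft against the specifications.\n"
            f"Safety specs: {safety_specs}\n"
            f"Behavioral specs: {behavioral_specs}\n"
            f"Draft: {draft}\n"
            "Reply 'OK' if compliant, otherwise list the violations."
        )
        if critique.strip() == "OK":
            break  # draft stays within the specification boundaries
        # Step 3: revise the draft to address the flagged violations.
        draft = llm(
            f"Revise the draft to fix these violations: {critique}\n"
            f"Draft: {draft}"
        )
    return draft
```

Because the loop runs entirely at inference time, it can pick up new or changed specs without retraining the model, which is the core appeal of test-time deliberation.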

Why it matters?

This work is important because it shows how to improve the reliability and safety of LLMs in real-world applications. By allowing models to carefully consider instructions at the time of use, and by providing a better way to test their alignment, we can build more trustworthy and helpful AI systems that adapt to changing needs and expectations.

Abstract

Large language models (LLMs) are increasingly applied in diverse real-world scenarios, each governed by bespoke behavioral and safety specifications (specs) custom-tailored by users or organizations. These specs, categorized into safety-specs and behavioral-specs, vary across scenarios and evolve with changing preferences and requirements. We formalize this challenge as specification alignment, focusing on LLMs' ability to follow dynamic, scenario-specific specs from both behavioral and safety perspectives. To address this challenge, we propose Align3, a lightweight method that employs Test-Time Deliberation (TTD) with hierarchical reflection and revision to reason over the specification boundaries. We further present SpecBench, a unified benchmark for measuring specification alignment, covering 5 scenarios, 103 specs, and 1,500 prompts. Experiments on 15 reasoning and 18 instruct models with several TTD methods, including Self-Refine, TPO, and MoreThink, yield three key findings: (i) test-time deliberation enhances specification alignment; (ii) Align3 advances the safety-helpfulness trade-off frontier with minimal overhead; (iii) SpecBench effectively reveals alignment gaps. These results highlight the potential of test-time deliberation as an effective strategy for reasoning over real-world specification boundaries.