
Reasoning over Boundaries: Enhancing Specification Alignment via Test-time Deliberation (2509.14760v1)

Published 18 Sep 2025 in cs.CL

Abstract: LLMs are increasingly applied in diverse real-world scenarios, each governed by bespoke behavioral and safety specifications (spec) custom-tailored by users or organizations. These spec, categorized into safety-spec and behavioral-spec, vary across scenarios and evolve with changing preferences and requirements. We formalize this challenge as specification alignment, focusing on LLMs' ability to follow dynamic, scenario-specific spec from both behavioral and safety perspectives. To address this challenge, we propose Align3, a lightweight method that employs Test-Time Deliberation (TTD) with hierarchical reflection and revision to reason over the specification boundaries. We further present SpecBench, a unified benchmark for measuring specification alignment, covering 5 scenarios, 103 spec, and 1,500 prompts. Experiments on 15 reasoning and 18 instruct models with several TTD methods, including Self-Refine, TPO, and MoreThink, yield three key findings: (i) test-time deliberation enhances specification alignment; (ii) Align3 advances the safety-helpfulness trade-off frontier with minimal overhead; (iii) SpecBench effectively reveals alignment gaps. These results highlight the potential of test-time deliberation as an effective strategy for reasoning over the real-world specification boundaries.

Summary

  • The paper introduces SpecBench and Align3, a method to enhance LLM adherence to complex behavioral and safety specifications.
  • It formalizes specification alignment as a constrained optimization problem balancing behavioral scores with safety risks during inference.
  • Empirical results show Align3 improves the Specification Alignment Rate by up to 11.89%, highlighting trade-offs and potential for practical deployment.

Enhancing Specification Alignment in LLMs via Test-Time Deliberation

Introduction and Motivation

The paper addresses the challenge of specification alignment in LLMs, defined as the ability of LLMs to adhere to dynamic, scenario-specific behavioral and safety specifications. Unlike prior work that applies uniform safety or instruction-following standards, this work formalizes the need for LLMs to reason over both behavioral-spec (content preferences, goal orientation, format) and safety-spec (adaptable safety boundaries) that are tailored to real-world scenarios and can evolve over time. The authors introduce a new benchmark, SpecBench, and a lightweight test-time deliberation (TTD) method, Align3, to systematically evaluate and improve specification alignment.

Figure 1: Overview of the proposed framework, including scenario-specific specification alignment, the construction and evaluation pipeline of SpecBench, and the integration of test-time deliberation methods such as Align3.

Formalization of Specification Alignment

The paper formalizes the specification alignment problem as a constrained optimization at inference time. Given a prompt $x$, a reasoning trace $y$, and a final response $z$, the objective is to maximize the expected behavioral score $r_{\text{beh}}(x, z)$ (the proportion of behavioral-spec satisfied) while ensuring the expected safety risk $\text{Risk}_{\text{safety}}(x, z)$ (the likelihood or severity of safety-spec violations) does not exceed a budget $\epsilon$:

$$\max_{y}\; \mathbb{E}_{x \sim \mathcal{P}_{\text{test}},\; z \sim p_{\theta}(\cdot \mid x, y)} \bigl[\, r_{\text{beh}}(x,z) \bigr] \quad \text{s.t.} \quad \mathbb{E}_{x,z}\bigl[\text{Risk}_{\text{safety}}(x,z)\bigr] \le \epsilon$$

This formulation captures the safety-behavior trade-off: maximizing helpfulness (behavioral alignment) is often at odds with minimizing risk (safety alignment), especially under ambiguous or evolving specification boundaries.
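As a rough illustration of this objective, the sketch below (not from the paper; `generate`, `behavioral_score`, and `safety_risk` are hypothetical stand-ins for the model and the spec-based judges) estimates both terms from sampled responses and checks the safety budget empirically.

```python
from statistics import mean

def evaluate_policy(prompts, generate, behavioral_score, safety_risk, epsilon=0.05):
    """Monte Carlo estimate of the constrained objective: mean behavioral score,
    subject to the mean safety risk staying within the budget epsilon.

    `generate(x)` returns a (reasoning_trace, response) pair; `behavioral_score`
    and `safety_risk` are hypothetical spec-based judge functions."""
    beh_scores, risks = [], []
    for x in prompts:
        _trace, z = generate(x)
        beh_scores.append(behavioral_score(x, z))  # fraction of behavioral-spec satisfied
        risks.append(safety_risk(x, z))            # estimated safety-spec violation risk
    objective = mean(beh_scores)
    feasible = mean(risks) <= epsilon              # the constraint E[Risk] <= epsilon
    return objective, feasible
```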

SpecBench: A Unified Benchmark for Specification Alignment

SpecBench is introduced as a comprehensive benchmark for evaluating LLMs' specification alignment. It covers five diverse scenarios (Biochemical Procedure Instruction, Child-Oriented Storytelling, Code Development Secure Operation, Personal Health Education, and Travel Itinerary Planning), each with 20+ scenario-specific specifications (split between safety and behavioral), and 1,500 prompts (including both safe and adversarial unsafe prompts).

The data curation process combines synthetic prompt generation (using LLMs to create prompts that intentionally violate specific safety-spec) and curation from existing datasets, followed by semantic and embedding-based filtering, human-in-the-loop quality control, and adversarial attack enhancement (using WildTeaming to simulate real-world jailbreak tactics).

Figure 2: Annotation interface for human evaluation, showing scenario, prompt, response, and the corresponding safety and behavioral specifications.
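The paper's summary here does not spell out the filtering implementation; a common way to realize embedding-based near-duplicate filtering is a greedy cosine-similarity pass, sketched below with a hypothetical `embed` function and an illustrative similarity threshold.

```python
import numpy as np

def dedup_by_embedding(prompts, embed, threshold=0.9):
    """Keep a prompt only if its embedding is not too similar to any already-kept one.
    `embed(text)` is a hypothetical function returning a vector representation."""
    kept, kept_vecs = [], []
    for p in prompts:
        v = np.asarray(embed(p), dtype=float)
        v = v / np.linalg.norm(v)  # unit-normalize so dot product equals cosine similarity
        if all(float(v @ k) < threshold for k in kept_vecs):
            kept.append(p)
            kept_vecs.append(v)
    return kept
```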

The evaluation protocol uses Specification Alignment Rate (SAR) as the main metric: any unsafe response scores zero, while a safe response scores in $[\alpha, 1]$, where $\alpha$ is a tunable offset reflecting the baseline reward for safety. Behavioral compliance is measured as the proportion of behavioral-spec satisfied, and safety acts as a hard constraint.
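Under that definition, the per-sample score can be read as a simple piecewise formula. The sketch below is one interpretation of the metric as described here, not the paper's exact aggregation code; the default `alpha` value is an arbitrary assumption.

```python
def sample_sar(is_safe: bool, behavioral_satisfied: int, behavioral_total: int,
               alpha: float = 0.2) -> float:
    """Per-sample Specification Alignment Rate as described above:
    0 for an unsafe response; otherwise alpha plus the remaining (1 - alpha)
    scaled by the fraction of behavioral-spec satisfied, landing in [alpha, 1]."""
    if not is_safe:
        return 0.0
    frac = behavioral_satisfied / behavioral_total if behavioral_total else 1.0
    return alpha + (1.0 - alpha) * frac

def corpus_sar(samples, alpha: float = 0.2) -> float:
    """Average per-sample scores over (is_safe, satisfied, total) tuples."""
    scores = [sample_sar(s, k, n, alpha) for s, k, n in samples]
    return sum(scores) / len(scores)
```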

Empirical Evaluation and Key Findings

The authors evaluate 33 LLMs (18 instruct, 15 reasoning) across open- and closed-source families, using both default and TTD-enhanced inference. Key findings include:

  • Performance Gaps: Most models score below 65% SAR; GPT-5-chat achieves 82.14%, outperforming GPT-4.1 by 12.94%. There is a clear scaling effect within model families.
  • Reasoning Models Outperform Instruct Models: Reasoning variants (e.g., Qwen3-32B-thinking) consistently outperform their instruct counterparts, with SAR gains up to 14.87%.
  • Safety-Behavior Trade-off: Models with explicit safety alignment (e.g., RealSafe-R1-8B, STAIR-Llama-3.1-8B-DPO-3) achieve high safety but low behavioral scores due to over-refusal, while others achieve higher behavioral scores at the cost of safety.

Figure 3: SAR performance of all LLMs across five scenarios, highlighting both overall performance and scenario-level variation.

Align3: Test-Time Deliberation for Specification Alignment

Align3 is a lightweight TTD method that decouples behavioral and safety optimization in a three-step process:

  1. Behavior Optimization: Behavioral-spec are provided to maximize helpfulness and task relevance.
  2. Safety-Guided Refinement: Safety-spec are introduced before finalizing the reasoning trace, prompting the model to remove safety risks and ensure compliance.
  3. Holistic Specification Audit: All specifications are provided for a final audit and gap-filling, ensuring the response meets both behavioral and safety requirements.

This progressive enforcement reduces safety violations and improves alignment with minimal token overhead (typically <2k tokens per sample), in contrast to multi-pass TTD methods that require orders of magnitude more compute.
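A minimal sketch of how such a staged deliberation could be orchestrated around a generic chat model is shown below; the prompt wording and the `chat` helper are assumptions for illustration, not the authors' implementation or exact prompts.

```python
def align3_deliberate(chat, task_prompt, behavioral_specs, safety_specs):
    """Three-stage test-time deliberation in the spirit of Align3 (a sketch).
    `chat(messages)` is a hypothetical helper that returns the assistant's text
    for a list of role/content messages."""
    history = [{"role": "user", "content": task_prompt
                + "\n\nBehavioral specifications:\n" + "\n".join(behavioral_specs)}]

    # 1. Behavior optimization: draft with only the behavioral-spec in view.
    draft = chat(history)

    # 2. Safety-guided refinement: introduce the safety-spec and revise the draft.
    history += [{"role": "assistant", "content": draft},
                {"role": "user", "content": "Revise your answer so it violates none of these "
                                            "safety specifications:\n" + "\n".join(safety_specs)}]
    revised = chat(history)

    # 3. Holistic specification audit: check against all spec and patch any gaps.
    history += [{"role": "assistant", "content": revised},
                {"role": "user", "content": "Audit the answer against every specification above "
                                            "and output a final version that satisfies them all."}]
    return chat(history)
```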

Empirical results show that Align3 achieves up to 11.89% SAR improvement over vanilla reasoning, with strong gains in both safety and behavioral scores, and outperforms other TTD baselines (Best-of-N, Self-Refine, TPO, MoreThink) in both effectiveness and efficiency.

Analysis: Trade-offs, Scenario Variation, and Specification Judgements

The analysis section provides several insights:

  • Safety-Behavior Trade-off: On unsafe prompts, safety scores drop significantly, and behavioral scores are also lower, confirming the intrinsic trade-off.
  • Scenario Variation: SAR is consistently lower in Code and Biochem scenarios due to stricter or more ambiguous safety boundaries, while Health and Travel are less challenging.
  • Specification Judgement Patterns: No single specification is disproportionately difficult or trivial, indicating well-balanced scenario design.

Figure 4: Specification judgements of Llama-3.1-8B-Instruct across all scenarios, showing the distribution of YES, NO, and NA labels for each specification.

Figure 5: Specification judgements of DeepSeek-R1, illustrating higher compliance rates compared to instruct models.

Human Evaluation and Evaluator Consistency

A human evaluation study demonstrates high agreement (Cohen's Kappa = 0.84) between GPT-4.1 and expert annotators, with an average SAR gap of only 6.5%. This validates the reliability of LLM-based evaluators for large-scale specification alignment assessment.
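For reference, agreement of this kind can be computed with a standard kappa implementation; the snippet below is a generic illustration rather than the paper's evaluation code, assuming parallel per-sample labels from the LLM judge and the human annotators.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical parallel judgements (e.g., 1 = spec satisfied, 0 = violated).
llm_labels = [1, 0, 1, 1, 0, 1, 1, 0]
human_labels = [1, 0, 1, 0, 0, 1, 1, 0]

kappa = cohen_kappa_score(llm_labels, human_labels)
print(f"Cohen's kappa: {kappa:.2f}")
```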

Implications and Future Directions

This work establishes a rigorous framework for evaluating and optimizing LLM alignment with scenario-specific, evolving specifications. The introduction of SpecBench and the Align3 TTD method enables:

  • Practical, scenario-aware alignment: Organizations can define and enforce custom behavioral and safety boundaries without retraining models.
  • Efficient deployment: Align3 provides substantial alignment gains with minimal inference overhead, making it suitable for real-world applications where retraining is infeasible.
  • Benchmarking and diagnosis: SpecBench exposes alignment gaps and trade-offs, guiding model selection and further research.

Future developments may include extending SpecBench to more domains, integrating richer forms of user feedback, and developing adaptive TTD strategies that dynamically adjust to specification changes or user preferences. The formalization of specification alignment as a constrained optimization problem also opens avenues for principled algorithmic advances in safe and helpful LLM deployment.

Conclusion

The paper provides a comprehensive framework for specification alignment in LLMs, introducing both a unified benchmark (SpecBench) and an efficient TTD method (Align3) that advances the safety-helpfulness Pareto frontier. The empirical results demonstrate that test-time deliberation is an effective and scalable strategy for real-world alignment, and the methodology sets a new standard for scenario-aware evaluation and optimization of LLM behavior.
