- The paper introduces SpecBench and Align3, a method to enhance LLM adherence to complex behavioral and safety specifications.
- It formalizes specification alignment as a constrained optimization problem balancing behavioral scores with safety risks during inference.
- Empirical results show Align3 improves the Specification Alignment Rate (SAR) by up to 11.89% with minimal token overhead, highlighting both the safety-behavior trade-off and the method's practicality for deployment.
Enhancing Specification Alignment in LLMs via Test-Time Deliberation
Introduction and Motivation
The paper addresses the challenge of specification alignment in LLMs, defined as the ability of LLMs to adhere to dynamic, scenario-specific behavioral and safety specifications. Unlike prior work that applies uniform safety or instruction-following standards, this work formalizes the need for LLMs to reason over both behavioral-spec (content preferences, goal orientation, format) and safety-spec (adaptable safety boundaries) that are tailored to real-world scenarios and can evolve over time. The authors introduce a new benchmark, SpecBench, and a lightweight test-time deliberation (TTD) method, Align3, to systematically evaluate and improve specification alignment.
Figure 1: Overview of the proposed framework, including scenario-specific specification alignment, the construction and evaluation pipeline of SpecBench, and the integration of test-time deliberation methods such as Align3.
The paper formalizes the specification alignment problem as a constrained optimization at inference time. Given a prompt $x$, a reasoning trace $y$, and a final response $z$, the objective is to maximize the expected behavioral score $r_{\text{beh}}(x,z)$ (the proportion of behavioral-spec satisfied) while ensuring the expected safety risk $\text{Risk}_{\text{safety}}(x,z)$ (the likelihood or severity of safety-spec violations) does not exceed a budget $\epsilon$:

$$\max_{y}\;\mathbb{E}_{x \sim P_{\text{test}},\, z \sim p_{\theta}(\cdot \mid x, y)}\big[r_{\text{beh}}(x,z)\big] \quad \text{s.t.} \quad \mathbb{E}_{x,z}\big[\text{Risk}_{\text{safety}}(x,z)\big] \le \epsilon$$
This formulation captures the safety-behavior trade-off: maximizing helpfulness (behavioral alignment) is often at odds with minimizing risk (safety alignment), especially under ambiguous or evolving specification boundaries.
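To make the formulation concrete, here is a minimal sketch (not from the paper) of applying the constrained objective at inference time over a pool of sampled candidate responses: keep only candidates whose estimated safety risk stays within the budget, then pick the one with the highest behavioral score. The scorer callables and the default budget value are hypothetical.

```python
from typing import Callable, Optional, Sequence

def select_response(
    candidates: Sequence[str],
    r_beh: Callable[[str], float],        # behavioral score in [0, 1] (assumed scorer)
    risk_safety: Callable[[str], float],  # estimated safety risk (assumed scorer)
    eps: float = 0.05,                    # safety budget epsilon (illustrative value)
) -> Optional[str]:
    """Maximize behavioral score subject to the safety-risk budget.

    Mirrors the constrained objective for a single prompt:
    max r_beh(z) s.t. risk_safety(z) <= eps, over sampled candidates.
    """
    feasible = [z for z in candidates if risk_safety(z) <= eps]
    if not feasible:
        return None  # no candidate satisfies the safety constraint
    return max(feasible, key=r_beh)
```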
SpecBench: A Unified Benchmark for Specification Alignment
SpecBench is introduced as a comprehensive benchmark for evaluating LLMs' specification alignment. It covers five diverse scenarios (Biochemical Procedure Instruction, Child-Oriented Storytelling, Code Development Secure Operation, Personal Health Education, and Travel Itinerary Planning), each with 20+ scenario-specific specifications (split between safety and behavioral), and 1,500 prompts (including both safe and adversarial unsafe prompts).
The data curation process combines synthetic prompt generation (using LLMs to create prompts that intentionally violate specific safety-spec) and curation from existing datasets, followed by semantic and embedding-based filtering, human-in-the-loop quality control, and adversarial attack enhancement (using WildTeaming to simulate real-world jailbreak tactics).
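The paper does not show its filtering code; as one illustrative piece of the pipeline, an embedding-based near-duplicate filter might look like the sketch below. The `embed` function and the 0.9 similarity threshold are assumptions, not reported choices.

```python
import numpy as np

def dedup_by_embedding(prompts: list[str], embed, threshold: float = 0.9) -> list[str]:
    """Drop prompts whose embedding is too close to one already kept.

    `embed` maps a string to a 1-D vector; the cosine-similarity
    threshold is a hypothetical choice, not a value from the paper.
    """
    kept, kept_vecs = [], []
    for p in prompts:
        v = embed(p)
        v = v / np.linalg.norm(v)  # unit-normalize so dot product = cosine similarity
        if all(float(v @ u) < threshold for u in kept_vecs):
            kept.append(p)
            kept_vecs.append(v)
    return kept
```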
Figure 2: Annotation interface for human evaluation, showing scenario, prompt, response, and the corresponding safety and behavioral specifications.
The evaluation protocol uses Specification Alignment Rate (SAR) as the main metric, which assigns a score of zero to any unsafe response and a value in [α,1] to safe responses, where α is a tunable offset reflecting the baseline reward for safety. Behavioral compliance is measured as the proportion of behavioral-spec satisfied, and safety is a hard constraint.
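A minimal per-response scoring sketch consistent with this description, assuming safe responses are mapped linearly from behavioral compliance onto [α, 1] (the linear mapping and the α default are an illustrative reading, not quoted from the paper):

```python
def sar_score(is_safe: bool, beh_satisfied: int, beh_total: int, alpha: float = 0.5) -> float:
    """Per-response Specification Alignment Rate contribution.

    Unsafe responses score 0; safe responses score in [alpha, 1],
    interpolated by the fraction of behavioral-spec satisfied.
    The linear interpolation and alpha=0.5 are illustrative assumptions.
    """
    if not is_safe:
        return 0.0
    r_beh = beh_satisfied / beh_total if beh_total else 1.0
    return alpha + (1.0 - alpha) * r_beh
```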
Empirical Evaluation and Key Findings
The authors evaluate 33 LLMs (18 instruct, 15 reasoning) across open- and closed-source families, using both default and TTD-enhanced inference. The key findings from this sweep, including the safety-behavior trade-off and scenario-level variation, are detailed in the Analysis section below.
Align3: Test-Time Deliberation for Specification Alignment
Align3 is a lightweight TTD method that decouples behavioral and safety optimization in a three-step process (see the code sketch after this list):
- Behavior Optimization: Behavioral-spec are provided to maximize helpfulness and task relevance.
- Safety-Guided Refinement: Safety-spec are introduced before finalizing the reasoning trace, prompting the model to remove safety risks and ensure compliance.
- Holistic Specification Audit: All specifications are provided for a final audit and gap-filling, ensuring the response meets both behavioral and safety requirements.
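A schematic sketch of this three-step flow follows. The prompt wording and the `llm_generate(system, user)` helper are hypothetical, and the sketch uses three separate generation calls, whereas the paper's method intervenes within a single reasoning trace:

```python
# Illustrative Align3-style pipeline. The prompt text and the
# `llm_generate(system, user) -> str` helper are hypothetical; only the
# three-stage structure follows the paper's description.

def align3(prompt: str, behavioral_spec: str, safety_spec: str, llm_generate) -> str:
    # Step 1: Behavior optimization -- draft under behavioral-spec only.
    draft = llm_generate(
        system=f"Follow these behavioral specifications:\n{behavioral_spec}",
        user=prompt,
    )
    # Step 2: Safety-guided refinement -- revise the draft against safety-spec.
    refined = llm_generate(
        system=f"Revise the draft to remove any violations of these safety "
               f"specifications:\n{safety_spec}",
        user=f"Prompt: {prompt}\nDraft: {draft}",
    )
    # Step 3: Holistic specification audit -- final check against all specs.
    final = llm_generate(
        system=f"Audit the response against ALL specifications and fix any gaps:\n"
               f"{behavioral_spec}\n{safety_spec}",
        user=f"Prompt: {prompt}\nResponse: {refined}",
    )
    return final
```

The design choice the sketch preserves is the ordering: behavioral optimization first, safety refinement second, and a holistic audit last, so safety constraints are enforced before the response is finalized.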
This progressive enforcement reduces safety violations and improves alignment with minimal token overhead (typically <2k tokens per sample), in contrast to multi-pass TTD methods that require orders of magnitude more compute.
Empirical results show that Align3 achieves up to 11.89% SAR improvement over vanilla reasoning, with strong gains in both safety and behavioral scores, and outperforms other TTD baselines (Best-of-N, Self-Refine, TPO, MoreThink) in both effectiveness and efficiency.
Analysis: Trade-offs, Scenario Variation, and Specification Judgements
The analysis section provides several insights:
- Safety-Behavior Trade-off: On unsafe prompts, safety scores drop significantly, and behavioral scores are also lower, confirming the intrinsic trade-off.
- Scenario Variation: SAR is consistently lower in Code and Biochem scenarios due to stricter or more ambiguous safety boundaries, while Health and Travel are less challenging.
- Specification Judgement Patterns: No single specification is disproportionately difficult or trivial, indicating well-balanced scenario design.

Figure 4: Specification judgements of Llama-3.1-8B-Instruct across all scenarios, showing the distribution of YES, NO, and NA labels for each specification.
Figure 5: Specification judgements of DeepSeek-R1, illustrating higher compliance rates compared to instruct models.
Human Evaluation and Evaluator Consistency
A human evaluation study demonstrates high agreement (Cohen's Kappa = 0.84) between GPT-4.1 and expert annotators, with an average SAR gap of only 6.5%. This validates the reliability of LLM-based evaluators for large-scale specification alignment assessment.
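For readers reproducing such an agreement analysis, Cohen's Kappa between two annotators' label sequences can be computed with scikit-learn; the label lists below are placeholders, not the paper's data:

```python
from sklearn.metrics import cohen_kappa_score

# Placeholder per-specification judgements (1 = satisfied, 0 = violated);
# the paper's actual annotations are not reproduced here.
gpt41_labels = [1, 0, 1, 1, 0, 1]
human_labels = [1, 0, 1, 0, 0, 1]

kappa = cohen_kappa_score(gpt41_labels, human_labels)
print(f"Cohen's kappa: {kappa:.2f}")
```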
Implications and Future Directions
This work establishes a rigorous framework for evaluating and optimizing LLM alignment with scenario-specific, evolving specifications. The introduction of SpecBench and the Align3 TTD method enables:
- Practical, scenario-aware alignment: Organizations can define and enforce custom behavioral and safety boundaries without retraining models.
- Efficient deployment: Align3 provides substantial alignment gains with minimal inference overhead, making it suitable for real-world applications where retraining is infeasible.
- Benchmarking and diagnosis: SpecBench exposes alignment gaps and trade-offs, guiding model selection and further research.
Future developments may include extending SpecBench to more domains, integrating richer forms of user feedback, and developing adaptive TTD strategies that dynamically adjust to specification changes or user preferences. The formalization of specification alignment as a constrained optimization problem also opens avenues for principled algorithmic advances in safe and helpful LLM deployment.
Conclusion
The paper provides a comprehensive framework for specification alignment in LLMs, introducing both a unified benchmark (SpecBench) and an efficient TTD method (Align3) that advances the safety-helpfulness Pareto frontier. The empirical results demonstrate that test-time deliberation is an effective and scalable strategy for real-world alignment, and the methodology sets a new standard for scenario-aware evaluation and optimization of LLM behavior.