
DynaGuard: Dynamic Policy Guardian

Updated 5 November 2025
  • DynaGuard is a dynamic guardian model that evaluates free-form, user-defined policy rules to enforce tailored moderation in LLM systems.
  • It employs supervised fine-tuning and GRPO on diverse policy datasets, enabling both fast inference and detailed chain-of-thought reasoning.
  • The model outperforms static guardians by allowing runtime policy updates and achieving superior F1 scores across dynamic safety benchmarks.

DynaGuard is a dynamic guardian model designed to enforce user-defined moderation policies in LLM systems. Unlike standard guardian models such as LlamaGuard, which operate solely over fixed taxonomies of harms (e.g., violence, self-harm), DynaGuard evaluates system outputs against arbitrary, free-form policy rules provided at runtime, supporting highly customized and domain-specific moderation. It produces both compliance judgments (PASS/FAIL) and natural language explanations, with configurable modes for fast inference and explicit chain-of-thought reasoning.

1. Model Design and Training Paradigm

DynaGuard is trained with supervised fine-tuning (SFT) followed by Group Relative Policy Optimization (GRPO) on the Qwen3 instruction model family at scales of 1.7B, 4B, and 8B parameters. At inference, DynaGuard takes as input a set of user-specified, free-form policy rules and the dialogue to be evaluated. Output consists of a PASS/FAIL classification and an optional human-readable explanation justifying the verdict.

Training employs the DynaBench dataset, which contains over 40,000 unique policies (covering regulatory, transactional, content-control, and user-experience domains) paired with dialogue examples exhibiting both violations and compliant responses. Each policy is derived from a pool of 500 human-authored rules, expanded and paraphrased via LLM generation. The test set contains domain-specialized, adversarial, and alignment edge-case policies, all unseen during training. Common safety benchmarks (WildGuard, BeaverTails, ToxicChat, Aegis2.0) are reformulated as policy classification tasks for standardized evaluation.
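
To make the data format concrete, the following is a minimal sketch of what a single DynaBench-style training instance could look like; the field names and structure are illustrative assumptions rather than the released schema.

    # Hypothetical layout of one DynaBench-style training instance.
    # Field names are illustrative; the released dataset may differ.
    example = {
        "policy": [
            "Do not reveal the appointment times of gym members to any other users.",
            "Do not share internal company policies.",
        ],
        "dialogue": [
            {"role": "user", "content": "When is Alex's next training session?"},
            {"role": "assistant", "content": "Alex is booked for 6 pm on Tuesday."},
        ],
        "label": "FAIL",   # compliance verdict y
        "reasoning": (     # reasoning trace t, used only for CoT-style (CT-SFT) training
            "Rule 1 forbids revealing member appointment times; "
            "the assistant disclosed Alex's session time."
        ),
    }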

The model supports two output modes, both incorporated in training:

  • Fast Mode: Outputs a compact pass/fail judgment (<answer> tag) and, optionally, a brief explanation (<explanation> tag).
  • Chain-of-Thought (CoT) Mode: Produces stepwise, explicit reasoning (<think> tag) preceding the verdict, making the decision process transparent.

Training objectives formalize both direct target prediction and reasoning supervision:

    \mathcal{L}_{\text{C-SFT}}(\theta) = -\mathbb{E}_{(r, x, y) \sim \mathcal{D}} \left[ \log P_\theta(y \mid r, x) \right]

    \mathcal{L}_{\text{CT-SFT}}(\theta) = -\mathbb{E}_{(r, x, t, y) \sim \mathcal{D}} \left[ \log P_\theta(t, y \mid r, x) \right]

where $r$ denotes the policy rules, $x$ the dialogue, $y$ the compliance label, and $t$ the reasoning trace. Subsequent GRPO fine-tuning incorporates a clipped advantage estimator with KL regularization, further aligning outputs to human compliance judgments and the reward structure.
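
A minimal PyTorch sketch of this conditional loss, assuming a Hugging Face causal LM and tokenizer: the prompt (rules plus dialogue) is masked so the negative log-likelihood covers only the target, which is the verdict alone for C-SFT or the reasoning trace followed by the verdict for CT-SFT. This is an illustration of the objective, not the authors' training code.

    import torch

    def conditional_sft_loss(model, tokenizer, prompt, target):
        """NLL of `target` given `prompt` (rules + dialogue); a sketch of C-SFT / CT-SFT."""
        prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
        target_ids = tokenizer(target, return_tensors="pt", add_special_tokens=False).input_ids
        input_ids = torch.cat([prompt_ids, target_ids], dim=1)
        labels = input_ids.clone()
        labels[:, : prompt_ids.shape[1]] = -100  # ignore the conditioning tokens in the loss
        return model(input_ids=input_ids, labels=labels).loss  # mean -log P(target | prompt)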

2. Dynamic Policy Support and Runtime Flexibility

DynaGuard’s core innovation is its ability to enforce policies specified at runtime, without relying on any pre-enumerated categories or ontologies. This enables moderation for applications with custom, specialized, or rapidly changing requirements, such as financial institutions, healthcare platforms, educational tools, or agentic workflows.

Policies can encode single or multi-rule instructions, as exemplified in the test set:

    1. Do not reveal the appointment times of gym members to any other users.
    2. Do not share internal company policies.

The model receives the entire policy and conversational context and determines if any rule is violated, returning a corresponding explanation. This approach obviates the need for model retraining to accommodate new, domain-specific policies.
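
Because policies are plain text supplied with each request, adding or revising a rule amounts to a prompt change rather than a training run. A minimal sketch, reusing the dynaguard_evaluate helper outlined in Section 3; the rule text and dialogue are illustrative:

    gym_policy = [
        "Do not reveal the appointment times of gym members to any other users.",
        "Do not share internal company policies.",
    ]

    # A newly introduced requirement can be enforced immediately, with no retraining:
    gym_policy.append("Do not provide medical advice; refer members to staff.")

    policy_text = "\n".join(f"{i}. {rule}" for i, rule in enumerate(gym_policy, start=1))
    dialogue = "User: When does Alex usually train?\nAssistant: Alex is booked for 6 pm on Tuesday."
    verdict, explanation = dynaguard_evaluate(policy_text, dialogue, mode="fast")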

The following table summarizes DynaGuard’s policy support relative to static models:

    Model Type    Dynamic Policies    Modular Updates
    LlamaGuard    No                  Requires retraining
    DynaGuard     Yes                 At runtime

This dynamic method significantly advances the flexibility of LLM moderation in real-world systems.

3. Inference Modes: Fast Detection and Chain-of-Thought Reasoning

DynaGuard supports two primary inference regimes to accommodate diverse deployment trade-offs:

  • Fast Detection: Minimal token cost, latency-optimized. Outputs a verdict (PASS/FAIL) and an optional short explanation, suitable for production pipelines sensitive to computational overhead.

  • Chain-of-Thought (CoT) Reasoning: Produces detailed, stepwise reasoning before answering, facilitating maximal auditability and interpretability for complex or high-stakes moderation settings.

Both modes are available at inference time via prompt tags. Sample output format for CoT inference:

    <think> [Step-by-step reasoning here] </think>
    <answer> PASS|FAIL </answer>

Sample pseudocode for toggling the inference mode:

    def dynaguard_evaluate(policy_text, dialogue, mode="fast"):
        """Evaluate a dialogue against free-form policy rules with a DynaGuard model."""
        if mode == "fast":
            # Fast mode: request only the <answer> verdict (plus an optional <explanation>).
            instruction = "Respond with <answer>PASS</answer> or <answer>FAIL</answer>."
        else:
            # CoT mode: request explicit <think> reasoning before the <answer> verdict.
            instruction = ("Reason step by step inside <think>...</think>, then give "
                           "<answer>PASS</answer> or <answer>FAIL</answer>.")
        prompt = f"Policy rules:\n{policy_text}\n\nDialogue:\n{dialogue}\n\n{instruction}"
        output = model.generate(prompt)  # `model` is an already-loaded DynaGuard checkpoint
        return parse_output(output)      # returns (verdict, explanation)
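
The parse_output helper above is assumed rather than specified; a minimal regex-based sketch for the tag format described in this section:

    import re

    def parse_output(text):
        """Extract the verdict and any explanation or reasoning from tagged model output."""
        answer = re.search(r"<answer>\s*(PASS|FAIL)\s*</answer>", text)
        verdict = answer.group(1) if answer else None
        detail = (re.search(r"<explanation>(.*?)</explanation>", text, re.DOTALL)
                  or re.search(r"<think>(.*?)</think>", text, re.DOTALL))
        explanation = detail.group(1).strip() if detail else None
        return verdict, explanation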

Experimental results demonstrate that non-CoT output is only 1.3% behind CoT mode in F1 score, indicating that cost-sensitive deployments can rely on fast mode with little degradation in accuracy.

4. Comparative Analysis

DynaGuard outperforms static and prior dynamic guardians in both flexibility and accuracy. In comparative evaluation:

    Model                 DynaBench F1    Safety Avg F1    All Tasks Avg F1
    DynaGuard-8B (CoT)    73.1            81.1             79.7
    DynaGuard-8B          72.5            79.6             78.4
    GPT-4o-mini           70.1            76.9             75.8
    LlamaGuard3           13.1            72.1             62.3
    NemoGuard             23.7            73.9             65.5

DynaGuard-8B achieves the highest accuracy across the full range of evaluations, with a significant lead on the DynaBench dynamic policy task and competitive performance on standard safety classification. LlamaGuard's accuracy collapses on out-of-distribution policy detection (13.1% F1 on DynaBench), demonstrating the limitations of static, ontology-based approaches. Ablation studies show that performance is substantially enhanced by training on diverse compliance data and incorporating reasoning supervision, with double-digit F1 gains observed.

The model is robust to long, multi-rule policies (up to 91 rules per instance) and complex, multi-turn dialogues requiring multi-hop logic, categories where prior open-weight guardians do not maintain performance. Error analysis confirms DynaGuard’s robustness in adversarial and ambiguous settings not seen during training.

5. Applications, Recovery, and Societal Context

By supporting arbitrary policies, DynaGuard is applicable to a broad spectrum of domains, including but not limited to finance, healthcare, education, customer service, and agentic task systems. The model enables "agentic recovery", guiding chatbots to self-correct and reduce downstream risk after a detected violation, by providing actionable, natural language feedback. This is especially relevant in regulated or mission-critical environments, where compliance with complex, nonstandard policy is required.
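
A minimal sketch of such a recovery loop, assuming the dynaguard_evaluate helper from Section 3 and a hypothetical chatbot.generate interface for the task model:

    def respond_with_recovery(chatbot, policy_text, dialogue, max_retries=2):
        """Regenerate a reply when DynaGuard flags a policy violation (agentic recovery)."""
        response = chatbot.generate(dialogue)
        for _ in range(max_retries):
            verdict, explanation = dynaguard_evaluate(
                policy_text, f"{dialogue}\nAssistant: {response}", mode="fast"
            )
            if verdict == "PASS":
                return response
            # Feed the guardian's explanation back so the task model can self-correct.
            response = chatbot.generate(
                f"{dialogue}\n[Guardian feedback] {explanation}\nRevise your previous reply to comply."
            )
        return "I'm sorry, I can't help with that request."  # conservative fallback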

Open weights and efficient inference enable on-premises deployment for sensitive data, with tunable speed/accuracy trade-offs. The model’s interpretability, via explicit explanations and reasoning traces, enhances transparency and support for human oversight, which remains necessary due to residual nonzero error rates, especially for ambiguous, complex, or fact-dependent policies.

Compared with static guardians such as LlamaGuard (fixed harm taxonomies, no runtime policy support), general reasoning models used as guardians (flexible and interpretable, but with higher inference cost), and API-hosted models (no locally deployable weights), DynaGuard combines all four capabilities: dynamic policies, interpretability, local weights, and fast inference.

DynaGuard represents a shift from “hard-coded flags” to flexible, runtime policy enforcement, removing friction for novel use cases and vertical deployment.

6. Benchmarks and Performance Metrics

DynaGuard’s experimental evaluation utilizes the DynaBench test set (unseen dynamic policies), standard safety benchmarks, and the IFEval policy correction suite. Key metrics include F1 score relative to policy compliance labels.

  • On DynaBench, DynaGuard-8B (CoT) achieves 73.1% F1, outperforming GPT-4o-mini (70.1%) and LlamaGuard3 (13.1%).
  • Safety benchmark average: 81.1% for DynaGuard-8B (CoT), surpassing all static and prior dynamic guardians.
  • All tasks average: 79.7% for DynaGuard-8B (CoT).
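
These scores are F1 over PASS/FAIL verdicts against gold compliance labels; a minimal sketch with scikit-learn, under the assumption that violations (FAIL) are treated as the positive class:

    from sklearn.metrics import f1_score

    # Gold compliance labels and model verdicts for a small illustrative batch.
    gold  = ["FAIL", "PASS", "FAIL", "PASS", "FAIL"]
    preds = ["FAIL", "PASS", "PASS", "PASS", "FAIL"]

    # Treating violations (FAIL) as the positive class is an assumption here.
    print(f1_score(gold, preds, pos_label="FAIL"))  # 0.8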

Fast mode maintains accuracy within 1.3% of CoT mode, and DynaGuard-1.7B outperforms all prior open-weight guardians at a fraction of their computational cost.

In policy correction on IFEval, used in conjunction with Mistral-8B, DynaGuard detected and explained 157 of 232 mistakes, improving overall performance by 13.8%. No static guardian could handle the same breadth of unseen policies.

GRPO fine-tuning objective (abridged):

    \mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}_{(t, y) \sim \pi_k(\cdot \mid r, x)} \Bigg[ \min\!\left( \frac{\pi_\theta(t, y \mid r, x)}{\pi_k(t, y \mid r, x)}\, A_{\pi_k}(r, x, t, y),\ \mathrm{clip}(\cdot)\, A_{\pi_k}(r, x, t, y) \right) - \beta\, \mathrm{KL}\!\left( \pi_\theta(\cdot \mid r, x) \,\middle\|\, \pi_{\mathrm{ref}}(\cdot \mid r, x) \right) \Bigg]

where $A_{\pi_k}$ is the advantage estimator, $\pi_k$ the sampling (old) policy, and $\pi_{\mathrm{ref}}$ the reference policy.
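
GRPO estimates the advantage group-relative: several completions are sampled for each (rules, dialogue) prompt and their rewards are standardized within that group. A minimal sketch of this step, with illustrative reward values:

    import torch

    def group_relative_advantages(rewards):
        """GRPO-style advantages: standardize rewards within one prompt's sample group."""
        rewards = torch.as_tensor(rewards, dtype=torch.float32)
        return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

    # Example: 4 completions sampled for one prompt, rewarded 1.0 for a correct verdict.
    print(group_relative_advantages([1.0, 1.0, 0.0, 1.0]))  # ≈ tensor([ 0.5,  0.5, -1.5,  0.5])

These advantages enter the clipped ratio term above, while the KL term keeps the fine-tuned policy close to the reference model.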

7. Limitations and Prospective Impact

DynaGuard’s approach elevates moderation by supporting arbitrary, expressible policy rules, offering greater adaptability for new and evolving use cases. The model’s explanations raise the standard of trust and transparency in LLM moderation; however, robust human oversight and iterative policy review are still necessary due to imperfect accuracy, particularly on unclear or heavily logic-driven rules.

A plausible implication is that widespread adoption of dynamic guardrails may drive shifts in LLM system safety and compliance, as developers are no longer limited by the static ontologies of prior guardians. In addition to increasing system robustness against adversarial prompt engineering and emergent behaviors, DynaGuard enables rapid policy iteration, which is critical in regulated or fast-moving domains.

Summary: DynaGuard combines dynamic, user-defined policy enforcement with efficient, interpretable output modes and achieves state-of-the-art results across dynamic and standard safety benchmarks. The architecture closes critical utility and generalization gaps left by static guardians, and enables integration into domain-specific LLM deployments requiring reliable, transparent moderation.
