AgentDoG-Qwen3-4B: Diagnostic Guardrail Model

Updated 4 March 2026

AgentDoG-Qwen3-4B is a diagnostic guardrail model designed for fine-grained contextual monitoring and risk diagnosis in autonomous AI agents.
It leverages a dense, 4-billion-parameter Qwen3-4B Transformer architecture with an 8K token context window to perform trajectory-level safety classification and detailed risk attribution.
It reaches a state-of-the-art accuracy of 92.8% on ATBench and incorporates explainable AI (XAI) techniques for transparent risk diagnosis.

AgentDoG-Qwen3-4B is the 4-billion-parameter, Qwen3-based variant of the AgentDoG framework, purpose-built for trajectory-level safety moderation, root-cause risk attribution, and explainability in agentic environments. It extends standard supervised fine-tuning (SFT) with an integrated explainable AI (XAI) module and demonstrates state-of-the-art performance on agentic safety benchmarks, notably ATBench (Liu et al., 26 Jan 2026).

1. Model Architecture

AgentDoG-Qwen3-4B is based on the Qwen3-4B-Instruct-2507 backbone, a dense decoder-only Transformer with approximately 4 billion parameters, supporting an 8K-token context window. According to the published Qwen3-4B configuration, the architecture comprises 36 transformer layers with a hidden dimension of 2,560, 32 query heads sharing 8 key-value heads (grouped-query attention), and a 9,728-dimensional feedforward network. The model employs rotary positional embeddings (RoPE), inheriting the pretraining and architectural conventions of the Qwen3-4B series (Yang et al., 14 May 2025).

2. Data Sources, Pretraining, and Fine-Tuning

Pretraining follows the Qwen3 family recipe: training on a mixture of web text (CommonCrawl, Wikipedia), public code (GitHub), multilingual corpora, and instruction-tuning dialogues, with byte-level byte-pair encoding (BBPE) tokenization and a vocabulary of roughly 151,000 tokens. The pretraining objective is standard causal language modeling (CLM), that is, next-token prediction.

AgentDoG-specific supervised fine-tuning uses 100,000 synthesized agent trajectories, programmatically generated to span a three-dimensional risk taxonomy (8 risk sources × 14 failure modes × 10 harm types). The trajectories are tool-augmented, multi-turn agent interactions drawing on a library of over 10,000 tools, and they are validated with structural rules and LLM-based judges (52% pass rate). The fine-tuning prompt template is dual-task: (1) trajectory-level binary safety classification (safe/unsafe) and (2) granular risk diagnosis (source, failure mode, harm type). The supervised loss is the negative log-likelihood of each target response $y_i$ given its prompt $x_i$, summed over the training set $\mathcal{D}$:

$\mathcal{L} = -\sum_{(x_i, y_i)\in\mathcal{D}} \log \pi_\theta(y_i \mid x_i)$

Training uses a learning rate of $10^{-5}$, and inference is deterministic (temperature $0$) (Liu et al., 26 Jan 2026).
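As a minimal sketch of the supervised loss above, the following assumes that the per-token probabilities of each target response under $\pi_\theta$ are already available (in practice they would come from a model forward pass). The function names and the toy numbers are illustrative assumptions and are not taken from the AgentDoG code.

```python
import math
from typing import Sequence


def response_log_prob(token_probs: Sequence[float]) -> float:
    """log pi_theta(y | x), factorised over the response tokens of one example."""
    return sum(math.log(p) for p in token_probs)


def sft_loss(dataset_token_probs: Sequence[Sequence[float]]) -> float:
    """Negative log-likelihood summed over all (x_i, y_i) pairs in the dataset."""
    return -sum(response_log_prob(probs) for probs in dataset_token_probs)


# Toy dataset (assumed values): two responses with 2 and 1 target tokens.
toy_dataset = [[0.5, 0.5], [0.25]]
toy_loss = sft_loss(toy_dataset)
```

Because the loss is a sum of token-level log-probabilities, a response that the model predicts with probability 1 at every token contributes nothing to it.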

3. Three-Dimensional Risk Taxonomy and Diagnostic Protocol

AgentDoG employs a structured taxonomy with three orthogonal label sets:

$\mathcal{L}^{\mathrm{risk}}$: eight categories of risk source,
$\mathcal{L}^{\mathrm{mode}}$: fourteen failure modes,
$\mathcal{L}^{\mathrm{harm}}$: ten real-world harm types.

Each input agent trajectory $\mathcal{T}=\{t_1, \dots, t_n\}$, consisting of action-observation pairs, receives a binary safety verdict that is unsafe exactly when at least one step is unsafe:

$y \in \{\mathrm{safe}, \mathrm{unsafe}\},\quad y = \mathrm{unsafe} \;\Longleftrightarrow\; \exists i: \mathsf{Unsafe}(t_i) = \mathrm{True}$

Unsafe trajectories additionally receive a fine-grained diagnosis tuple:

$y_{\mathrm{fine}} = (\ell^{\mathrm{risk}}, \ell^{\mathrm{mode}}, \ell^{\mathrm{harm}}) \in \mathcal{L}^{\mathrm{risk}} \times \mathcal{L}^{\mathrm{mode}} \times \mathcal{L}^{\mathrm{harm}}$

This enables systematic annotation and evaluation of agentic risk and harm subtypes (Liu et al., 26 Jan 2026).
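The verdict rule and the diagnosis tuple can be sketched as a small data structure. This is an illustration only: the source does not list the individual label names, so labels are represented here by integer indices, and the names `Diagnosis` and `trajectory_verdict` are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence

# Label-set sizes stated in the text; the labels themselves are not listed there.
N_RISK_SOURCES, N_FAILURE_MODES, N_HARM_TYPES = 8, 14, 10
N_DIAGNOSES = N_RISK_SOURCES * N_FAILURE_MODES * N_HARM_TYPES  # 1,120 possible tuples


@dataclass(frozen=True)
class Diagnosis:
    """Fine-grained label tuple (risk source, failure mode, harm type) as indices."""
    risk_source: int
    failure_mode: int
    harm_type: int

    def __post_init__(self) -> None:
        # Reject indices that fall outside the corresponding label set.
        limits = (N_RISK_SOURCES, N_FAILURE_MODES, N_HARM_TYPES)
        values = (self.risk_source, self.failure_mode, self.harm_type)
        for value, limit in zip(values, limits):
            if not 0 <= value < limit:
                raise ValueError(f"label index {value} outside [0, {limit})")


def trajectory_verdict(step_unsafe: Sequence[bool]) -> str:
    """A trajectory is unsafe if and only if at least one step is unsafe."""
    return "unsafe" if any(step_unsafe) else "safe"
```

A single unsafe step is enough to flag the whole trajectory, which is why the verdict is computed with `any` rather than a vote over steps.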

4. ATBench Benchmark and Evaluation Pipeline

ATBench, the accompanying benchmark, consists of 500 held-out trajectories (250 safe, 250 unsafe) averaging nine turns each and drawing on 2,292 previously unseen tools, so that results reflect generalization to new tools. The annotation workflow has three stages: (1) quality scoring, with trajectories scoring below 3/5 filtered out; (2) multi-model labeling by QwQ-32B, GPT-5.2, Gemini 3 Pro, and DeepSeek V3.2, with majority-vote adjudication for both binary and taxonomy labels; and (3) human verification, which is exhaustive on "hard" cases. Reported metrics are accuracy, precision, recall, and F1 at the trajectory level, and per-dimension accuracy for risk diagnosis.
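The first two annotation stages can be sketched as follows. The source does not say how ties among the four labeling models are resolved, so this sketch assumes a strict majority is required and returns `None` otherwise (for example, to route the item to human verification). The function names are hypothetical.

```python
from collections import Counter
from typing import Optional, Sequence

MIN_QUALITY = 3  # trajectories scoring below 3 (out of 5) are filtered out


def passes_quality(score: int) -> bool:
    """Stage 1: keep a trajectory only if its quality score is at least 3/5."""
    return score >= MIN_QUALITY


def majority_vote(labels: Sequence[str]) -> Optional[str]:
    """Stage 2: return the label chosen by a strict majority of annotators.

    Returns None when no label has more than half of the votes
    (assumption: such cases are escalated rather than decided here).
    """
    if not labels:
        return None
    label, count = Counter(labels).most_common(1)[0]
    return label if 2 * count > len(labels) else None
```

With four annotators, a 3-1 split yields a label while a 2-2 split does not, which is one plausible way to define the "hard" cases that need human review.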

| Model | ATBench Accuracy (%) | Precision (%) | Recall (%) | F1 (%) |
|---|---|---|---|---|
| Qwen3-4B-Instruct | 61.6 | 64.1 | 52.8 | 57.9 |
| AgentDoG-Qwen3-4B | 92.8 | 90.5 | 95.6 | 93.0 |

| Model | Risk Source Accuracy (%) | Failure Mode Accuracy (%) | Harm Type Accuracy (%) |
|---|---|---|---|
| Gemini 3 Pro | 36.8 | 17.6 | 32.0 |
| Qwen3-235B | 19.6 | 17.2 | 38.0 |
| AgentDoG-Qwen3-4B | 82.0 | 32.4 | 58.4 |

These results establish AgentDoG-Qwen3-4B as a leading model for agentic safety monitoring (Liu et al., 26 Jan 2026).
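The trajectory-level metrics can be cross-checked against the benchmark composition (250 safe, 250 unsafe). The sketch below assumes that "unsafe" is the positive class; the confusion counts are reconstructed from the reported percentages and are an inference, not figures given in the source.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Confusion:
    tp: int  # unsafe trajectories flagged unsafe
    fp: int  # safe trajectories flagged unsafe
    fn: int  # unsafe trajectories flagged safe
    tn: int  # safe trajectories flagged safe


def metrics(c: Confusion) -> Dict[str, float]:
    """Accuracy, precision, recall, and F1 in percent, rounded to one decimal."""
    total = c.tp + c.fp + c.fn + c.tn
    values = {
        "accuracy": (c.tp + c.tn) / total,
        "precision": c.tp / (c.tp + c.fp),
        "recall": c.tp / (c.tp + c.fn),
        "f1": 2 * c.tp / (2 * c.tp + c.fp + c.fn),
    }
    return {name: round(100 * value, 1) for name, value in values.items()}


# Reconstructed counts (assumption): each row has 250 unsafe and 250 safe items.
agentdog = Confusion(tp=239, fp=25, fn=11, tn=225)
baseline = Confusion(tp=132, fp=74, fn=118, tn=176)
```

Under these counts, `metrics(agentdog)` and `metrics(baseline)` reproduce the two table rows to one decimal place, so the reported accuracy, precision, recall, and F1 are mutually consistent with a balanced 500-trajectory test set.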

5. Diagnostic Guardrail Methods and Explainability Features

AgentDoG extends supervised safety classification with provenance-aware risk diagnosis using agentic XAI (explainable AI) techniques. Given a full input trajectory $x$, the model outputs the verdict distribution $\pi_\theta(y \mid x)$ for the binary decision together with the multi-axis diagnosis labels.

For root-cause attribution:

Trajectory-level attribution quantifies each step's incremental effect on the log-probability of the unsafe target action $a_{\mathrm{target}}$, where $\mathcal{T}_{\le i}$ denotes the trajectory truncated after step $i$:

$\Delta_i = \log \pi_\theta(a_{\mathrm{target}} \mid \mathcal{T}_{\le i}) - \log \pi_\theta(a_{\mathrm{target}} \mid \mathcal{T}_{\le i-1})$

Within-step sentence attribution evaluates the influence of each sentence $x_{i,j}$ within step $i$ using $\mathrm{Drop}(x_{i,j})$, $\mathrm{Hold}(x_{i,j})$, and their sum, yielding a per-sentence XAI provenance score.
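Both attribution levels can be sketched against a generic scoring function. Several things here are assumptions: `log_prob` is a hypothetical callable returning $\log \pi_\theta(a_{\mathrm{target}} \mid \text{context})$, the toy scorer stands in for a real model call, and the Drop and Hold definitions (log-probability lost when a sentence is removed, and gained when only that sentence is kept) are a guess at a common formulation, since the text does not define them.

```python
from typing import Callable, List, Sequence

# Hypothetical scorer: log-probability of the target action given a context.
LogProb = Callable[[Sequence[str]], float]


def step_attributions(steps: Sequence[str], log_prob: LogProb) -> List[float]:
    """Delta_i = log p(a | T_<=i) - log p(a | T_<=i-1) for i = 1..n."""
    deltas = []
    previous = log_prob(steps[:0])  # empty prefix, i.e. T_<=0
    for i in range(1, len(steps) + 1):
        current = log_prob(steps[:i])
        deltas.append(current - previous)
        previous = current
    return deltas


def sentence_scores(sentences: Sequence[str], log_prob: LogProb) -> List[float]:
    """Per-sentence score Drop + Hold under the ASSUMED definitions above."""
    full = log_prob(sentences)
    empty = log_prob([])
    scores = []
    for j, sentence in enumerate(sentences):
        without_j = list(sentences[:j]) + list(sentences[j + 1:])
        drop = full - log_prob(without_j)      # lost when sentence j is removed
        hold = log_prob([sentence]) - empty    # gained when only sentence j is kept
        scores.append(drop + hold)
    return scores


def toy_log_prob(context: Sequence[str]) -> float:
    """Toy stand-in for a model: certain phrases raise the target's log-probability."""
    score = -5.0
    for text in context:
        if "ignore previous instructions" in text:
            score += 3.0
        if "send" in text:
            score += 1.0
    return score
```

By construction the step scores telescope: their sum equals the log-probability given the full trajectory minus the log-probability given the empty context, so a large $\Delta_i$ marks the step that contributed most of the total shift.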

Outputs comprise binary verdicts, taxonomy labels, and highlighted trajectory steps or sentences that drive unsafe or unreasonable behavior. This structured, transparent diagnosis clarifies exactly which segment of agent reasoning or action precipitated risk, supporting effective post-hoc alignment.

6. Transparency, Alignment, Limitations, and Future Work

AgentDoG-Qwen3-4B advances agentic alignment by generating not only binary labels but also granular, explainable risk attribution across the provenance chain, from the initial query through tool invocation and reasoning steps to the final action taken. Diagnostic outputs are intended to support targeted remediation: the fine-grained diagnoses can serve as feedback signals for online or offline reinforcement learning (RL) alignment.

Notable limitations include a restricted input modality (text only; no vision or GUI signals), weak diagnosis of failure modes (32.4% failure-mode accuracy on ATBench), and a fixed, closed taxonomy that may lag behind emergent risk phenomena. Future development directions are: (1) integrating multimodal guardrails (e.g., visual context), (2) leveraging diagnostic feedback in RL fine-tuning, and (3) dynamically extending the risk taxonomy through continual learning mechanisms.

The model and API are publicly accessible (Hugging Face: AI45Research/agentdog; a Python SDK and HTTP endpoints are provided), supporting reproducible research and ongoing benchmarking (Liu et al., 26 Jan 2026).


References

Liu et al. (2026). AgentDoG: A Diagnostic Guardrail Framework for AI Agent Safety and Security.

Yang et al. (2025). Qwen3 Technical Report.
