AgentDoG-Qwen3-4B: Diagnostic Guardrail Model
- AgentDoG-Qwen3-4B is a diagnostic guardrail model designed for fine-grained contextual monitoring and risk diagnosis in autonomous AI agents.
- It leverages a dense, 4-billion-parameter Qwen3-4B Transformer architecture with an 8K token context window to perform trajectory-level safety classification and detailed risk attribution.
- The model demonstrates state-of-the-art performance on the ATBench benchmark by achieving 92.8% accuracy and incorporates explainable AI techniques for transparent risk diagnosis.
AgentDoG-Qwen3-4B is a diagnostic guardrail model designed for fine-grained contextual monitoring and risk diagnosis in the domain of autonomous AI agents. It represents the 4-billion-parameter Qwen3-based variant of the AgentDoG framework, purpose-built for trajectory-level safety moderation, root-cause risk attribution, and explainability in agentic environments. AgentDoG-Qwen3-4B extends standard supervised fine-tuning (SFT) with an integrated explainable AI (XAI) module and demonstrates state-of-the-art performance on agentic safety benchmarks, notably ATBench (Liu et al., 26 Jan 2026).
1. Model Architecture
AgentDoG-Qwen3-4B is based on the Qwen3-4B-Instruct-2507 backbone, a dense decoder-only Transformer with approximately 4 billion parameters, supporting an 8 K token context window. The architecture comprises 32 transformer layers with a hidden dimension of 4096, 32 self-attention heads, and a 16 384-dimensional feedforward network. The model employs rotary positional embeddings and relative positional encoding, inheriting the pretraining and architectural conventions of the Qwen3-4B series (Yang et al., 14 May 2025).
2. Data Sources, Pretraining, and Fine-Tuning
Pretraining follows the Qwen3 family recipe: training on a mixture of web text (CommonCrawl, Wikipedia), public code (GitHub), multilingual corpora, and instruction-tuning dialogues, with byte-pair encoding (BPE) tokenization and a vocabulary size of approximately 64,000. The pretraining objective is standard causal language modeling (CLM) targeting next-token prediction.
AgentDoG-specific supervised fine-tuning leverages 100,000 synthesized agent trajectories, programmatically generated to span a comprehensive risk taxonomy (8 risk sources × 14 failure modes × 10 harm types). These trajectories, incorporating tool-augmented, multi-turn agent interactions using a library of over 10,000 tools, undergo rigorous quality validation via structural rules and LLM-based judges (52% pass rate). The fine-tuning prompt template is dual-task: (1) trajectory-level binary safety classification (safe/unsafe), (2) granular risk diagnosis (source, failure mode, harm type). The supervised loss is: with learning rate and deterministic inference (temperature $0$) (Liu et al., 26 Jan 2026).
3. Three-Dimensional Risk Taxonomy and Diagnostic Protocol
AgentDoG employs a structured taxonomy with three orthogonal label sets:
- : Eight categories of risk source,
- : Fourteen failure modes,
- : Ten real-world harm types.
Each input agent trajectory , consisting of action-observation pairs, receives both a binary safety verdict: and, in the case of unsafe trajectories, a fine-grained diagnosis tuple: This enables systematic annotation and evaluation of agentic risk and harm subtypes (Liu et al., 26 Jan 2026).
4. ATBench Benchmark and Evaluation Pipeline
ATBench, the corresponding benchmark, consists of 500 held-out trajectories averaging nine turns each (250 safe, 250 unsafe), with 2,292 previously unseen tools ensuring generalization. The annotation workflow includes: (1) quality scoring (with trajectories scoring below 3/5 filtered out), (2) multi-model labeling by QwQ-32B, GPT-5.2, Gemini 3 Pro, and DeepSeek V3.2 (majority-vote adjudication for both binary and taxonomy labels), and (3) human verification (exhaustive on "hard" cases). Metrics reported are accuracy, precision, recall, and F1 at the trajectory level, and per-dimension accuracy for risk diagnosis.
| Model | ATBench Acc | P | R | F1 |
|---|---|---|---|---|
| Qwen3-4B-Instruct | 61.6 | 64.1 | 52.8 | 57.9 |
| AgentDoG-Qwen3-4B | 92.8 | 90.5 | 95.6 | 93.0 |
| Model | Risk Source Acc | Failure Mode Acc | Harm Acc |
|---|---|---|---|
| Gemini-3 Pro | 36.8 | 17.6 | 32.0 |
| Qwen3-235B | 19.6 | 17.2 | 38.0 |
| AgentDoG-Qwen3-4B | 82.0 | 32.4 | 58.4 |
These results establish AgentDoG-Qwen3-4B as a leading model for agentic safety monitoring (Liu et al., 26 Jan 2026).
5. Diagnostic Guardrail Methods and Explainability Features
AgentDoG extends supervised safety classification with provenance-aware risk diagnosis using agentic XAI (explainable AI) techniques. Given a full input trajectory , the model outputs for the binary verdict and provides multi-axis diagnosis labels.
For root-cause attribution:
- Trajectory-level attribution quantifies each step's incremental effect on the unsafe action as
- Within-step sentence attribution evaluates the influence of sub-spans using , , and their sum, yielding a per-sentence XAI provenance score.
Outputs comprise binary verdicts, taxonomy labels, and highlighted trajectory steps or sentences that drive unsafe or unreasonable behavior. This structured, transparent diagnosis clarifies exactly which segment of agent reasoning or action precipitated risk, supporting effective post-hoc alignment.
6. Transparency, Alignment, Limitations, and Future Work
AgentDoG-Qwen3-4B advances agentic alignment by generating not only binary labels but also granular, explainable risk attribution across the provenance chain from initial query, tool invocation, reasoning steps, to the ultimate action taken. Diagnostic outputs are intended to support RL-based targeted remediation: the fine-grained diagnoses can be used as feedback signals for online or offline RL alignment.
Notable limitations include restricted input modality (text-only; no vision or GUI signals), moderate recall on rare failure modes (32% mode accuracy), and a fixed, closed taxonomy that may lag behind emergent risk phenomena. Future development directions involve: (1) integrating multimodal guardrails (e.g., visual context), (2) leveraging diagnostic feedback in RL fine-tuning, and (3) dynamically extending the risk taxonomy through continual learning mechanisms.
The model and API are publicly accessible (HuggingFace: AI45Research/agentdog; Python SDK and HTTP endpoints provided), supporting reproducible research and ongoing benchmarking (Liu et al., 26 Jan 2026).