OpenGuardrails Platform
- OpenGuardrails is an open-source, context-aware platform that safeguards AI operations with integrated safety, manipulation detection, and privacy protection features.
- It combines a unified LLM-based guard, a specialized NER and regex data-leakage pipeline, and a policy-first enforcement plane to manage LLM and tool-orchestrated workflows.
- The platform reports state-of-the-art results on published safety benchmarks and provides flexible deployment options and detailed audit logs for compliance and traceability.
OpenGuardrails is an open-source, context-aware platform designed to provide comprehensive safety, manipulation detection, and privacy protection for AI models and tool-orchestrated workflows. It delivers mathematical interpretability, layered enforcement mechanisms, policy-driven control, and state-of-the-art (SOTA) performance in both LLM and agentic tool environments. OpenGuardrails integrates a unified LLM-based guard, a lightweight data-leakage pipeline, policy enforcement for external tooling, and pluggable deployment modes, enabling controllable, explainable, and auditable safety for diverse AI-driven applications (Wang et al., 22 Oct 2025, Sigdel et al., 18 Mar 2026, Rebedea et al., 2023).
1. Architectural Components and Layered Design
OpenGuardrails features a modular, end-to-end architecture structured for fine-grained control and integration flexibility in both LLM-centric and tool-orchestrated ecosystems.
Core architectural layers include:
- Unified LLM-Based Guard Model: A 14B-parameter dense decoder-only transformer, fine-tuned on multilingual safety and manipulation data, and quantized via GPTQ to 3.3B parameters for production use. This model provides joint detection for content-safety (hate, violence, sexual content, and more) and model-manipulation attacks (prompt injection, jailbreaks, malicious code generation), supporting up to 8K token contexts for prompt, history, and response evaluation (Wang et al., 22 Oct 2025).
- NER/Redaction Data-Leakage Pipeline: Utilizes Presidio-style named-entity recognition and high-precision regex detectors to mask PII (names, SSNs, keys, etc.) or organizational secrets, operating independently but in tandem with the LLM guard (Wang et al., 22 Oct 2025).
- Deployment Layer: Supports Security Gateway Mode (transparent proxy for LLM APIs), API Service Mode (RESTful endpoints), and full privacy controls (on-premises or VPC cloud), with sample Docker Compose and Kubernetes manifests provided (Wang et al., 22 Oct 2025).
- Policy-First Enforcement Plane: In tool-using agentic workflows, every tool invocation passes through a Policy Enforcement Point (PEP) and Policy Decision Point (PDP), implementing runtime gating, output redaction, recovery controls, and logging for any script, agent, or CI bot (Sigdel et al., 18 Mar 2026).
- Audit & Telemetry: Immutable JSONL audit logs record every intercepted action, policy decision, rationale, fix hint, transformation, and outcome, supporting postmortem analysis and traceability (Sigdel et al., 18 Mar 2026).
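The regex half of the data-leakage pipeline described above can be illustrated with a minimal sketch. The detector table, the entity labels, and the `mask_text` helper are hypothetical and are not taken from the OpenGuardrails code base. The patterns are deliberately simple, and a production pipeline would pair them with NER as the text describes.

```python
import re

# Hypothetical detector table: entity label -> compiled pattern.
# These patterns are illustrative assumptions, not OpenGuardrails' own.
DETECTORS = {
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "API_KEY": re.compile(r"\bsk-[A-Za-z0-9]{16,}\b"),
}


def mask_text(text: str) -> tuple[str, list[str]]:
    """Replace every match with a <LABEL> placeholder.

    Returns the masked text and the labels that matched at least once.
    """
    hits = []
    for label, pattern in DETECTORS.items():
        text, count = pattern.subn(f"<{label}>", text)
        if count:
            hits.append(label)
    return text, hits


masked, found = mask_text(
    "Contact jane@example.com, SSN 123-45-6789, key sk-abcdefghijklmnop1234"
)
print(masked)  # Contact <EMAIL>, SSN <US_SSN>, key <API_KEY>
print(found)   # ['US_SSN', 'EMAIL', 'API_KEY']
```

Masking with typed placeholders rather than deleting the match keeps the surrounding text readable for the downstream model while still removing the sensitive value.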
2. Detection Models and Policy DSL Specification
OpenGuardrails formalizes detection and constraint logic for both LLM content and tool invocation, enabling mathematically controlled, reproducible enforcement.
Unified LLM-Based Guard Model
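- Input format: JSON payload:
    {"context": string, "policy": {categories: […], sensitivity: τ}}
- First-token classification: Vocabulary with tokens $\{t_{\text{safe}}, t_{\text{unsafe}}\}$.
- Decision function: For $z_{\text{safe}}, z_{\text{unsafe}}$ (pre-softmax logits),
$$p_{\text{unsafe}} = \frac{e^{z_{\text{unsafe}}}}{e^{z_{\text{safe}}} + e^{z_{\text{unsafe}}}}$$
Classify as unsafe if $p_{\text{unsafe}} \ge \tau$ (user-specified).
- Loss function: Standard first-token cross-entropy,
$$\mathcal{L} = -\log p_{y}, \qquad y \in \{\text{safe}, \text{unsafe}\}$$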
- Multilingual coverage: Fine-tuned on 117+ languages, including specific China-focused mixes (OpenGuardrailsMixZh_97k), with prompt- and response-level supervision (Wang et al., 22 Oct 2025).
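The decision rule can be sketched in a few lines. This assumes the guard compares a two-way softmax over the first-token logits of the safe and unsafe tokens against the sensitivity threshold τ. The function names and the example logits are illustrative only, not the model's actual interface.

```python
import math


def unsafe_probability(z_safe: float, z_unsafe: float) -> float:
    """Two-way softmax over the first-token logits of the safe/unsafe tokens."""
    m = max(z_safe, z_unsafe)  # subtract the max for numerical stability
    e_safe = math.exp(z_safe - m)
    e_unsafe = math.exp(z_unsafe - m)
    return e_unsafe / (e_safe + e_unsafe)


def classify(z_safe: float, z_unsafe: float, tau: float) -> str:
    """Flag the input as unsafe when p_unsafe reaches the threshold tau."""
    return "unsafe" if unsafe_probability(z_safe, z_unsafe) >= tau else "safe"


def first_token_loss(z_safe: float, z_unsafe: float, label: str) -> float:
    """Cross-entropy of the first token against the gold label."""
    p_unsafe = unsafe_probability(z_safe, z_unsafe)
    p_label = p_unsafe if label == "unsafe" else 1.0 - p_unsafe
    return -math.log(p_label)


# Made-up logits: the same input flips class as the threshold tightens.
p = unsafe_probability(z_safe=1.0, z_unsafe=2.0)
print(round(p, 4))              # 0.7311
print(classify(1.0, 2.0, 0.7))  # unsafe
print(classify(1.0, 2.0, 0.8))  # safe
```

Because the threshold is applied to a continuous probability, the same model can serve strict and lenient deployments without retraining; only τ changes per request.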
Policy-First DSL for Tool Workflows
The bespoke policy language governs tool invocation and output redaction:
- DSL grammar (excerpt):
    <policies> ::= <policy>*
    <policy>   ::= "policy" <string> "{" <rule>* "}"
    <rule>     ::= tool/tool_group | allow_if | deny_if | require_approval_if | budget | on_output | redact_patterns
- Risk-aware gating: Each call is assigned a risk score. A score above an approval threshold triggers REQUIRE_APPROVAL; a score above a stricter threshold triggers DENY.
- Example policy:
    policy "workspace_fs_safety" {
      tool_group: ["fs.read", "fs.write", "fs.delete"]
      allow_if: args.path starts_with context.workspace_root
      deny_if: args.path matches "(^/|\.{2}/|~)"
      require_approval_if: tool=="fs.delete" and args.recursive==true
    }
- Redaction: Patterns specified as regex for masking sensitive material in outputs.
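To make the rule semantics concrete, the sketch below evaluates the `workspace_fs_safety` example in plain Python. The source does not state the evaluation order, so several points here are assumptions: `deny_if` is checked first, a call that fails `allow_if` is denied by default, `matches` is treated as a regex search, and the workspace root is a relative prefix. `ToolCall` and `decide` are hypothetical names.

```python
import re
from dataclasses import dataclass

# Same regex as the deny_if rule: absolute paths, "../" segments, or "~".
DENY_PATTERN = re.compile(r"(^/|\.{2}/|~)")
TOOL_GROUP = {"fs.read", "fs.write", "fs.delete"}


@dataclass
class ToolCall:
    tool: str
    path: str
    recursive: bool = False


def decide(call: ToolCall, workspace_root: str) -> str:
    """Return ALLOW, DENY, or REQUIRE_APPROVAL for one file-system call."""
    if call.tool not in TOOL_GROUP:
        raise ValueError(f"policy does not cover tool {call.tool!r}")
    # Assumption: deny rules take precedence over everything else.
    if DENY_PATTERN.search(call.path):
        return "DENY"
    # Assumption: calls that do not satisfy allow_if are denied by default.
    if not call.path.startswith(workspace_root):
        return "DENY"
    if call.tool == "fs.delete" and call.recursive:
        return "REQUIRE_APPROVAL"
    return "ALLOW"


root = "workspace/"
print(decide(ToolCall("fs.read", "workspace/notes.txt"), root))          # ALLOW
print(decide(ToolCall("fs.read", "/etc/passwd"), root))                  # DENY
print(decide(ToolCall("fs.write", "workspace/../secret"), root))         # DENY
print(decide(ToolCall("fs.delete", "workspace/tmp", recursive=True), root))  # REQUIRE_APPROVAL
```

Checking the deny pattern before the prefix test matters: a path such as `workspace/../secret` passes a naive prefix check but still escapes the workspace.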
3. Enforcement Mechanisms and Runtime Controls
OpenGuardrails enforces guardrails via runtime interception, rule evaluation, risk computation, constraint checking, and recovery orchestration.
LLM-Guarded Workflows
- Proxy deployment: All LLM prompts and completions are routed through the Security Gateway or API Service, which classifies, logs, and masks unsafe or privacy-violating content in real time (Wang et al., 22 Oct 2025).
- Policy-driven dynamic sensitivity: Sensitivity thresholds and categories are set per use case (e.g., financial or creative applications).
Tool-Orchestrated Workflows
- Interception point: Every tool call is intercepted, evaluated against loaded policy packs (P0–P4) by the Policy Decision Point (PDP), with possible decisions ALLOW, DENY, REQUIRE_APPROVAL, or TRANSFORM (Sigdel et al., 18 Mar 2026).
- Recovery strategies: Configurable per tool—exponential backoff, max_retries, circuit breakers for high failure rates, fallback routines on persistent failures.
- Fix-hints: Automatic synthesis of actionable hints for remediation on DENY or RATE-LIMIT actions.
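The recovery strategies listed above can be combined in one small wrapper. The following is a sketch under stated assumptions, not the platform's implementation: `RecoveringTool` and its parameters are hypothetical names, and the circuit breaker trips on consecutive failures rather than on a measured failure rate, which is a simplification.

```python
import time


class RecoveringTool:
    """Wrap a tool callable with retries, exponential backoff,
    a simple circuit breaker, and an optional fallback."""

    def __init__(self, fn, max_retries=3, base_delay=0.1,
                 failure_threshold=5, fallback=None, sleep=time.sleep):
        self.fn = fn
        self.max_retries = max_retries
        self.base_delay = base_delay
        self.failure_threshold = failure_threshold
        self.fallback = fallback
        self.sleep = sleep  # injectable so examples and tests need not wait
        self.consecutive_failures = 0

    def _give_up(self, args, kwargs, reason):
        if self.fallback is not None:
            return self.fallback(*args, **kwargs)
        raise RuntimeError(reason)

    def call(self, *args, **kwargs):
        # Circuit open: skip the tool entirely after too many failures.
        if self.consecutive_failures >= self.failure_threshold:
            return self._give_up(args, kwargs, "circuit open")
        for attempt in range(self.max_retries + 1):
            try:
                result = self.fn(*args, **kwargs)
            except Exception:
                self.consecutive_failures += 1
                if attempt < self.max_retries:
                    # Delay doubles on every retry: base, 2*base, 4*base, ...
                    self.sleep(self.base_delay * 2 ** attempt)
            else:
                self.consecutive_failures = 0
                return result
        return self._give_up(args, kwargs, "retries exhausted")


# Usage: a tool that fails twice, then succeeds. Delays are recorded, not slept.
delays = []
attempts = {"n": 0}


def flaky_read(path):
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise IOError("transient failure")
    return f"contents of {path}"


tool = RecoveringTool(flaky_read, sleep=delays.append)
print(tool.call("notes.txt"))  # contents of notes.txt
print(delays)                  # [0.1, 0.2]
```

Bounding retries and opening the circuit are what keep retry amplification in check: without them, a persistently failing tool would be called again on every step.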
Audit and Explanation
- Decision record structure: Each audit log entry encodes trace_id, step_id, decision, triggered policy IDs, rationale, fix_hint, risk_score, applied transformations, timing overhead.
- Traceability: All decisions, including transformations and failure recovery steps, are logged for postmortem provenance and compliance.
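A decision record of this shape can be appended to a JSONL log with the standard library alone. In this sketch the field names `transformations` and `overhead_ms` are assumed spellings for the "applied transformations" and "timing overhead" fields, and `append_audit_record` is a hypothetical helper. It only appends; it does not by itself make the log immutable.

```python
import io
import json


def append_audit_record(stream, *, trace_id, step_id, decision, policy_ids,
                        rationale, fix_hint=None, risk_score=None,
                        transformations=(), overhead_ms=None):
    """Serialize one decision as a single JSON line and append it to stream."""
    record = {
        "trace_id": trace_id,
        "step_id": step_id,
        "decision": decision,
        "policy_ids": list(policy_ids),
        "rationale": list(rationale),
        "fix_hint": fix_hint,
        "risk_score": risk_score,
        "transformations": list(transformations),  # assumed field name
        "overhead_ms": overhead_ms,                # assumed field name
    }
    stream.write(json.dumps(record) + "\n")
    return record


# io.StringIO stands in for a file opened in append mode.
log = io.StringIO()
append_audit_record(
    log, trace_id="t1", step_id=3, decision="DENY",
    policy_ids=["workspace_fs_safety"],
    rationale=["recursive delete requires approval"],
    fix_hint="Use a workspace-relative path or request approval.",
)
append_audit_record(
    log, trace_id="t1", step_id=4, decision="ALLOW",
    policy_ids=["workspace_fs_safety"], rationale=[],
)

# Reading back: one JSON object per line.
records = [json.loads(line) for line in log.getvalue().splitlines()]
print([r["decision"] for r in records])  # ['DENY', 'ALLOW']
```

One object per line means a postmortem tool can stream the log and filter by `trace_id` without loading the whole file.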
4. Empirical Performance and Quantitative Results
OpenGuardrails demonstrates robust, SOTA results across LLM safety, multilingual classification, and policy-enforced tool workflows.
LLM Safety Benchmarks (Wang et al., 22 Oct 2025):
| Task | Model | F1 Score |
|---|---|---|
| English Prompt (avg) | OpenGuardrails-Text-2510 | 87.1 |
| English Response (avg) | OpenGuardrails | 88.5 |
| Chinese Prompt (avg) | OpenGuardrails | 87.4 |
| Chinese Response (avg) | OpenGuardrails | 85.2 |
| Multilingual Prompt (RTP-LX avg) | OpenGuardrails | 97.3 |
| Multilingual Response (PolyGuard-Response avg) | OpenGuardrails | 97.2 |
Resource profile: 3.3B model, ~6GB GPU VRAM, P95 latency 274.6 ms for 1K tokens on A100, throughput ~100 req/s per GPU. GPTQ quantization yields <1% F1 drop for 50% memory reduction and 2× speedup.
Tool Policy Enforcement Benchmarks (Sigdel et al., 18 Mar 2026):
| Policy Pack | Violation Prevention Rate (V) | False Block Rate (F) | Success Rate (S) | Retry Amplification (A) | Leakage Recall (L) |
|---|---|---|---|---|---|
| P0 | 0.000 | 0.000 | 0.356 | 3.774 | 0.875 |
| P3 | 0.597 | 0.067 | 0.133 | 1.689 | 0.875 |
| P4 | 0.681 | 0.067 | 0.067 | 1.378 | 0.875 |
Trade-offs: Incrementally adding argument constraints, budgets, approvals, and redaction increases violation prevention at the cost of utility. From P0 to P4, the violation prevention rate rises from 0.000 to 0.681 while the success rate falls from 0.356 to 0.067.
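The size of the trade-off can be read directly off the table. The short sketch below copies the reported values and computes the change in each metric between two policy packs. The `delta` helper is only an illustration and adds no new measurements.

```python
# Values copied from the table above, keyed by its column letters.
RESULTS = {
    "P0": {"V": 0.000, "F": 0.000, "S": 0.356, "A": 3.774, "L": 0.875},
    "P3": {"V": 0.597, "F": 0.067, "S": 0.133, "A": 1.689, "L": 0.875},
    "P4": {"V": 0.681, "F": 0.067, "S": 0.067, "A": 1.378, "L": 0.875},
}


def delta(metric: str, start: str = "P0", end: str = "P4") -> float:
    """Change in one metric between two policy packs, rounded to 3 places."""
    return round(RESULTS[end][metric] - RESULTS[start][metric], 3)


for metric in ("V", "F", "S", "A", "L"):
    print(metric, delta(metric))
```

Between P0 and P4, violation prevention gains 0.681 while success rate drops by 0.289 and retry amplification falls by 2.396. Leakage recall is unchanged at 0.875 across all three packs.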
5. Guardrail Definition Languages and Programming Interfaces
OpenGuardrails exposes formal, extensible, and interpretable configuration interfaces targeted at both LLM and non-LLM integrations.
- LLM environment: Safety policies specified per request as lists of categories and thresholds; NER and regex-based PII detection configured via pipeline modules.
- DSL and Colang: For tool workflows and conversational rails, flows and canonical forms are declared using a Python-style or JSON/YAML-style DSL, supporting flow-branching, custom actions, approval gates, and output redaction. For NeMo Guardrails, Colang flow syntax enables composition of dialogue, topical, and execution rails (Rebedea et al., 2023).
- APIs: RESTful endpoints (OpenAPI), Python clients, plug-in hooks (e.g., LangChain), and runtime or gateway servers. Integration accommodates both model-agnostic and model-specific deployments.
6. Example Usage Scenarios and Patterns
Security Gateway:
    version: '3.8'
    services:
      openguardrails:
        image: openguardrails/security-gateway:latest
        ports: [ "8080:8080" ]
        volumes: [ ./policy.yaml:/app/policy.yaml ]
policy.yaml selects the categories (e.g., ["political", "data_leakage", "violence"]) and the sensitivity. All application LLM traffic is then routed through the gateway endpoint.
Standalone REST API:
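    from openguardrails.client import GuardClient

    # Point the client at a privately hosted API Service endpoint.
    client = GuardClient(endpoint="https://my-privatedomain/api", api_key="****")

    # Same payload shape as the guard model input: context plus per-request policy.
    request = {"context": "...", "policy": {"categories": [...], "sensitivity": 0.7}}
    response = client.classify(request)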
Policy Pack for File-System Safety:
    policy "workspace_fs_safety" {
      tool_group: ["fs.read", "fs.write", "fs.delete"]
      allow_if: args.path starts_with context.workspace_root
      deny_if: args.path matches "(^/|\.{2}/|~)"
      require_approval_if: tool=="fs.delete" and args.recursive==true
    }
Audit Example:
    {"trace_id":"t1","step_id":3,"decision":"DENY",
     "policy_ids":["workspace_fs_safety"],
     "rationale":["fs.delete not under workspace_root","recursive delete requires approval"],
     "fix_hint":"Use a workspace-relative path or request approval for recursive delete."}
7. Licensing, Extensibility, and Community Integration
- Licensing: Released under Apache 2.0—includes code, model weights, pipelines, API specs, and deployment manifests (Wang et al., 22 Oct 2025).
- Ecosystem integration: Official repositories on GitHub and HuggingFace for models, pipelines, and sample deployments.
- Extensibility: Supports modular addition of custom policy packs, new detection or redaction modules, and tool adaptation. Model-agnostic and cloud/on-premises deployment support enable rapid adoption across both commercial and research settings.
OpenGuardrails unifies LLM-based safety and manipulation detection, NER- and regex-powered privacy redaction, model-agnostic policy enforcement for tool workflows, and detailed, audit-focused runtime infrastructure. It operationalizes mathematically interpretable risk-aware gating and flexible policy definition, establishing a reproducible and extensible infrastructure for AI guardrails across both conversational and automation domains (Wang et al., 22 Oct 2025, Sigdel et al., 18 Mar 2026, Rebedea et al., 2023).