
OpenGuardrails Platform

Updated 25 March 2026
  • OpenGuardrails is an open-source, context-aware platform that safeguards AI operations with integrated safety, manipulation detection, and privacy protection features.
  • It combines a unified LLM-based guard, a specialized NER and regex data-leakage pipeline, and a policy-first enforcement plane to manage LLM and tool-orchestrated workflows.
  • The platform reports state-of-the-art benchmark results and offers flexible deployment options and detailed audit logs for compliance and traceability.

OpenGuardrails is an open-source, context-aware platform designed to provide comprehensive safety, manipulation detection, and privacy protection for AI models and tool-orchestrated workflows. It delivers mathematical interpretability, layered enforcement mechanisms, policy-driven control, and state-of-the-art (SOTA) performance in both LLM and agentic tool environments. OpenGuardrails integrates a unified LLM-based guard, a lightweight data-leakage pipeline, policy enforcement for external tooling, and pluggable deployment modes, enabling controllable, explainable, and auditable safety for diverse AI-driven applications (Wang et al., 22 Oct 2025, Sigdel et al., 18 Mar 2026, Rebedea et al., 2023).

1. Architectural Components and Layered Design

OpenGuardrails features a modular, end-to-end architecture structured for fine-grained control and integration flexibility in both LLM-centric and tool-orchestrated ecosystems.

Core architectural layers include:

  • Unified LLM-Based Guard Model: A 14B-parameter dense decoder-only transformer, fine-tuned on multilingual safety and manipulation data, and quantized via GPTQ to 3.3B parameters for production use. This model provides joint detection for content-safety (hate, violence, sexual content, and more) and model-manipulation attacks (prompt injection, jailbreaks, malicious code generation), supporting up to 8K token contexts for prompt, history, and response evaluation (Wang et al., 22 Oct 2025).
  • NER/Redaction Data-Leakage Pipeline: Utilizes Presidio-style named-entity recognition and high-precision regex detectors to mask PII (names, SSNs, keys, etc.) or organizational secrets, operating independently but in tandem with the LLM guard (Wang et al., 22 Oct 2025).
  • Deployment Layer: Supports Security Gateway Mode (transparent proxy for LLM APIs), API Service Mode (RESTful endpoints), and full privacy controls (on-premises or VPC cloud), with sample Docker Compose and Kubernetes manifests provided (Wang et al., 22 Oct 2025).
  • Policy-First Enforcement Plane: In tool-using agentic workflows, every tool invocation passes through a Policy Enforcement Point (PEP) and Policy Decision Point (PDP), implementing runtime gating, output redaction, recovery controls, and logging for any script, agent, or CI bot (Sigdel et al., 18 Mar 2026).
  • Audit & Telemetry: Immutable JSONL audit logs track every intercepted action, policy decision, rationale, fix hints, transformation, and outcome, uniquely supporting postmortem analysis and traceability (Sigdel et al., 18 Mar 2026).
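
The regex half of the data-leakage pipeline described above can be sketched as follows. This is a minimal illustration rather than the project's implementation: the pattern set, the `mask` function name, and the `<LABEL>` placeholder format are all assumptions, and the NER stage is omitted.

```python
import re

# Hypothetical high-precision patterns; the real detector set is not given in the text.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "API_KEY": re.compile(r"\bsk-[A-Za-z0-9]{16,}\b"),
}


def mask(text: str) -> tuple[str, list[str]]:
    """Replace each match with a <LABEL> placeholder and report which labels fired."""
    hits = []
    for label, pattern in PATTERNS.items():
        text, count = pattern.subn(f"<{label}>", text)
        if count:
            hits.append(label)
    return text, hits


masked, found = mask("Contact jane@example.com, SSN 123-45-6789.")
print(masked)  # Contact <EMAIL>, SSN <SSN>.
print(found)   # ['SSN', 'EMAIL']
```

Returning the list of labels that fired alongside the masked text lets a caller log what was redacted without logging the sensitive values themselves.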

2. Detection Models and Policy DSL Specification

OpenGuardrails formalizes detection and constraint logic for both LLM content and tool invocation, enabling mathematically controlled, reproducible enforcement.

Unified LLM-Based Guard Model

  • Input format: JSON payload of the form {"context": string, "policy": {"categories": […], "sensitivity": τ}}
  • First-token classification: Vocabulary with tokens $y_1 \in \{\text{safe}, \text{unsafe}\}$.
  • Decision function: For $z_{\text{safe}}, z_{\text{unsafe}}$ (pre-softmax logits),

$$p_{\text{unsafe}} = \frac{\exp(z_{\text{unsafe}})}{\exp(z_{\text{safe}}) + \exp(z_{\text{unsafe}})}$$

Classify as unsafe if $p_{\text{unsafe}} \geq \tau$ (user-specified).

  • Loss function: Standard first-token cross-entropy,

$$L(\theta) = -\left[\mathbf{1}_{y^* = \text{unsafe}} \cdot \log p_{\text{unsafe}} + \mathbf{1}_{y^* = \text{safe}} \cdot \log p_{\text{safe}}\right]$$

  • Multilingual coverage: Fine-tuned on 117+ languages, including specific China-focused mixes (OpenGuardrailsMixZh_97k), with prompt- and response-level supervision (Wang et al., 22 Oct 2025).
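
The decision function and loss above can be worked through numerically. The sketch below assumes the two first-token logits have already been extracted from the model; the function names and the example logits are hypothetical.

```python
import math


def p_unsafe(z_safe: float, z_unsafe: float) -> float:
    """Two-way softmax over the 'safe' and 'unsafe' first-token logits."""
    m = max(z_safe, z_unsafe)  # subtract the max for numerical stability
    e_safe = math.exp(z_safe - m)
    e_unsafe = math.exp(z_unsafe - m)
    return e_unsafe / (e_safe + e_unsafe)


def classify(z_safe: float, z_unsafe: float, tau: float) -> str:
    """Label as 'unsafe' when p_unsafe >= tau, the user-specified sensitivity."""
    return "unsafe" if p_unsafe(z_safe, z_unsafe) >= tau else "safe"


def first_token_loss(z_safe: float, z_unsafe: float, label: str) -> float:
    """Cross-entropy on the first token: -log p_label, via log-sum-exp."""
    m = max(z_safe, z_unsafe)
    log_norm = m + math.log(math.exp(z_safe - m) + math.exp(z_unsafe - m))
    z_gold = z_unsafe if label == "unsafe" else z_safe
    return log_norm - z_gold


# Hypothetical logits: equal logits give p_unsafe = 0.5.
print(p_unsafe(1.0, 1.0))           # 0.5
print(classify(1.0, 1.0, tau=0.7))  # safe
print(classify(1.0, 1.0, tau=0.5))  # unsafe
```

The same logits yield different labels at different values of τ, so one model can serve both strict and permissive policies without retraining.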

Policy-First DSL for Tool Workflows

The bespoke policy language governs tool invocation and output redaction:

  • DSL grammar (excerpt):

<policies> ::= <policy>*
<policy> ::= "policy" <string> "{" <rule>* "}"
<rule>   ::= tool/tool_group | allow_if | deny_if | require_approval_if | budget | on_output | redact_patterns

  • Risk-aware gating: Calls are scored as $R(c) = w_t r_{\mathrm{tool}} + w_a r_{\mathrm{args}} + w_c r_{\mathrm{ctx}}$
    • $R(c) \geq \tau$ triggers REQUIRE_APPROVAL; a score above a stricter threshold $\tau_{\text{deny}}$ triggers DENY.
  • Example policy:

policy "workspace_fs_safety" {
  tool_group: ["fs.read", "fs.write", "fs.delete"]
  allow_if: args.path starts_with context.workspace_root
  deny_if:  args.path matches "(^/|\.{2}/|~)"
  require_approval_if: tool=="fs.delete" and args.recursive==true
}

  • Redaction: Patterns specified as regex for masking sensitive material in outputs.
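
The risk-aware gating rule above can be illustrated with a small sketch. The weights, component risks, and thresholds below are hypothetical values chosen for the example, and returning ALLOW below τ is an assumption, since the text only states the two escalation thresholds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RiskWeights:
    """Weights w_t, w_a, w_c for the tool, argument, and context risk terms."""
    tool: float
    args: float
    ctx: float


def risk_score(w: RiskWeights, r_tool: float, r_args: float, r_ctx: float) -> float:
    """R(c) = w_t * r_tool + w_a * r_args + w_c * r_ctx."""
    return w.tool * r_tool + w.args * r_args + w.ctx * r_ctx


def gate(score: float, tau: float, tau_deny: float) -> str:
    """Map a risk score to a decision; tau_deny is the stricter (higher) threshold."""
    if score > tau_deny:
        return "DENY"
    if score >= tau:
        return "REQUIRE_APPROVAL"
    return "ALLOW"  # assumed default when the score is below tau


weights = RiskWeights(tool=0.5, args=0.25, ctx=0.25)  # hypothetical weights
score = risk_score(weights, r_tool=0.5, r_args=1.0, r_ctx=0.0)
print(score)                                 # 0.5
print(gate(score, tau=0.5, tau_deny=0.75))   # REQUIRE_APPROVAL
```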

3. Enforcement Mechanisms and Runtime Controls

OpenGuardrails enforces guardrails via runtime interception, rule evaluation, risk computation, constraint checking, and recovery orchestration.

LLM-Guarded Workflows

  • Proxy deployment: All LLM prompts and completions routed through Security Gateway or API Service, with real-time classification, logging, and masking of unsafe or privacy-violating content (Wang et al., 22 Oct 2025).
  • Policy-driven dynamic sensitivity: Sensitivity thresholds and category sets are configured per use case (e.g., financial or creative applications).

Tool-Orchestrated Workflows

  • Interception point: Every tool call is intercepted, evaluated against loaded policy packs (P0–P4) by the Policy Decision Point (PDP), with possible decisions ALLOW, DENY, REQUIRE_APPROVAL, or TRANSFORM (Sigdel et al., 18 Mar 2026).
  • Recovery strategies: Configurable per tool—exponential backoff, max_retries, circuit breakers for high failure rates, fallback routines on persistent failures.
  • Fix-hints: Automatic synthesis of actionable hints for remediation on DENY or RATE-LIMIT actions.
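
The recovery strategies listed above (exponential backoff, a retry cap, a circuit breaker, and a fallback) might be combined roughly as follows. This is a simplified sketch under stated assumptions: the `ToolRunner` class and its parameters are hypothetical, the breaker trips on a count of consecutive failures rather than a failure rate, and there is no half-open state to reopen a tripped circuit.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class ToolRunner:
    """Retries a tool call with exponential backoff behind a simple circuit breaker."""

    def __init__(self, max_retries: int = 3, base_delay: float = 0.1,
                 failure_threshold: int = 5,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.max_retries = max_retries
        self.base_delay = base_delay
        self.failure_threshold = failure_threshold
        self.sleep = sleep
        self.consecutive_failures = 0

    def call(self, tool: Callable[[], T],
             fallback: Optional[Callable[[], T]] = None) -> T:
        # Circuit open: skip the tool entirely and go straight to the fallback.
        if self.consecutive_failures >= self.failure_threshold:
            return self._fail(fallback, "circuit open")
        for attempt in range(self.max_retries + 1):
            try:
                result = tool()
            except Exception:
                self.consecutive_failures += 1
                if attempt < self.max_retries:
                    # Delay doubles on each retry: base, 2*base, 4*base, ...
                    self.sleep(self.base_delay * 2 ** attempt)
            else:
                self.consecutive_failures = 0  # a success resets the failure count
                return result
        return self._fail(fallback, "retries exhausted")

    @staticmethod
    def _fail(fallback, reason: str):
        if fallback is not None:
            return fallback()
        raise RuntimeError(f"tool call failed: {reason}")


# Usage with a hypothetical flaky tool that succeeds on its third attempt.
attempts = {"n": 0}


def flaky_tool() -> str:
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise IOError("transient failure")
    return "ok"


runner = ToolRunner(max_retries=3, base_delay=0.01, sleep=lambda s: None)
print(runner.call(flaky_tool))  # ok
```

Injecting the `sleep` function keeps the backoff schedule testable without real delays.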

Audit and Explanation

  • Decision record structure: Each audit log entry encodes trace_id, step_id, decision, triggered policy IDs, rationale, fix_hint, risk_score, applied transformations, timing overhead.
  • Traceability: All decisions, including transformations and failure recovery steps, are logged for postmortem provenance and compliance.
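
A decision record with the fields listed above could be serialized to JSONL along these lines. The sketch is illustrative only: the `DecisionRecord` class, the `transformations` and `overhead_ms` field names, and the sample values are assumptions, and storage-level immutability (for example, write-once media) is outside its scope.

```python
import io
import json
from dataclasses import asdict, dataclass
from typing import List, TextIO


@dataclass(frozen=True)
class DecisionRecord:
    trace_id: str
    step_id: int
    decision: str
    policy_ids: List[str]
    rationale: List[str]
    fix_hint: str
    risk_score: float
    transformations: List[str]  # assumed name for "applied transformations"
    overhead_ms: float          # assumed name for "timing overhead"


def append_record(stream: TextIO, record: DecisionRecord) -> None:
    """Write one record as a single JSON object on its own line (JSONL)."""
    stream.write(json.dumps(asdict(record), sort_keys=True) + "\n")


log = io.StringIO()  # stands in for an append-only log file
append_record(log, DecisionRecord(
    trace_id="t1", step_id=3, decision="DENY",
    policy_ids=["workspace_fs_safety"],
    rationale=["fs.delete not under workspace_root"],
    fix_hint="Use a workspace-relative path.",
    risk_score=0.9, transformations=[], overhead_ms=1.2,  # hypothetical values
))
line = log.getvalue().splitlines()[0]
print(json.loads(line)["decision"])  # DENY
```

Because each record occupies exactly one line, a postmortem tool can stream the log and parse entries independently.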

4. Empirical Performance and Quantitative Results

OpenGuardrails demonstrates robust, SOTA results across LLM safety, multilingual classification, and policy-enforced tool workflows.

| Task | Model | F1 Score |
|---|---|---|
| English Prompt (avg) | OpenGuardrails-Text-2510 | 87.1 |
| English Response (avg) | OpenGuardrails | 88.5 |
| Chinese Prompt (avg) | OpenGuardrails | 87.4 |
| Chinese Response (avg) | OpenGuardrails | 85.2 |
| Multilingual Prompt (RTP-LX avg) | OpenGuardrails | 97.3 |
| Multilingual Response (PolyGuard-Response avg) | OpenGuardrails | 97.2 |

Resource profile: 3.3B model, ~6GB GPU VRAM, P95 latency 274.6 ms for 1K tokens on A100, throughput ~100 req/s per GPU. GPTQ quantization yields <1% F1 drop for 50% memory reduction and 2× speedup.

| Policy Pack | Violation Prevention Rate (V) | False Block Rate (F) | Success Rate (S) | Retry Amplification (A) | Leakage Recall (L) |
|---|---|---|---|---|---|
| P0 | 0.000 | 0.000 | 0.356 | 3.774 | 0.875 |
| P3 | 0.597 | 0.067 | 0.133 | 1.689 | 0.875 |
| P4 | 0.681 | 0.067 | 0.067 | 1.378 | 0.875 |

Trade-offs: Incrementally adding argument constraints, budgets, approvals, and redaction raises the violation prevention rate (from 0.000 for P0 to 0.681 for P4) at the cost of utility: the success rate falls from 0.356 to 0.067 over the same packs, while leakage recall stays at 0.875.

5. Guardrail Definition Languages and Programming Interfaces

OpenGuardrails exposes formal, extensible, and interpretable configuration interfaces targeted at both LLM and non-LLM integrations.

  • LLM environment: Safety policies specified per request as lists of categories and thresholds; NER and regex-based PII detection configured via pipeline modules.
  • DSL and Colang: For tool workflows and conversational rails, flows and canonical forms are declared using a Python-style or JSON/YAML-style DSL, supporting flow-branching, custom actions, approval gates, and output redaction. For NeMo Guardrails, Colang flow syntax enables composition of dialogue, topical, and execution rails (Rebedea et al., 2023).
  • APIs: RESTful endpoints (OpenAPI), Python clients, plug-in hooks (e.g., LangChain), and runtime or gateway servers. Integration accommodates both model-agnostic and model-specific deployments.

6. Example Usage Scenarios and Patterns

Security Gateway:

version: '3.8'
services:
  openguardrails:
    image: openguardrails/security-gateway:latest
    ports: [ "8080:8080" ]
    volumes: [ ./policy.yaml:/app/policy.yaml ]

A custom policy.yaml selects the categories (e.g., ["political", "data_leakage", "violence"]) and the sensitivity. All application LLM traffic is then routed through the gateway endpoint.

Standalone REST API:

from openguardrails.client import GuardClient

# Connect to a privately hosted API Service endpoint.
client = GuardClient(endpoint="https://my-privatedomain/api", api_key="****")
# The request carries the text to check plus a per-request policy.
request = {"context": "...", "policy": {"categories": [...], "sensitivity": 0.7}}
response = client.classify(request)

Policy Pack for File-System Safety:

policy "workspace_fs_safety" {
  tool_group: ["fs.read", "fs.write", "fs.delete"]
  allow_if: args.path starts_with context.workspace_root
  deny_if:  args.path matches "(^/|\.{2}/|~)"
  require_approval_if: tool=="fs.delete" and args.recursive==true
}
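
To make the semantics of this policy concrete, the sketch below evaluates it by hand in Python. It is not the platform's PDP: the `evaluate` function, the rule precedence (deny_if first, then the allow_if check, then require_approval_if), the default-deny fallthrough, and the NOT_APPLICABLE result for tools outside the group are all assumptions. The example uses a relative workspace root because the deny_if pattern rejects any path that begins with "/".

```python
import re

TOOL_GROUP = {"fs.read", "fs.write", "fs.delete"}
# Same pattern as deny_if: absolute paths, "../" traversal, or "~".
DENY_PATTERN = re.compile(r"(^/|\.{2}/|~)")


def evaluate(tool: str, args: dict, workspace_root: str) -> str:
    """Hand-written evaluation of the workspace_fs_safety rules (assumed precedence)."""
    if tool not in TOOL_GROUP:
        return "NOT_APPLICABLE"  # policy does not cover this tool
    path = args["path"]
    if DENY_PATTERN.search(path):
        return "DENY"
    if not path.startswith(workspace_root):
        return "DENY"  # assumed default when allow_if does not hold
    if tool == "fs.delete" and args.get("recursive") is True:
        return "REQUIRE_APPROVAL"
    return "ALLOW"


root = "workspace/"  # hypothetical relative workspace root
print(evaluate("fs.read", {"path": "workspace/notes.txt"}, root))   # ALLOW
print(evaluate("fs.write", {"path": "/etc/passwd"}, root))          # DENY
print(evaluate("fs.read", {"path": "workspace/../secret"}, root))   # DENY
print(evaluate("fs.delete", {"path": "workspace/tmp", "recursive": True}, root))  # REQUIRE_APPROVAL
```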

Audit Example:

{"trace_id":"t1","step_id":3,"decision":"DENY",
 "policy_ids":["workspace_fs_safety"],
 "rationale":["fs.delete not under workspace_root","recursive delete requires approval"],
 "fix_hint":"Use a workspace-relative path or request approval for recursive delete."}

7. Licensing, Extensibility, and Community Integration

  • Licensing: Released under Apache 2.0—includes code, model weights, pipelines, API specs, and deployment manifests (Wang et al., 22 Oct 2025).
  • Ecosystem integration: Official repositories on GitHub and HuggingFace for models, pipelines, and sample deployments.
  • Extensibility: Supports modular addition of custom policy packs, new detection or redaction modules, and tool adaptation. Model-agnostic and cloud/on-premises deployment support enable rapid adoption across both commercial and research settings.

OpenGuardrails unifies LLM-based safety and manipulation detection, NER- and regex-powered privacy redaction, model-agnostic policy enforcement for tool workflows, and detailed, audit-focused runtime infrastructure. It operationalizes mathematically interpretable risk-aware gating and flexible policy definition, establishing a reproducible and extensible infrastructure for AI guardrails across both conversational and automation domains (Wang et al., 22 Oct 2025, Sigdel et al., 18 Mar 2026, Rebedea et al., 2023).
