AegisLLM: Modular Security for LLMs

Updated 28 January 2026
  • AegisLLM is a modular and adaptive security framework that uses agentic decomposition to distribute risk moderation tasks among specialized agents.
  • It employs dynamic techniques like automated prompt optimization and ensemble online learning to combat adversarial attacks and emergent risks in real time.
  • The framework leverages advanced multi-agent communication, cryptographic protocols, and parameter-efficient tuning to enhance LLM safety and compliance.

AegisLLM is a suite of modular, multi-agent, and adaptive security and safety frameworks for LLMs, encompassing architectures for content risk moderation, prompt-injection defense, functional safety engineering, agent governance, and environment optimization. Its unifying principle is agentic decomposition: defense and oversight tasks are distributed among specialized agents (Orchestrator, Evaluator, Deflector, Responder, etc.), with dynamic adaptability driven by automated prompt optimization, ensemble online learning, and self-reflection mechanisms. Unlike static model modification approaches, AegisLLM implementations operate primarily at inference (test) time, enabling rapid scaling and real-time response to emergent risks, adversarial attacks, and evolving requirements.

1. Architecture and Agent Roles

AegisLLM organizes reasoning and defense as a pipeline of autonomous agent roles, each focused on a distinct security or moderation subtask (Cai et al., 29 Apr 2025, Shi et al., 2024, Ghosh et al., 2024). Core agents are:

  • Orchestrator: Accepts user input, extracts key concepts, and classifies for risk or policy violations (using rule-based or prompt-driven logic).
  • Responder: Generates candidate outputs if Orchestrator deems the query safe.
  • Evaluator: Performs second-stage compliance checking on generated outputs, identifying context-dependent leaks or violations.
  • Deflector: Responds to unsafe queries or outputs with suitable refusals or sanitized replies.
  • Specialized Agents (in domain-specific deployments): Examples include the Functional Safety Manager (hazard analysis), Functional Safety Expert (review), and V&V Engineer (test planning) in automotive safety (Shi et al., 2024).

Workflows are formally described in high-level pseudocode:

def AegisLLM_Handle(q):
    safe_flag = Orchestrator(q)        # first-stage screening of the input query
    if safe_flag:
        a = Responder(q)               # generate a candidate answer
        safe_flag = Evaluator(q, a)    # second-stage compliance check on the output
        if safe_flag:
            return a
    return Deflector(q)                # refuse or return a sanitized reply
This compositional paradigm supports dynamic role addition and configuration (e.g., introducing a Threat Predictor or Sanitizer) by supplying new system prompts, without retraining the underlying LLMs.
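
As an illustration, a new role can be introduced purely through configuration. The sketch below assumes a generic call_llm(system_prompt, user_input) helper and hypothetical prompt text; it is not the framework's actual prompt set.

# Minimal sketch of prompt-driven role addition (helper names and prompt text
# are hypothetical; no retraining of the underlying LLM is involved).
AGENT_PROMPTS = {
    "Orchestrator": "Classify the user query as SAFE or UNSAFE per policy ...",
    "Responder":    "Answer the user query helpfully and concisely ...",
    "Evaluator":    "Check the draft answer for policy violations; reply SAFE/UNSAFE ...",
    "Deflector":    "Politely refuse or sanitize the request ...",
}

def register_agent(role_name, system_prompt):
    """Add or reconfigure a role by supplying a new system prompt."""
    AGENT_PROMPTS[role_name] = system_prompt

# Introducing a new specialized agent at deployment time:
register_agent("Sanitizer", "Rewrite the answer to remove any PII before returning it ...")

def run_agent(role_name, user_input, call_llm):
    return call_llm(AGENT_PROMPTS[role_name], user_input)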

2. Adaptive Risk and Safety Taxonomies

AegisLLM frameworks adopt extensible hierarchical taxonomies for content risk and safety. The Aegis2.0 schema distinguishes 12 core hazard types (e.g., Hate/Identity Hate, Suicide/Self-Harm, Guns/Illegal Weapons, PII/Privacy, Criminal Planning, Controlled Substances, Sexual, plus subcategories) and supports 9 fine-grained extension categories discovered via annotation (Ghosh et al., 15 Jan 2025, Ghosh et al., 2024). Each category encodes formal must-not or may policies and example violations for high-fidelity annotation and training.

Sample core category definitions:

| Index | Risk Category | Must-Not/Definition | Example |
|-------|---------------|---------------------|---------|
| 1 | Hate/Identity Hate | Slurs/demeaning content re: protected groups | “All [group X] are lazy parasites.” |
| 2 | Suicide/Self-Harm | Promotion/instruction of self-harm | “Here is how to end it all painlessly.” |
| 3 | Guns/Weapons | Instructions for illegal weapon creation | “How to convert a toy gun into a working pistol.” |
| … | Sexual, PII, Criminal, etc. | … | … |

These schemas support scalable human and weak-supervision annotation at both prompt and output levels.
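
A category definition of this kind can be represented as a small structured record for annotation tooling. The sketch below is illustrative only; the field names and policy wording are assumptions, not the published Aegis2.0 schema.

# Hypothetical encoding of one risk category with must-not/may policies
# (field names and wording are illustrative, not the published schema).
from dataclasses import dataclass, field

@dataclass
class RiskCategory:
    index: int
    name: str
    must_not: list = field(default_factory=list)   # prohibited behaviors
    may: list = field(default_factory=list)        # explicitly permitted behaviors
    examples: list = field(default_factory=list)   # example violations for annotators

hate = RiskCategory(
    index=1,
    name="Hate/Identity Hate",
    must_not=["Use slurs or demeaning content about protected groups"],
    may=["Discuss hate speech neutrally for educational purposes"],
    examples=["All [group X] are lazy parasites."],
)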

3. Model Training and Automated Optimization

AegisLLM leverages parameter-efficient fine-tuning (e.g., LoRA on Llama3.1-8B), ensemble budgeting, and co-evolutionary adversarial training to align guard models with annotated risk taxonomies (Ghosh et al., 15 Jan 2025, Ghosh et al., 2024, Liu et al., 27 Aug 2025). Models trained on Aegis2.0 or AEGISSAFETYDATASET achieve competitive or superior performance compared to larger, closed-source baselines in safety detection and refusal.
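
A minimal sketch of what such parameter-efficient guard-model tuning can look like with the Hugging Face peft library is shown below; the exact checkpoint, target modules, and hyperparameters are assumptions rather than the published training recipe.

# Illustrative LoRA setup for a guard model (hyperparameters are assumptions).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
guard_model = get_peft_model(base, lora_cfg)
guard_model.print_trainable_parameters()  # only the LoRA adapters are trainable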

Automated prompt optimization is central. AegisLLM treats prompt design as either Bayesian optimization over discrete system-prompts (Cai et al., 29 Apr 2025) or adversarial co-evolution (attacker and defender prompts iteratively optimized using textual gradient feedback) (Liu et al., 27 Aug 2025). For co-evolution:

  • Attacker maximizes attack success rate (ASR) and relative output score.
  • Defender maximizes the true positive rate (TPR) while maintaining a high true negative rate (TNR).
  • Textual Gradient Optimization (TGO+): Generates natural-language “gradient” suggestions for prompt edits, buffers past gradients for stabilization, and replays them to accelerate convergence.

Core scoring functions are:

$$S_{\mathrm{attack}} = w_{\mathrm{asr}} \cdot (\mathrm{ASR})^{p_{\mathrm{asr}}} + w_{\mathrm{sc}} \cdot (\Delta S_{\mathrm{rel}})^{p_{\mathrm{sc}}}$$

$$S_{\mathrm{defense}} = w_{\mathrm{tp}} \cdot (\mathrm{TPR})^{p_{\mathrm{tp}}} + w_{\mathrm{tn}} \cdot (\mathrm{TNR})^{p_{\mathrm{tn}}}$$
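
These scoring functions translate directly into code. The sketch below computes both scores for one co-evolution round; the weight and exponent values are illustrative assumptions.

# Illustrative computation of the attack/defense scores that drive
# co-evolutionary prompt optimization (weights/exponents are assumptions).
def attack_score(asr, delta_s_rel, w_asr=0.7, w_sc=0.3, p_asr=1.0, p_sc=1.0):
    """S_attack = w_asr * ASR^p_asr + w_sc * (Delta S_rel)^p_sc"""
    return w_asr * asr ** p_asr + w_sc * delta_s_rel ** p_sc

def defense_score(tpr, tnr, w_tp=0.5, w_tn=0.5, p_tp=1.0, p_tn=1.0):
    """S_defense = w_tp * TPR^p_tp + w_tn * TNR^p_tn"""
    return w_tp * tpr ** p_tp + w_tn * tnr ** p_tn

# One notional co-evolution round: each side keeps the prompt edit that
# improves its own score against the other's current prompt.
s_att = attack_score(asr=0.42, delta_s_rel=0.10)
s_def = defense_score(tpr=0.84, tnr=0.95)
print(f"S_attack={s_att:.3f}  S_defense={s_def:.3f}")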

Empirically, iterative prompt refinement (DSPy-style) yields rapid adaptation to emergent attack classes, near-perfect unlearning with minimal calls, and low false refusal rates on benign queries (Cai et al., 29 Apr 2025).

4. Governance, Security, and Agent Communication

AegisLLM extends to multi-agent security, using architectures adapted from SAGA for agent lifecycle governance (Syros et al., 27 Apr 2025). Agents are registered with a central Provider, which supports:

  • Per-agent contact policies: JSON rules specifying allowed initiator patterns and quotas for one-time keys (OTKs), with support for live editing and deactivation.
  • Cryptographic mechanisms: Mutual-TLS channels, Diffie-Hellman key exchange for OTK generation, fine-grained access-control tokens enforced per interaction, short-lived credential rotation.
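
A minimal sketch of deriving an OTK from a Diffie-Hellman exchange (here X25519 with HKDF, via the Python cryptography package) follows; the concrete curve, KDF, and message framing used by the framework are not specified in the text and are assumptions here.

# Hypothetical OTK derivation via an X25519 Diffie-Hellman exchange.
from cryptography.hazmat.primitives.asymmetric.x25519 import X25519PrivateKey
from cryptography.hazmat.primitives.kdf.hkdf import HKDF
from cryptography.hazmat.primitives import hashes

initiator_priv = X25519PrivateKey.generate()
responder_priv = X25519PrivateKey.generate()

# Public keys would be exchanged over the mutually authenticated (mTLS) channel.
shared_initiator = initiator_priv.exchange(responder_priv.public_key())
shared_responder = responder_priv.exchange(initiator_priv.public_key())
assert shared_initiator == shared_responder

# Derive a short-lived one-time key (OTK) from the shared secret.
otk = HKDF(algorithm=hashes.SHA256(), length=32, salt=None,
           info=b"otk-handshake").derive(shared_initiator)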

Example policy:

[
  { "agents": "[email protected]:calendar_agent", "budget": 15 },
  { "agents": "*@corp.com:*", "budget": 25 }
]
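
Enforcement of such a contact policy can be sketched as pattern matching plus budget accounting. The function below is illustrative; the rule contents and any semantics beyond the example above are assumptions.

# Hypothetical Provider-side check of a contact policy: the initiator must
# match a rule's pattern and that rule's OTK budget must not be exhausted.
import fnmatch
import json

POLICY = json.loads("""
[
  { "agents": "*@corp.com:calendar_agent", "budget": 15 },
  { "agents": "*@corp.com:*", "budget": 25 }
]
""")

otk_usage = {}  # OTKs issued so far, keyed by rule pattern

def may_contact(initiator: str) -> bool:
    """Return True and consume one OTK if some rule admits this initiator."""
    for rule in POLICY:
        pattern, budget = rule["agents"], rule["budget"]
        if fnmatch.fnmatch(initiator, pattern):
            used = otk_usage.get(pattern, 0)
            if used < budget:
                otk_usage[pattern] = used + 1
                return True
    return False

print(may_contact("alice@corp.com:calendar_agent"))  # True until the budget is exhausted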

Performance evaluation shows amortized communication overhead is ≤1% of task latency, with security vs. throughput tunable via key quotas. Comprehensive registry, key management, and audit/monitoring infrastructure are prescribed for robust deployment.

5. Optimizing Agent-Environment Interaction

AegisLLM formalizes an agent-environment failure taxonomy and proposes environment-level optimizations that require no changes to the agent or underlying LLM (Song et al., 27 Aug 2025).

Six failure modes are identified:

  1. State-space Navigation Failure
  2. State Awareness Failure
  3. Tool Output Processing Failure
  4. Domain Rule Violation
  5. User Instruction Following Failure
  6. Turn/Token Exhaustion

Targeted fixes include:

  • Observability Enhancement: Expands tool responses to clarify the reachable state and confirm transitions (lookahead hints, explicit post-state).
  • Computation Offloading: Bundles sorted/min/avg outputs and rule validations in environment-layer wrappers (a sketch follows this list).
  • Speculative Actions: Executes likely next steps and provides bundled results, reducing token/turn wastage.
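
The sketch below illustrates computation offloading and observability enhancement as an environment-layer wrapper around a tool call; the tool, response fields, and summary statistics are hypothetical.

# Illustrative environment-layer wrapper: augment a raw tool response with
# precomputed aggregates (offloading) and explicit post-state hints
# (observability). Tool name and fields are hypothetical.
def wrapped_search_flights(env, query):
    flights = env.call_tool("search_flights", query)   # raw tool output (list of dicts)

    prices = [f["price"] for f in flights]
    return {
        "results": sorted(flights, key=lambda f: f["price"]),  # pre-sorted for the agent
        "summary": {
            "count": len(flights),
            "min_price": min(prices) if prices else None,
            "avg_price": sum(prices) / len(prices) if prices else None,
        },
        # Explicit post-state / lookahead hints so the agent does not spend
        # extra turns probing the environment.
        "next_actions": ["book_flight(flight_id)", "refine_search(filters)"],
    }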

Evaluations report mean success rate improvements of 6.7–12.5 pp across five benchmarks, with up to 17% API cost savings. Environment optimizations complement agentic advances and allow reliable deployment in complex tool-rich domains.

6. Empirical Results and Benchmarks

AegisLLM demonstrates substantial performance improvements across a range of security and moderation tasks:

  • Unlearning (WMDP benchmark): AegisLLM achieves near-random accuracy (Cyber: 24.4%, Bio: 25.4%, Chem: 27.2%) with minimal examples/calls, while retaining general knowledge (MMLU: 58.4%) (Cai et al., 29 Apr 2025).
  • Jailbreaking Defense: AegisLLM halves the strong rejection escape rate over base models (3.8% vs 7.8%) and maintains a benign false refusal rate of 7.9%, much lower than competing static defenses (Cai et al., 29 Apr 2025).
  • Prompt Injection Detection: Co-evolutionary frameworks (AEGIS) outperform baselines in TPR (0.84 vs 0.61) and attack success rate (1.00 vs 0.74 on strongest adversaries) (Liu et al., 27 Aug 2025).
  • Intrusion Detection for Tool-Augmented Agents: AegisMCP achieves session-level AP = 0.947, AUROC = 0.985 at 2% FPR, with sub-second end-to-end latency on edge hardware (Zhan et al., 22 Oct 2025).
  • Functional Safety: Aegis-Max reaches FSR score 82/100 and 95% case coverage for complex AEB engineering, with 80% reduction in manual engineering time (Shi et al., 2024).

7. Applications, Impact, and Directions

AegisLLM enables robust content moderation (guardrails), adversarial resilience (unlearning, jailbreak defense, prompt injection detection), privacy and safety education via adversarial games, agent governance in multi-agent LLM networks, and reliability improvements in tool-augmented deployment. Its open-source datasets and frameworks foster reproducible evaluation, rapid adaptation, and extensible risk coverage (Ghosh et al., 15 Jan 2025, Ghosh et al., 2024).

Extensible agentic composition and self-reflective prompt optimization position AegisLLM as a runtime alternative to conventional model-centric security approaches. Ongoing work aims to integrate multi-provider trust negotiation, dynamic role-driven policies, automatic anomaly detection, and environment-driven verification/certification.

AegisLLM’s modular, agentic paradigm underpins state-of-the-art security, safety, and functional assurance for LLM-based systems across commercial, edge, privacy-sensitive, and mission-critical domains.
