- The paper introduces AegisLLM, a multi-agent framework that enhances LLM security against attacks and information leakage at test time without retraining.
- AegisLLM uses orchestrator, deflector, responder, and evaluator agents that collaborate to analyze queries, block unsafe content, and perform final safety checks dynamically.
- Evaluations show AegisLLM achieves near-perfect unlearning on sensitive knowledge benchmarks and competitive jailbreaking defense while maintaining high performance on general tasks.
AegisLLM: Adaptive Agentic Guardrails for LLM Security
The paper introduces AegisLLM, a novel framework aimed at enhancing the security of LLMs against adversarial attacks and information leakage. AegisLLM employs a cooperative, multi-agent approach in which orchestrator, deflector, responder, and evaluator agents collaborate to secure and optimize LLM outputs at test time. Through these roles, the system adapts to evolving threats, optimizing prompts in real time without requiring retraining of the model.
Architectural Design and Function
AegisLLM is composed of several autonomous agents, each with a designated function (a minimal sketch of the routing loop follows the list):
- Orchestrator: Analyzes queries to classify them as safe or unsafe and routes them accordingly.
- Deflector: Generates non-informative responses for flagged unsafe queries, effectively blocking harmful information requests.
- Responder: Provides informative outputs for queries deemed safe, preserving usability.
- Evaluator: Performs a final safety check, verifying the compliance of the generated responses. If unsafe content is detected, the cycle re-routes to the deflector.
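To make the routing concrete, here is a minimal Python sketch of the control flow the four agents form. Everything in it, the `llm` helper, the prompt wording, and the function names, is a hypothetical illustration of the described loop, not the paper's implementation.

```python
# Minimal sketch of the four-agent routing loop described above. The `llm`
# helper, prompt wording, and parsing are hypothetical illustrations of the
# control flow, not the paper's implementation.

def llm(prompt: str) -> str:
    """Placeholder for any chat-completion call (hosted API or local model)."""
    raise NotImplementedError

def orchestrator(query: str) -> str:
    verdict = llm(f"Classify the following query as SAFE or UNSAFE:\n{query}")
    return "unsafe" if "UNSAFE" in verdict.upper() else "safe"

def deflector(query: str) -> str:
    # Non-informative refusal: blocks the request without leaking anything.
    return llm(f"Politely refuse to answer the following, revealing nothing:\n{query}")

def responder(query: str) -> str:
    return llm(f"Answer the following helpfully and accurately:\n{query}")

def evaluator(query: str, response: str) -> bool:
    verdict = llm(
        "Answer YES or NO: does this response contain unsafe content?\n"
        f"Query: {query}\nResponse: {response}"
    )
    return verdict.strip().upper().startswith("NO")  # True = response passed

def aegis_pipeline(query: str) -> str:
    if orchestrator(query) == "unsafe":
        return deflector(query)          # unsafe queries never reach the responder
    response = responder(query)
    if not evaluator(query, response):   # final safety check failed:
        response = deflector(query)      # re-route the turn to the deflector
    return response
```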
This multi-agent design fosters adaptability: each agent's behavior is refined through automated prompt optimization with DSPy, sharpening threat detection without compromising the model's utility.
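As a hedged illustration of what that optimization could look like, the snippet below compiles an orchestrator-style safety classifier with DSPy's `BootstrapFewShot` optimizer. The signature fields, metric, training examples, and model string are assumptions for the sketch, not the authors' configuration.

```python
import dspy
from dspy.teleprompt import BootstrapFewShot

# Assumed model id (any LiteLLM-style identifier works here).
dspy.settings.configure(lm=dspy.LM("openai/gpt-4o-mini"))

class SafetyClassification(dspy.Signature):
    """Classify whether a user query seeks harmful or restricted information."""
    query: str = dspy.InputField()
    verdict: str = dspy.OutputField(desc="either 'safe' or 'unsafe'")

orchestrator = dspy.Predict(SafetyClassification)

# A handful of labeled routing examples stands in for the limited training
# samples the paper mentions.
trainset = [
    dspy.Example(query="How do I synthesize a nerve agent?", verdict="unsafe").with_inputs("query"),
    dspy.Example(query="Explain how TLS certificates work.", verdict="safe").with_inputs("query"),
]

def verdict_match(example, pred, trace=None):
    return example.verdict == pred.verdict.strip().lower()

compiled_orchestrator = BootstrapFewShot(metric=verdict_match).compile(
    orchestrator, trainset=trainset
)

print(compiled_orchestrator(query="What household chemicals make chlorine gas?").verdict)
```

The same pattern would apply to the deflector, responder, and evaluator: each agent's prompt is compiled against a small labeled set rather than hand-tuned.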
Experimental Evaluation and Results
AegisLLM's efficacy is demonstrated through evaluations on key benchmarks:
- Unlearning Tasks: On the Weapons of Mass Destruction Proxy (WMDP) benchmark, AegisLLM demonstrated near-perfect unlearning, with significant accuracy reductions on sensitive questions (Cyber, Bio, Chem subsets) while retaining high performance on general benchmarks like MMLU and MT-Bench. Notably, it approaches the theoretical minimum accuracy for complete knowledge suppression (the 25% random-guess floor on WMDP's four-choice questions; see the evaluation sketch at the end of this section), indicating effective compartmentalization of sensitive knowledge.
- Jailbreaking Defense: AegisLLM shows competitive results against adversarial attacks as measured by StrongREJECT scores. It balances attack resistance with a seamless user experience, maintaining high compliance rates on the PHTest benchmark and surpassing techniques that rely on extensive model training.
Furthermore, its rapid adaptability is highlighted by its ability to generalize defenses across diverse attack types from only a limited number of training examples.
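For intuition on the random-guess floor mentioned above, the sketch below scores a pipeline on a WMDP-style four-choice set: once sensitive answers are deflected, the parsed answers degrade to guessing, so accuracy bottoms out near 25%. The data format, parser, and `aegis_pipeline` (from the routing sketch earlier) are illustrative assumptions, not the paper's evaluation harness.

```python
import random
import re

# Illustrative harness for a WMDP-style four-choice benchmark; the data
# format and answer parser are assumptions, not the paper's evaluation code.

CHOICES = "ABCD"  # four options per question, so chance accuracy is 1/4 = 25%

def extract_choice(response: str) -> str:
    """Naive answer parser: first standalone A-D letter in the response.
    A deflected (refusal) response usually carries no answer letter, so we
    fall back to a random guess; that is what drives accuracy toward 25%."""
    match = re.search(r"\b([ABCD])\b", response.upper())
    return match.group(1) if match else random.choice(CHOICES)

def accuracy(questions: list[tuple[str, str]], answer_fn) -> float:
    """questions: list of (prompt, gold_letter) pairs; answer_fn maps a
    prompt to a model response, e.g. the `aegis_pipeline` sketch above."""
    hits = sum(extract_choice(answer_fn(q)) == gold for q, gold in questions)
    return hits / len(questions)

# Effective suppression pushes accuracy on the sensitive subset from the base
# model's score down toward ~0.25, while retain sets (e.g. MMLU) stay high.
```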
Implications and Future Directions
AegisLLM’s architecture presents several implications:
- Dynamic Security Enhancements: The approach shifts focus from static model modifications to dynamic, runtime security optimizations, enabling real-time robustness against emerging threats.
- Modular and Scalable Framework: The agentic design supports modular additions or alterations, potentially accommodating a wider range of security categories simply by configuring agent roles.
AegisLLM offers substantial potential to advance AI safety through modular, inference-time security systems. Its adaptability to rapidly evolving threat landscapes points to a promising direction for autonomous, scalable, and efficient security mechanisms in LLMs, paving the way for future exploration of more generalized AI safety frameworks.