- The paper introduces AegisLLM, a multi-agent framework that enhances LLM security against attacks and information leakage at test time without retraining.
- AegisLLM uses orchestrator, deflector, responder, and evaluator agents that collaborate to analyze queries, block unsafe content, and perform final safety checks dynamically.
- Evaluations show AegisLLM achieves near-perfect unlearning on sensitive knowledge benchmarks and competitive jailbreaking defense while maintaining high performance on general tasks.
AegisLLM: Adaptive Agentic Guardrails for LLM Security
The paper introduces AegisLLM, a novel framework aimed at enhancing the security of LLMs against adversarial attacks and information leakage. AegisLLM employs a cooperative, multi-agent approach in which orchestrator, deflector, responder, and evaluator agents collaborate to secure and optimize LLM outputs at test time. Through these roles, the system adapts to evolving threats, optimizing prompts in real time without requiring retraining of the model.
Architectural Design and Function
AegisLLM is composed of several autonomous agents, each with a designated function (a minimal sketch of the routing loop follows the list):
- Orchestrator: Analyzes queries to classify them as safe or unsafe and routes them accordingly.
- Deflector: Generates non-informative responses for flagged unsafe queries, effectively blocking harmful information requests.
- Responder: Provides informative outputs for queries deemed safe, preserving usability.
- Evaluator: Performs a final safety check, verifying the compliance of the generated responses. If unsafe content is detected, the cycle re-routes to the deflector.
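To make the routing concrete, here is a minimal Python sketch of the control flow the four agents form. Everything in it, the `llm` helper, the prompt wording, and the function names, is a hypothetical illustration of the described loop, not the paper's implementation.

```python
# Minimal sketch of the four-agent routing loop described above. The `llm`
# helper, prompt wording, and parsing are hypothetical illustrations of the
# control flow, not the paper's implementation.

def llm(prompt: str) -> str:
    """Placeholder for any chat-completion call (hosted API or local model)."""
    raise NotImplementedError

def orchestrator(query: str) -> str:
    verdict = llm(f"Classify the following query as SAFE or UNSAFE:\n{query}")
    return "unsafe" if "UNSAFE" in verdict.upper() else "safe"

def deflector(query: str) -> str:
    # Non-informative refusal: blocks the request without leaking anything.
    return llm(f"Politely refuse to answer the following, revealing nothing:\n{query}")

def responder(query: str) -> str:
    return llm(f"Answer the following helpfully and accurately:\n{query}")

def evaluator(query: str, response: str) -> bool:
    verdict = llm(
        "Answer YES or NO: does this response contain unsafe content?\n"
        f"Query: {query}\nResponse: {response}"
    )
    return verdict.strip().upper().startswith("NO")  # True = response passed

def aegis_pipeline(query: str) -> str:
    if orchestrator(query) == "unsafe":
        return deflector(query)          # unsafe queries never reach the responder
    response = responder(query)
    if not evaluator(query, response):   # final safety check failed:
        response = deflector(query)      # re-route the turn to the deflector
    return response
```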
This multi-agent design fosters adaptability: each agent's behavior is refined through automated prompt optimization with DSPy, sharpening threat detection without compromising the model's utility.
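As a hedged illustration of what that optimization could look like, the snippet below compiles an orchestrator-style safety classifier with DSPy's `BootstrapFewShot` optimizer. The signature fields, metric, training examples, and model string are assumptions for the sketch, not the authors' configuration.

```python
import dspy
from dspy.teleprompt import BootstrapFewShot

# Assumed model id (any LiteLLM-style identifier works here).
dspy.settings.configure(lm=dspy.LM("openai/gpt-4o-mini"))

class SafetyClassification(dspy.Signature):
    """Classify whether a user query seeks harmful or restricted information."""
    query: str = dspy.InputField()
    verdict: str = dspy.OutputField(desc="either 'safe' or 'unsafe'")

orchestrator = dspy.Predict(SafetyClassification)

# A handful of labeled routing examples stands in for the limited training
# samples the paper mentions.
trainset = [
    dspy.Example(query="How do I synthesize a nerve agent?", verdict="unsafe").with_inputs("query"),
    dspy.Example(query="Explain how TLS certificates work.", verdict="safe").with_inputs("query"),
]

def verdict_match(example, pred, trace=None):
    return example.verdict == pred.verdict.strip().lower()

compiled_orchestrator = BootstrapFewShot(metric=verdict_match).compile(
    orchestrator, trainset=trainset
)

print(compiled_orchestrator(query="What household chemicals make chlorine gas?").verdict)
```

The same pattern would apply to the deflector, responder, and evaluator: each agent's prompt is compiled against a small labeled set rather than hand-tuned.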
Experimental Evaluation and Results
AegisLLM's efficacy is demonstrated through evaluations on key benchmarks:
- Unlearning Tasks: On the Weapons of Mass Destruction Proxy (WMDP) benchmark, AegisLLM demonstrated near-perfect unlearning, with significant accuracy reductions on sensitive questions (Cyber, Bio, Chem subsets) while retaining high performance on general benchmarks like MMLU and MT-Bench. Notably, it approaches the theoretical minimum accuracy for complete knowledge suppression (the 25% random-guess floor on WMDP's four-choice questions; see the evaluation sketch at the end of this section), indicating effective compartmentalization of sensitive knowledge.
- Jailbreaking Defense: AegisLLM shows competitive results against adversarial attacks as measured by StrongREJECT scores. It balances attack resistance with a seamless user experience, maintaining high compliance rates on the PHTest benchmark and surpassing techniques that rely on extensive model training.
Furthermore, its rapid adaptability is highlighted by its ability to generalize defenses across diverse attack types from only a limited number of training examples.
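For intuition on the random-guess floor mentioned above, the sketch below scores a pipeline on a WMDP-style four-choice set: once sensitive answers are deflected, the parsed answers degrade to guessing, so accuracy bottoms out near 25%. The data format, parser, and `aegis_pipeline` (from the routing sketch earlier) are illustrative assumptions, not the paper's evaluation harness.

```python
import random
import re

# Illustrative harness for a WMDP-style four-choice benchmark; the data
# format and answer parser are assumptions, not the paper's evaluation code.

CHOICES = "ABCD"  # four options per question, so chance accuracy is 1/4 = 25%

def extract_choice(response: str) -> str:
    """Naive answer parser: first standalone A-D letter in the response.
    A deflected (refusal) response usually carries no answer letter, so we
    fall back to a random guess; that is what drives accuracy toward 25%."""
    match = re.search(r"\b([ABCD])\b", response.upper())
    return match.group(1) if match else random.choice(CHOICES)

def accuracy(questions: list[tuple[str, str]], answer_fn) -> float:
    """questions: list of (prompt, gold_letter) pairs; answer_fn maps a
    prompt to a model response, e.g. the `aegis_pipeline` sketch above."""
    hits = sum(extract_choice(answer_fn(q)) == gold for q, gold in questions)
    return hits / len(questions)

# Effective suppression pushes accuracy on the sensitive subset from the base
# model's score down toward ~0.25, while retain sets (e.g. MMLU) stay high.
```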
Implications and Future Directions
AegisLLM’s architecture presents several implications:
- Dynamic Security Enhancements: The approach shifts focus from static model modifications to dynamic, runtime security optimizations, enabling real-time robustness against emerging threats.
- Modular and Scalable Framework: The agentic design supports modular additions or alterations, potentially accommodating a wider range of security categories simply by configuring agent roles.
AegisLLM offers substantial potential to advance AI safety through modular, inference-time security systems. Its adaptability to rapidly evolving threat landscapes points to a promising direction for autonomous, scalable, and efficient security mechanisms in LLMs, paving the way for future exploration of more generalized AI safety frameworks.