Papers

Topics

Authors

Recent

View all

Gemini 2.5 Flash

Gemini 2.5 Flash 86 tok/s

Gemini 2.5 Pro 60 tok/s Pro

GPT-5 Medium 28 tok/s

GPT-5 High 34 tok/s Pro

GPT-4o 72 tok/s

GPT OSS 120B 441 tok/s Pro

Kimi K2 200 tok/s Pro

2000 character limit reached

LlamaFirewall: An open source guardrail system for building secure AI agents (2505.03574v1)

Published 6 May 2025 in cs.CR and cs.AI

Abstract: LLMs have evolved from simple chatbots into autonomous agents capable of performing complex tasks such as editing production code, orchestrating workflows, and taking higher-stakes actions based on untrusted inputs like webpages and emails. These capabilities introduce new security risks that existing security measures, such as model fine-tuning or chatbot-focused guardrails, do not fully address. Given the higher stakes and the absence of deterministic solutions to mitigate these risks, there is a critical need for a real-time guardrail monitor to serve as a final layer of defense, and support system level, use case specific safety policy definition and enforcement. We introduce LlamaFirewall, an open-source security focused guardrail framework designed to serve as a final layer of defense against security risks associated with AI Agents. Our framework mitigates risks such as prompt injection, agent misalignment, and insecure code risks through three powerful guardrails: PromptGuard 2, a universal jailbreak detector that demonstrates clear state of the art performance; Agent Alignment Checks, a chain-of-thought auditor that inspects agent reasoning for prompt injection and goal misalignment, which, while still experimental, shows stronger efficacy at preventing indirect injections in general scenarios than previously proposed approaches; and CodeShield, an online static analysis engine that is both fast and extensible, aimed at preventing the generation of insecure or dangerous code by coding agents. Additionally, we include easy-to-use customizable scanners that make it possible for any developer who can write a regular expression or an LLM prompt to quickly update an agent's security guardrails.

Collections

Summary

LlamaFirewall: An Open Source Security Framework for AI Agents

The paper "LlamaFirewall: An open source guardrail system for building secure AI agents" presents a comprehensive framework designed to address emerging security concerns posed by LLMs as they transition from traditional chatbot functionalities to autonomous agents capable of executing complex tasks. This transition introduces significant risks, including prompt injection, agent misalignment, and insecure code generation, which existing security mechanisms insufficiently mitigate.

Framework Overview

LlamaFirewall is developed with the intention to serve as a robust, layered security architecture for applications powered by LLMs. Its modular design encompasses three primary components:

PromptGuard 2: This element of LlamaFirewall is a refined jailbreak detector. Built using a BERT-style architecture, PromptGuard 2 offers real-time identification of universal jailbreak attempts, showcasing exceptional accuracy and latency performance. The paper emphasizes strong numerical results of an 88.7% recall at 1% false positive rate in English datasets, marking its efficacy in detecting explicit jailbreak techniques efficiently.
AlignmentCheck: As an experimental tool, AlignmentCheck scrutinizes the agent’s chain-of-thought reasoning in real-time to detect goal divergence induced by prompt injection. Powered by advanced LLM models such as Llama 4 Maverick, this component effectively reduces attack success rates by 83%, according to empirical results presented.
CodeShield: This is an online static analysis engine designed to intercept insecure or malicious code patterns. CodeShield is extensible across multiple programming languages, enabling syntax-aware detection using Semgrep and regex-based rules. It asserts a precision rate of 96% in cybersecurity coding benchmarks, facilitating secure code generation by AI agents.

Implications and Applications

The deployment of LlamaFirewall by Meta signifies its maturation into a practical tool for safeguarding LLM applications. By releasing it as open-source software, the framework invites collaboration to address the diverse threat landscape associated with AI autonomy. LlamaFirewall strives not only to secure AI development internally within Meta but also aims to establish a communal security foundation akin to Snort or YARA in traditional cybersecurity landscapes.

Future Directions

The paper acknowledges several areas for future exploration to bolster the efficacy of LlamaFirewall. Among these are expansions into multimodal security support, latency improvements for high-throughput environments, and broader risk coverage to include tool-use security. Moreover, the authors speculate about developing realistic benchmarks that reflect complex agent workflows and adversarial scenarios.

Conclusion

In presenting LlamaFirewall, the authors contribute a critical piece to the evolving puzzle of AI security. As LLMs continue to gain autonomy, frameworks like LlamaFirewall become indispensable, ensuring these technologies are deployed with the necessary safeguards to prevent misuse. LlamaFirewall exemplifies proactive engagement with AI security challenges and represents an essential step forward in maintaining trust and reliability in AI-driven processes.

Paper Prompts

Explore 10 Community Prompts

Follow-up Questions

We haven't generated follow-up questions for this paper yet.

Generate Now

Authors (19)

First 10 authors:

Tweets

https://twitter.com/arnitly/status/1963585546548060557

https://twitter.com/FpeSre/status/1936393516919250949

https://twitter.com/ai_studioxyz/status/1951054388371529870

YouTube

Show All Videos