Introduction
The rapid integration of LLMs into AI agents represents a leap forward in modern technology. However, it also raises intricate challenges around the safety and trustworthiness of these agents, particularly when they are given the power to affect the physical world. Addressing the knowledge gap surrounding the trustworthiness of LLM-based agents, the paper presents TrustAgent, a framework structured around the concept of an Agent Constitution. The paper contributes significantly to understanding how safety can be embedded at the root of agent behavior through a tripartite safety strategy.
Related Work
LLMs such as GPT-3.5, GPT-4, and Claude have set benchmarks in understanding and generating human-like text, enabling AI agents that complete a variety of tasks by controlling external tools. Despite this capability, the safety protocols of LLM-based agents remain underdeveloped, as highlighted by Ruan et al. (2023). Related efforts include ToolEmu (Ruan et al., 2023), an LLM-based emulation framework for identifying risks in such agents, and AgentMonitor (Naihin et al., 2023), which gauges and improves agent safety in the wild.
TrustAgent Framework
At the core of TrustAgent lies the Agent Constitution, a set of safety regulations derived from legal statutes and practical wisdom across various domains. On top of this constitution, TrustAgent applies a three-tiered safety approach: pre-planning, in-planning, and post-planning strategies. The aim is to instill safety knowledge before planning begins, enforce safety while actions are being planned, and validate the resulting plan through comprehensive inspection before execution.
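The sketch below illustrates one way such a three-stage pipeline could be wired together. It is a minimal illustration under stated assumptions, not the paper's implementation: the names (SafetyRegulation, propose_action, is_safe, inspect) and the naive keyword-based rule retrieval are placeholders introduced here for clarity.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SafetyRegulation:
    """One rule in the Agent Constitution (e.g., drawn from statutes or domain practice)."""
    domain: str
    rule: str


def pre_planning(task: str, constitution: List[SafetyRegulation]) -> str:
    """Pre-planning: instill safety knowledge by prepending relevant regulations to the prompt."""
    relevant = [r.rule for r in constitution if r.domain.lower() in task.lower()]
    return "Safety regulations:\n" + "\n".join(relevant) + "\n\nTask: " + task


def in_planning(
    prompt: str,
    propose_action: Callable[[str, List[str]], str],
    is_safe: Callable[[str], bool],
    max_steps: int = 10,
) -> List[str]:
    """In-planning: enforce safety while the plan is generated, one action at a time."""
    plan: List[str] = []
    for _ in range(max_steps):
        action = propose_action(prompt, plan)  # e.g., one LLM call per planning step
        if action == "DONE":
            break
        if is_safe(action):  # only actions that pass the regulation check enter the plan
            plan.append(action)
        # in a full system, an unsafe action would be sent back to the model for revision
    return plan


def post_planning(plan: List[str], inspect: Callable[[List[str]], bool]) -> List[str]:
    """Post-planning: inspect the complete plan and refuse execution if it is unsafe."""
    if not inspect(plan):
        raise ValueError("Plan rejected by post-planning safety inspection")
    return plan
```

In practice, the propose_action, is_safe, and inspect callables would themselves be backed by LLM calls or learned safety critics; the point of the sketch is only the ordering of the three stages, with safety knowledge injected before planning, checked during planning, and re-verified before any action touches the world.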
Testing with several closed- and open-source LLMs, the paper demonstrates that TrustAgent markedly improves agents' safety without compromising their helpfulness. For example, GPT-4 equipped with TrustAgent exhibited both the highest safety awareness and the highest helpfulness across domains, suggesting that TrustAgent effectively steers agents toward actions that are both safer and more useful.
Safety and Trustworthiness Interplay
An interesting facet of TrustAgent is how it reveals the inherent relationship between an agent's reasoning capability and its capacity to act safely. While the framework itself is adept at preempting unsafe outcomes, an agent's innate reasoning ability remains indispensable for planning successfully in complicated scenarios. For example, GPT-4's performance after TrustAgent was applied was notable: it not only maintained safety according to the Agent Constitution but also produced logically ordered action sequences that helped accomplish users' tasks.
Conclusion
This paper sheds light on the imperative of intertwining safety with the overall design of LLM-based agents. The safety strategies proposed by TrustAgent set a precedent for future work aiming to build trustworthy AI systems. Future research directions include expanding the definition of trustworthiness beyond safety to encompass aspects such as fairness, explainability, and robustness. While TrustAgent represents an important stride in safe agent design, advancing the reasoning faculties of LLMs emerges as a critical axis for future development. The goal is a holistic approach, ensuring that agents not only perform tasks safely but also align closely with broader ethical and trustworthy AI principles.