
AGENTIF: Benchmarking Instruction Following of Large Language Models in Agentic Scenarios (2505.16944v1)

Published 22 May 2025 in cs.AI and cs.CL

Abstract: LLMs have demonstrated advanced capabilities in real-world agentic applications. Growing research efforts aim to develop LLM-based agents to address practical demands, introducing a new challenge: agentic scenarios often involve lengthy instructions with complex constraints, such as extended system prompts and detailed tool specifications. While adherence to such instructions is crucial for agentic applications, whether LLMs can reliably follow them remains underexplored. In this paper, we introduce AgentIF, the first benchmark for systematically evaluating LLM instruction following ability in agentic scenarios. AgentIF features three key characteristics: (1) Realistic, constructed from 50 real-world agentic applications. (2) Long, averaging 1,723 words with a maximum of 15,630 words. (3) Complex, averaging 11.9 constraints per instruction, covering diverse constraint types, such as tool specifications and condition constraints. To construct AgentIF, we collect 707 human-annotated instructions across 50 agentic tasks from industrial application agents and open-source agentic systems. For each instruction, we annotate the associated constraints and corresponding evaluation metrics, including code-based evaluation, LLM-based evaluation, and hybrid code-LLM evaluation. We use AgentIF to systematically evaluate existing advanced LLMs. We observe that current models generally perform poorly, especially in handling complex constraint structures and tool specifications. We further conduct error analysis and analytical experiments on instruction length and meta constraints, providing some findings about the failure modes of existing LLMs. We have released the code and data to facilitate future research.

Summary

AgentIF: Evaluating Instruction Adherence in Agentic Contexts

LLMs have demonstrated considerable potential in real-world agentic scenarios, especially as they evolve into autonomous agents capable of executing complex tasks based on detailed instructions. As agentic applications proliferate, however, reliably following lengthy instructions with complex constraints remains a challenge. This paper introduces AgentIF, a benchmark developed to evaluate the instruction-adherence capabilities of LLMs specifically within agentic scenarios.

Benchmark Characteristics and Construction

AgentIF is the first benchmark to systematically evaluate how well LLMs follow instructions drawn from real-world agentic contexts. Its instructions are realistic, long, and complex:

  1. Realistic Instructions: The benchmark is constructed from 50 actual agentic applications derived from both industrial agents and open-source agentic systems. This roots the benchmark in practicality and reflects genuine application standards.
  2. Length and Complexity: Instructions average 1,723 words, with some reaching 15,630 words, and contain approximately 11.9 constraints each, spanning diverse constraint types such as formatting requirements and tool specifications. This complexity reflects the level of detail typically required in real-world agentic scenarios.

The benchmark comprises 707 human-annotated instructions collected across the 50 agentic tasks. For each instruction, the authors annotate the associated constraints and assign each one an evaluation method: code-based, LLM-based, or a hybrid of the two.
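To make this annotation scheme concrete, the following is a minimal sketch of how one instruction record and its mixed evaluation might be represented. The field names (`constraint_text`, `eval_type`, `checker`) and the dispatch logic are illustrative assumptions, not the schema released with AgentIF.

```python
# Illustrative sketch only: field names and dispatch logic are assumptions,
# not the official AgentIF data schema or evaluation code.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Constraint:
    constraint_text: str              # natural-language statement of the constraint
    eval_type: str                    # "code", "llm", or "hybrid"
    checker: Optional[Callable[[str], bool]] = None  # code-based check, if any

@dataclass
class AgentInstruction:
    system_prompt: str                # long agentic instruction (avg. ~1,723 words)
    constraints: list[Constraint]     # avg. ~11.9 constraints per instruction

def judge_with_llm(response: str, constraint: str) -> bool:
    """Placeholder for an LLM judge returning True if the constraint is met."""
    raise NotImplementedError

def evaluate_constraint(response: str, c: Constraint) -> bool:
    if c.eval_type == "code":
        return c.checker(response)                       # deterministic check
    if c.eval_type == "llm":
        return judge_with_llm(response, c.constraint_text)
    # hybrid: deterministic filter first, then LLM judgment
    return c.checker(response) and judge_with_llm(response, c.constraint_text)
```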

Evaluation and Findings

Applying AgentIF to current advanced LLMs, the researchers found that models generally perform poorly, particularly on complex constraint structures and tool specifications. Even the strongest models perfectly followed fewer than 30% of the instructions, underscoring how demanding these real-world instructions remain.
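A hedged sketch of why the instruction-level figure can be so low: if "perfectly following" an instruction means satisfying every one of its constraints, the instruction-level score drops quickly with the number of constraints even when most individual constraints are met. The helper below illustrates that distinction; it is a generic metric sketch, not the paper's scoring implementation.

```python
# Generic sketch of instruction-level vs. constraint-level scoring;
# not the official AgentIF scoring code.
def follow_rates(results: list[list[bool]]) -> tuple[float, float]:
    """results[i][j] is True if constraint j of instruction i was satisfied."""
    perfect = sum(all(r) for r in results) / len(results)            # all constraints met
    per_constraint = sum(sum(r) for r in results) / sum(len(r) for r in results)
    return perfect, per_constraint

# Example: with ~12 constraints per instruction, even 90% per-constraint
# accuracy would yield roughly 0.9**12 ≈ 28% perfect-follow if errors were independent.
```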

Error analysis highlighted prevalent failure modes, particularly around condition constraints and tool-usage instructions. LLMs often fail to track and adhere to lengthy, specification-heavy instructions, revealing a significant gap in instruction-following reliability.

Implications for Future AI Development

The implications of this research are significant for the development of AI agents. As LLM-based agents are envisioned for deployment in varied domains, establishing benchmarks such as AgentIF is crucial for identifying and addressing the fundamental shortfalls in instruction adherence capabilities.

The findings emphasize the necessity to enhance LLM robustness in managing long and complex instruction sets. Future developments in AI should consider improving the fidelity of LLMs to complex instructions through advanced reinforcement learning techniques or enriched training datasets with long-context instructions. Additionally, refining model design to better handle conditional constraints and specification-laden tasks could enhance their utility in agentic applications.

Conclusion

AgentIF stands as a robust benchmark, filling the gap in evaluating LLM performance in complex, realistic agentic scenarios. The insights drawn from this research can guide future advancements in LLM development, emphasizing the urgent need for improved instruction adherence capabilities to meet the demands of real-world applications effectively. The paper urges the research community to iterate upon existing models and develop more sophisticated training mechanisms to harness LLMs' full potential in agentic scenarios.
