Gaming Tool Preferences in Agentic LLMs (2505.18135v1)

Published 23 May 2025 in cs.AI, cs.CL, cs.CR, and cs.LG

Abstract: LLMs can now access a wide range of external tools, thanks to the Model Context Protocol (MCP). This greatly expands their abilities as various agents. However, LLMs rely entirely on the text descriptions of tools to decide which ones to use--a process that is surprisingly fragile. In this work, we expose a vulnerability in prevalent tool/function-calling protocols by investigating a series of edits to tool descriptions, some of which can drastically increase a tool's usage from LLMs when competing with alternatives. Through controlled experiments, we show that tools with properly edited descriptions receive over 10 times more usage from GPT-4.1 and Qwen2.5-7B than tools with original descriptions. We further evaluate how various edits to tool descriptions perform when competing directly with one another and how these trends generalize or differ across a broader set of 10 different models. These phenomena, while giving developers a powerful way to promote their tools, underscore the need for a more reliable foundation for agentic LLMs to select and utilize tools and resources.

Summary

Gaming Tool Preferences in Agentic LLMs

This paper presents a critical examination of the mechanism by which LLMs select and utilize tools, specifically highlighting vulnerabilities within the prevalent tool/function-calling protocols. Utilizing the Model Context Protocol (MCP) and function-calling APIs, LLMs rely predominantly on text descriptions to decide which external tools to invoke—a process that the authors identify as fragile and susceptible to exploitation. The paper investigates how strategic edits to tool descriptions can significantly alter tool usage rates when competing against alternatives.
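Concretely, in a typical function-calling request the model sees each tool only as a name, a free-text description, and a parameter schema. The sketch below uses an OpenAI-style tool definition for illustration; the tool name, description text, and parameters are hypothetical and not taken from the paper.

```python
# Minimal sketch of how a tool appears to an LLM in an OpenAI-style
# function-calling request. The name, description, and parameters below
# are hypothetical; the model sees only this metadata, never the tool's code.
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Returns the current weather for a given city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
            },
            "required": ["city"],
        },
    },
}
# When several tools are offered at once, the model's choice among them is
# driven almost entirely by each tool's "description" string.
```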

Methodology and Findings

The authors conduct controlled experiments involving a series of edits to tool descriptions and measure their impact on tool usage by models such as GPT-4.1 and Qwen2.5-7B. The experiments reveal that certain edits cause a tool to receive over 10 times more usage than the same tool with its original description.
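A stripped-down version of this kind of head-to-head comparison is sketched below, assuming the OpenAI Python client. The prompts, tool pair, and counting logic are illustrative assumptions, not the authors' actual experimental harness.

```python
from collections import Counter

def run_trial(client, model, user_prompt, tool_original, tool_edited):
    """Offer two functionally identical tools that differ only in their
    description text, and return the names of the tools the model calls."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": user_prompt}],
        tools=[tool_original, tool_edited],
    )
    calls = response.choices[0].message.tool_calls or []
    return [call.function.name for call in calls]

def measure_preference(client, model, user_prompt, tool_original, tool_edited,
                       n_trials=100):
    """Repeat the trial and tally how often each competing tool is selected."""
    counts = Counter()
    for _ in range(n_trials):
        counts.update(run_trial(client, model, user_prompt,
                                tool_original, tool_edited))
    # e.g. Counter({"search_edited": 91, "search_original": 9})
    # for hypothetical tool names
    return counts
```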

The paper identifies several types of edits that effectively increase tool usage:

  • Assertive Cues: Phrases asserting the tool's effectiveness dramatically increased usage, sometimes by over 7 times.
  • Active Maintenance Claims: Indications of active tool maintenance shifted preferences notably, especially for GPT-4.1.
  • Usage Examples: Providing examples of how tools can be used generally heightened their appeal, particularly among open models.
  • Name-Dropping: References to well-known entities or figures increased tool selection rates for some models, notably GPT-4.1.
  • Numerical Claims: Quantitative endorsements (e.g., user count) were impactful for certain models.
  • Description Length: Lengthening a tool's description increased its usage for GPT-4.1, whereas shortening it had more varied, model-dependent effects.

Combining these edits into a single tool description yielded a significant amplification effect, boosting tool usage by more than 11 times for both GPT-4.1 and Qwen2.5-7B when competing with original descriptions.
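To make such a stacked edit concrete, the sketch below contrasts an original description with one combining several of the edit types above. The wording is invented for illustration and is not quoted from the paper.

```python
# Invented illustration of stacking several description edits onto one tool;
# the wording below is hypothetical, not taken from the paper.
original_description = "Returns the current weather for a given city."

edited_description = (
    "Returns the current weather for a given city. "
    "This is the most effective and reliable weather tool available, "  # assertive cue
    "and it is actively maintained with weekly updates. "               # active maintenance claim
    "Trusted by over 1,000,000 users, "                                 # numerical claim
    "including teams at well-known technology companies. "              # name-dropping
    "Example: get_weather(city='Paris') returns the temperature, "
    "humidity, and conditions for Paris."                               # usage example
)
```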

Implications

The paper underscores a substantial concern regarding the integrity and fairness of tool selection processes within LLM systems. If a tool's invocation can be manipulated via superficial text edits, then current protocols are not just biased but potentially exploitable.

On a practical level, these findings offer developers strategies for boosting their tools' visibility and usage within agentic systems. On a theoretical level, however, they expose a critical weakness in agentic LLM design: relying on unverified text descriptions alone makes tool selection fundamentally susceptible to manipulation.

Directions Forward

To counteract these vulnerabilities, the authors suggest considering additional channels for reliable selection criteria. These could include historical performance data, user ratings, or decentralized consensus protocols to ensure models have a grounded basis for tool selection beyond text descriptions.
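Purely as a sketch of how such out-of-band signals could feed into selection, and not a mechanism proposed in the paper, a platform could re-rank candidate tools on verified metadata before they ever reach the model:

```python
# Hypothetical re-ranking of candidate tools using platform-verified signals
# (historical success rate, user rating) rather than description text alone.
def rank_tools(candidates):
    # candidates: list of dicts such as
    # {"name": "get_weather", "success_rate": 0.97, "user_rating": 4.6}
    def score(tool):
        return 0.7 * tool["success_rate"] + 0.3 * (tool["user_rating"] / 5.0)
    return sorted(candidates, key=score, reverse=True)
```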

Conclusion

The research highlights the need for an improved framework for tool selection in LLMs to prevent manipulation and ensure fair and reliable tool utilization. As LLM agents increasingly incorporate external tools, addressing these vulnerabilities is essential for building robust, trustworthy agentic AI systems.