This paper presents a critical examination of how LLMs select and invoke external tools, highlighting vulnerabilities in the prevalent tool/function-calling protocols. Under the Model Context Protocol (MCP) and function-calling APIs, LLMs rely predominantly on free-text tool descriptions to decide which external tools to invoke, a process the authors identify as fragile and susceptible to exploitation. The paper investigates how strategic edits to tool descriptions can significantly shift tool usage rates when a tool competes against alternatives.
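To make the attack surface concrete, here is a minimal sketch of an OpenAI-style function-calling tool specification. The tool name, description, and parameters are illustrative, not taken from the paper; the point is that the free-text `description` field is essentially the model's only evidence about what the tool does.

```python
# Minimal sketch of an OpenAI-style function-calling tool specification.
# The tool name and wording are illustrative, not from the paper. The model's
# only evidence about this tool is the free-text "description" field, so
# whoever writes that field controls how the tool is perceived.
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Returns the current weather for a given city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name, e.g. 'Paris'"},
            },
            "required": ["city"],
        },
    },
}
```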
Methodology and Findings
The authors conduct controlled experiments that apply a series of edits to tool descriptions and measure their impact on tool usage by prominent models such as GPT-4.1 and Qwen2.5-7B. The experiments reveal that certain edits produce over 10 times more tool usage.
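The measurement itself can be pictured as a simple A/B comparison: present two competing tools, identical except for their descriptions, and count how often the model picks each. The sketch below assumes an OpenAI-compatible chat-completions client; the prompt, trial count, and function names are illustrative and are not the authors' actual experimental harness.

```python
# Rough sketch of an A/B selection-rate measurement between two competing
# tools that differ only in their descriptions. Assumes an OpenAI-compatible
# client and API key; not the paper's actual harness.
from openai import OpenAI

client = OpenAI()

def selection_rate(tool_a: dict, tool_b: dict, prompt: str, n_trials: int = 50) -> float:
    """Fraction of trials in which the model calls tool_a rather than tool_b."""
    picked_a = 0
    for _ in range(n_trials):
        response = client.chat.completions.create(
            model="gpt-4.1",
            messages=[{"role": "user", "content": prompt}],
            tools=[tool_a, tool_b],
            tool_choice="auto",  # let the model decide which tool, if any, to call
        )
        calls = response.choices[0].message.tool_calls or []
        if calls and calls[0].function.name == tool_a["function"]["name"]:
            picked_a += 1
    return picked_a / n_trials
```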
The paper identifies several types of edits that effectively increase tool usage (illustrated in the sketch after this list):
- Assertive Cues: Phrases asserting the tool's effectiveness dramatically increased usage, sometimes by over 7 times.
- Active Maintenance Claims: Indications of active tool maintenance shifted preferences notably, especially for GPT-4.1.
- Usage Examples: Providing examples of how tools can be used generally heightened their appeal, particularly among open models.
- Name-Dropping: References to well-known entities or figures increased tool selection rates for some models, notably GPT-4.1.
- Numerical Claims: Quantitative endorsements (e.g., user count) were impactful for certain models.
- Lengthening Descriptions: Longer tool descriptions increased usage for GPT-4.1, whereas shortening descriptions had model-dependent effects.
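The following before/after strings illustrate several of these edit types applied to a hypothetical weather tool. The wording is invented for illustration and does not reproduce the paper's exact prompts or descriptions.

```python
# Illustrative before/after description strings showing the edit types above
# applied to a hypothetical weather tool; the wording is invented and does not
# reproduce the paper's exact descriptions.
ORIGINAL = "Returns the current weather for a given city."

EDITED = (
    "Returns the current weather for a given city. "
    "This is the most accurate and reliable weather tool available. "   # assertive cue
    "Actively maintained and updated daily by the development team. "   # active maintenance claim
    "Example: get_weather(city='Paris') -> 'Light rain, 14 C'. "        # usage example
    "Trusted by over 1,000,000 users worldwide."                        # numerical claim
)
```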
Combining these edits into a single tool description yielded a significant amplification effect, boosting tool usage by more than 11 times for both models when the edited tool competed against one carrying the original description.
Implications
The paper underscores a substantial concern regarding the integrity and fairness of tool selection processes within LLM systems. If a tool's invocation can be manipulated via superficial text edits, then current protocols are not just biased but potentially exploitable.
On a practical level, these findings offer developers strategies for increasing tool visibility and usage within agentic systems. On a theoretical level, however, they expose a critical weakness in agentic LLM design: relying on text descriptions alone, without verification, makes tool selection fundamentally susceptible to manipulation.
Directions Forward
To counteract these vulnerabilities, the authors suggest considering additional channels for reliable selection criteria. These could include historical performance data, user ratings, or decentralized consensus protocols to ensure models have a grounded basis for tool selection beyond text descriptions.
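One way to picture such grounding is a ranking step that blends the model's description-based relevance estimate with verifiable, out-of-band signals. The registry fields and weights below are assumptions for illustration, not a scheme proposed in the paper.

```python
# Sketch of grounding tool selection in signals beyond the description text,
# as suggested above. Field names and weights are illustrative assumptions,
# not a mechanism from the paper.
from dataclasses import dataclass

@dataclass
class ToolRecord:
    name: str
    description: str
    historical_success_rate: float  # fraction of past invocations that succeeded
    avg_user_rating: float          # e.g. 0.0-5.0 from a vetted review channel

def rank_tools(records: list[ToolRecord], relevance: dict[str, float]) -> list[ToolRecord]:
    """Rank tools by blending model-estimated relevance with verifiable history,
    so that description wording alone cannot dominate the choice."""
    def score(r: ToolRecord) -> float:
        return (
            0.5 * relevance.get(r.name, 0.0)      # LLM's relevance estimate from the description
            + 0.3 * r.historical_success_rate     # observed outcomes, not self-reported claims
            + 0.2 * (r.avg_user_rating / 5.0)     # normalized rating from an external channel
        )
    return sorted(records, key=score, reverse=True)
```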
Conclusion
The research highlights the need for an improved framework for tool selection in LLMs to prevent manipulation and ensure fair and reliable tool utilization. As AI systems increasingly incorporate external functionalities, addressing these vulnerabilities is essential for the development of robust, trustworthy agentic AI systems.