An Examination of Complex Function Calling Evaluation in Long-Context Scenarios with ComplexFuncBench
The research paper "ComplexFuncBench: Exploring Multi-Step and Constrained Function Calling under Long-Context Scenario" addresses a core challenge in deploying LLMs in real-world applications: such applications demand complex, multi-step, constraint-driven function calls executed within long-context scenarios. The work starts from a recognized limitation of current LLMs, their lack of real-time, updateable knowledge, and responds by developing a benchmark that assesses LLMs' ability to execute intricate function calls against external APIs.
Core Contributions and Methodology
The paper introduces ComplexFuncBench, a comprehensive benchmark designed to evaluate LLMs' ability to execute multi-step and constrained function calls. The benchmark provides a structured framework in which tasks require sophisticated parameter reasoning and use of long-context memory, capturing a more realistic and challenging set of function-calling scenarios than previous benchmarks. Key features of ComplexFuncBench include (a hypothetical task sketch follows the list):
- Long parameter values and long contexts extending up to 128k tokens.
- Constrained inference requirements, necessitating nuanced parameter value reasoning.
- Scenarios that reflect realistic, intricate tasks requiring API call sequences over several steps.
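To make these features concrete, below is a hedged, invented example of what a multi-step, constrained task of this kind can look like. The API names (`search_hotels`, `get_availability`) and all field names are hypothetical illustrations, not the benchmark's actual schema:

```python
# Hypothetical multi-step, constrained function-call task (invented schema).

query = ("Find a hotel in Paris for March 3-5 under $200/night "
         "and check availability for the cheapest one.")

# Step 1: the model must turn the soft constraint "under $200/night"
# into a concrete parameter value.
call_1 = {
    "name": "search_hotels",
    "arguments": {"city": "Paris", "check_in": "2025-03-03",
                  "check_out": "2025-03-05", "max_price_per_night": 200},
}
response_1 = {"hotels": [{"id": "h_812", "price": 149},
                         {"id": "h_377", "price": 185}]}

# Step 2: the hotel_id argument never appears in the query; it must be
# derived by reasoning over the previous (potentially very long) response.
cheapest = min(response_1["hotels"], key=lambda h: h["price"])
call_2 = {
    "name": "get_availability",
    "arguments": {"hotel_id": cheapest["id"],  # "h_812", taken from step 1
                  "check_in": "2025-03-03", "check_out": "2025-03-05"},
}
print(call_2)
```

The difficulty the benchmark targets lives in exactly these two moves: constraint-to-parameter reasoning in step 1, and cross-step value propagation in step 2.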
To support accurate evaluation, the authors also propose ComplexEval, an automated evaluation framework that assesses function-call correctness through a multi-dimensional matching process. It combines rule-based, response-based, and LLM-based matching strategies, capturing subtle variations beyond exact string matches and allowing alternative expressions of equivalent calls, a recognized weakness of current evaluation methodologies.
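The following sketch shows one way such a multi-dimensional matcher can be cascaded, running the cheapest, strictest check first and escalating only on failure. It is a minimal illustration under assumed helper names (`rule_match`, `response_match`, `llm_judge`); the paper's actual implementation may differ:

```python
# Minimal sketch of a cascaded call matcher in the spirit of ComplexEval
# (function names and structure are assumptions for illustration).

def rule_match(pred: dict, gold: dict) -> bool:
    """Exact match on function name and arguments."""
    return pred["name"] == gold["name"] and pred["arguments"] == gold["arguments"]

def response_match(pred_resp, gold_resp) -> bool:
    """Treat two calls as equivalent if they yield the same API response."""
    return pred_resp is not None and pred_resp == gold_resp

def llm_judge(pred: dict, gold: dict) -> bool:
    """Placeholder: in a real pipeline, ask a judge LLM whether the calls are
    semantically equivalent (e.g., "NYC" vs "New York City")."""
    return False  # wire an actual judge model in here

def is_correct(pred, gold, pred_resp=None, gold_resp=None) -> bool:
    # Cheap, strict checks first; fall back to progressively looser ones.
    if rule_match(pred, gold):
        return True
    if response_match(pred_resp, gold_resp):
        return True
    return llm_judge(pred, gold)
```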
Benchmarking Insights and Experimental Analysis
The experimental results presented in the paper underscore significant deficiencies in current LLMs' function-calling capabilities: even state-of-the-art models fail to reliably infer and manage the dependencies and constraints inherent in complex API calls. The detailed error analysis shows that parameter value errors are the predominant failure mode, marking a critical target for further work.
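To ground that dominant error class, here is an invented illustration of a parameter value error (the schema and values are hypothetical): the model fabricates a plausible-looking identifier instead of copying the one returned earlier in the context.

```python
# Invented illustration of a parameter value error (hypothetical schema).

previous_response = {"hotels": [{"id": "h_812", "name": "Hotel Lumiere"}]}

# Gold call: propagates the id returned by the earlier search step.
gold_call = {"name": "get_availability", "arguments": {"hotel_id": "h_812"}}

# Erroneous call: the model hallucinates an id rather than retrieving it
# from the long context, so the downstream API call fails.
model_call = {"name": "get_availability",
              "arguments": {"hotel_id": "hotel_lumiere_01"}}

assert model_call["arguments"]["hotel_id"] != gold_call["arguments"]["hotel_id"]
```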
Table comparisons within the paper show that existing function-calling benchmarks commonly fall short on complex, multi-turn exchanges because they rely on simple prompt-based methods or limited rule-based assessments. In contrast, ComplexFuncBench provides a broader and more intricate testbed, better suited to examining state-of-the-art LLMs and to surfacing domain-specific failures, particularly in parameter value reasoning.
Implications and Future Developments
ComplexFuncBench and its associated evaluation methodology mark a significant step toward LLMs with function-calling abilities robust enough for realistic, execution-driven applications. The research highlights the importance of refining LLM architectures and training environments to better reflect real-world conditions involving extensive parameter inference and API interaction.
Practically, enhancing LLM function calling abilities could transform applications that rely heavily on real-time data and multi-step processes, such as booking systems, information retrieval across disparate databases, and complex decision support systems.
Conclusion
In conclusion, this paper advances the state of LLM function-calling evaluation with a benchmark that captures the complexities of real-world scenarios. While the results indicate substantial room for improvement in current LLM capabilities, they also chart a path toward models that handle greater complexity and tighter constraints more effectively. As function calling becomes integral to LLM applications, ongoing iterations of benchmarks like ComplexFuncBench will be pivotal in assessing and guiding future development.