- The paper introduces Granite-20B-FunctionCalling, an LLM that enhances function calling abilities through a multi-task learning approach on granular tasks.
- It unifies training data in a consistent JSON format and tackles six specialized sub-tasks, achieving notable improvements in F1, longest common subsequence (LCS), and exact match scores.
- The model demonstrates strong generalizability and competitive benchmark performance, ranking fourth on the BFCL and excelling in function name detection.
Granite-20B-FunctionCalling: Multi-Task Learning for Enhanced Function Calling
This paper introduces Granite-20B-FunctionCalling, a new LLM designed to improve function calling abilities through multi-task learning on granular tasks (2407.00121). The model is instruction-tuned using a diverse set of datasets and tasks related to function calling, with a focus on granular sub-tasks such as function name detection, parameter-value pair detection, and function sequencing. The paper presents a comprehensive evaluation of Granite-20B-FunctionCalling against other open and proprietary models, demonstrating its strong performance and generalizability in out-of-domain settings.
Methodology and Training Data
The core contribution of this work is the multi-task training approach, which reuses data in different formats with distinct instructions for various function-calling related tasks. The authors identified six underlying sub-tasks for function calling, dividing them into two categories: High-Level Function Calling Tasks (Nested Function Calling, Function Chaining, Parallel Functions) and Low-Level Function Calling Tasks (Next-Best Function, Function Name Detection, Parameter-Value Pair Detection). A seventh task, Response Generation, was included to ensure the model could produce natural language responses.
Figure 1: Step-by-step building process of Granite-20B-FunctionCalling.
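To make the data reuse concrete, here is a minimal sketch of how one annotated example might be re-rendered into two of the low-level sub-tasks. The record fields and instruction wording are illustrative assumptions, not the paper's exact templates.

```python
# Hypothetical illustration of reusing one annotated example across two
# low-level sub-tasks; field names and instruction text are assumptions.
example = {
    "query": "What is the weather in Paris tomorrow?",
    "call": {"name": "get_weather",
             "arguments": {"city": "Paris", "date": "tomorrow"}},
}

def function_name_detection(ex: dict) -> dict:
    """Render the example so the target is only the function name."""
    return {
        "instruction": "Given the user query, output the name of the function to call.",
        "input": ex["query"],
        "output": ex["call"]["name"],
    }

def parameter_value_detection(ex: dict) -> dict:
    """Render the same example so the target is the argument JSON."""
    return {
        "instruction": "Given the query and function name, output the argument JSON.",
        "input": f'{ex["query"]} | function: {ex["call"]["name"]}',
        "output": ex["call"]["arguments"],
    }
```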
The training data was unified into a consistent JSON format to represent APIs, tools, and functions. This unification process ensures that the model can effectively parse and utilize information from various sources. The model output representation of function calls follows a standardized JSON format:
```json
{
  "name": "<FUNCTION-NAME>",
  "arguments": {"<PARAMETER-1>": "VALUE-1", "<PARAMETER-2>": ...}
}
```
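Output in this schema can be consumed with ordinary JSON tooling. The helper below is a minimal sketch that assumes the model emits exactly one well-formed JSON object per call; real completions may need extraction or validation first.

```python
import json

def parse_function_call(completion: str) -> tuple[str, dict]:
    """Parse a completion in the JSON call format above.

    Assumes a single well-formed JSON object with a "name" key and an
    optional "arguments" object.
    """
    call = json.loads(completion)
    return call["name"], call.get("arguments", {})

name, args = parse_function_call(
    '{"name": "get_weather", "arguments": {"city": "Paris"}}'
)
print(name, args)  # get_weather {'city': 'Paris'}
```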
Instruction Tuning and Training Details
Granite-20B-FunctionCalling is an instruction-tuned version of Granite-20B-Code-Instruct. The training data mixture consisted of 142K examples spanning all of the defined tasks and datasets. The model was trained with QLoRA fine-tuning using a rank of 8, an alpha of 32, and a dropout of 0.1, along with a learning rate of 5e-5 and the Apex FusedAdam optimizer under a linear learning rate scheduler. Training ran on a single node with eight A100 80GB GPUs for a total of 3 epochs.
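For readers reproducing a comparable setup, the sketch below wires the reported hyperparameters (rank 8, alpha 32, dropout 0.1) into Hugging Face transformers and peft. The checkpoint name, 4-bit quantization settings, and target modules are assumptions; the paper's exact training stack is not reproduced here.

```python
# Sketch of a comparable QLoRA setup; quantization and target modules
# are assumptions, only rank/alpha/dropout come from the paper.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # QLoRA keeps base weights in 4-bit
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "ibm-granite/granite-20b-code-instruct",  # base model named in the paper
    quantization_config=bnb_config,
    device_map="auto",
)
lora = LoraConfig(
    r=8,                        # rank reported in the paper
    lora_alpha=32,              # alpha reported in the paper
    lora_dropout=0.1,           # dropout reported in the paper
    target_modules=["c_attn"],  # assumed; the paper does not list modules
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()
```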
Experimental Results
The paper presents an extensive evaluation of Granite-20B-FunctionCalling, comparing it against other state-of-the-art function calling models on various datasets and a public leaderboard. The evaluation focuses on assessing the model's generalizability by using out-of-domain datasets.
Berkeley Function Calling Leaderboard (BFCL)
Granite-20B-FunctionCalling ranks fourth in overall accuracy among the top 15 models on the BFCL, the highest rank among models with open licenses. It achieves an AST Summary score of 84.11, an Execution Summary score of 86.50, and a Relevance score of 87.08. While tied with the Gorilla model in overall accuracy, Granite-20B-FunctionCalling exhibits better generalization across the constituent datasets.
Figure 2: Evaluation of Granite-20B-FunctionCalling against the best open function calling models (according to BFCL).
Function Calling Academic Benchmarks
On ToolLLM datasets (G1, G2, and G3) and RestGPT, Granite-20B-FunctionCalling achieves the best performance in detecting function names, with an 8% improvement in F1 score compared to the next best model. The model also outperforms others in sequencing metrics, with a 7% improvement on LCS and an 11% improvement on Exact Match scores. On the API-Bank, ToolBench, and ToolAlpaca datasets, Granite-20B-FunctionCalling achieves an average F1 score of 0.87 for function name prediction.
Figure 3: Performance vs. Hallucination rates for Out-of-Domain Function Calling.
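To ground these metrics, the sketch below gives illustrative implementations of set-based F1 over predicted function names, LCS over call sequences, and whole-sequence exact match; the paper's evaluation harness may normalize names and arguments differently.

```python
# Illustrative metric implementations; normalization details may differ
# from the paper's evaluation harness.
def name_f1(pred: set[str], gold: set[str]) -> float:
    """Set-based F1 between predicted and gold function names."""
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

def lcs_len(pred: list[str], gold: list[str]) -> int:
    """Length of the longest common subsequence of two call sequences."""
    dp = [[0] * (len(gold) + 1) for _ in range(len(pred) + 1)]
    for i, p in enumerate(pred):
        for j, g in enumerate(gold):
            dp[i + 1][j + 1] = dp[i][j] + 1 if p == g else max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(pred)][len(gold)]

def exact_match(pred: list[str], gold: list[str]) -> bool:
    """True only if the predicted call sequence matches the gold exactly."""
    return pred == gold
```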
Response Generation
In response generation, Granite-20B-FunctionCalling achieves strong results on the API-Bank dataset, closely trailing Meta-Llama-3-70B-Instruct. The model achieves a BERTScore of 0.68 and a ROUGE-L score of 0.47, demonstrating that it generates natural language responses effectively.
Conclusion
The paper introduces Granite-20B-FunctionCalling, a function calling model with strong performance and generalizability. Its multi-task learning approach over granular sub-tasks yields state-of-the-art results among open-source models across multiple benchmarks. The comprehensive evaluation demonstrates the effectiveness of the approach and highlights the model's potential for real-world applications. Future work could explore incorporating full function specifications without truncation to improve performance further.