Small Models, Big Tasks: Evaluating Small LLMs for Function Calling
The paper "Small Models, Big Tasks: An Exploratory Empirical Study on Small LLMs for Function Calling" presents an empirical analysis of small LLMs (SLMs) and their potential in the domain of function calling. Authored by researchers from IIIT-Hyderabad, the paper elucidates the performance of SLMs across varied inference strategies and in comparison to the computationally intensive LLMs, highlighting their practicality in resource-constrained environments.
Context and Motivation
Function calling, a critical task closely related to code generation, depends on accurately translating natural-language user queries into structured outputs that can be executed. While LLMs have shown promise across diverse NLP tasks, their computational demands often limit their deployment in real-world, resource-constrained scenarios. This paper investigates whether SLMs can bridge this gap, offering efficient, locally operable solutions for function invocation tasks.
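To make the task concrete, the sketch below shows the kind of input/output contract a function-calling model is expected to satisfy. The tool specification, user query, and expected call are purely illustrative assumptions, not examples taken from the paper.

```python
import json

# Hypothetical tool description supplied to the model (illustrative only).
tool_spec = {
    "name": "get_weather",
    "description": "Fetch the current weather for a city.",
    "parameters": {
        "city": {"type": "string", "description": "City name"},
        "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
    },
}

user_query = "What's the weather in Hyderabad in celsius?"

# The model is expected to emit a machine-parsable call such as:
expected_output = {
    "name": "get_weather",
    "arguments": {"city": "Hyderabad", "unit": "celsius"},
}

# Downstream code can only execute the call if the output parses as valid JSON
# and matches the tool's schema.
print(json.dumps(expected_output, indent=2))
```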
Methodology
The authors select five SLMs from the EvalPlus leaderboard, focusing on models with up to 4B parameters. The selection is judicious, emphasizing models with demonstrated code synthesis capabilities, which suggests potential for reliable function call generation. The experiments use the Salesforce XLAM Function Calling dataset, consisting of 60,000 samples across diverse domains, split into test and fine-tuning subsets.
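A minimal sketch of this data preparation step is shown below, assuming the dataset in question is the publicly released Salesforce/xlam-function-calling-60k collection on Hugging Face; the split ratio and seed are placeholders, not the paper's exact configuration.

```python
from datasets import load_dataset

# Load the 60k-sample function-calling dataset (assumed to be the
# Salesforce/xlam-function-calling-60k release on Hugging Face).
raw = load_dataset("Salesforce/xlam-function-calling-60k", split="train")

# Hold out a test set and keep the remainder for fine-tuning.
splits = raw.train_test_split(test_size=0.1, seed=42)
finetune_set, test_set = splits["train"], splits["test"]

print(len(finetune_set), "fine-tuning samples,", len(test_set), "test samples")
```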
Three inference strategies are evaluated: zero-shot, few-shot, and fine-tuning. Zero-shot execution tests the pre-trained capabilities of SLMs directly on unseen prompts, whereas few-shot prompting supplies task-specific examples that give the model context to guide its outputs. Fine-tuning tailors the models to the task on the fine-tuning split to enhance domain-specific performance. The paper also probes robustness through prompt injection, appending random strings to the prompts, and converts the models to the GGUF format for deployment on edge devices, with a specific focus on latency and memory usage.
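The following sketch illustrates how a few-shot prompt might be assembled and how a random-string injection probe in the spirit of the paper could be implemented. The example queries, tool names, and prompt template are hypothetical assumptions rather than the paper's actual prompts.

```python
import random
import string

# Illustrative few-shot examples; the paper's real examples are not reproduced here.
FEW_SHOT_EXAMPLES = """\
Query: Convert 100 USD to EUR.
Call: {"name": "convert_currency", "arguments": {"amount": 100, "from": "USD", "to": "EUR"}}

Query: What is the weather in Paris?
Call: {"name": "get_weather", "arguments": {"city": "Paris"}}
"""

def build_prompt(query: str, few_shot: bool = True, inject: bool = False) -> str:
    """Assemble a prompt, optionally with few-shot examples and an injected random string."""
    if inject:
        # Robustness probe: append a random string to the query and later check
        # whether the model still emits a valid function call.
        noise = "".join(random.choices(string.ascii_letters + string.digits, k=16))
        query = f"{query} {noise}"
    prefix = FEW_SHOT_EXAMPLES if few_shot else ""
    return f"{prefix}Query: {query}\nCall:"

print(build_prompt("Book a table for two at 7pm.", few_shot=True, inject=True))
```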
Results
The results show that SLMs generally struggle with zero-shot prompting, often failing to produce outputs that adhere to the required JSON format. Deepseek-Coder emerges as the only model showing promise in the zero-shot setting, indicating at least a marginal grasp of the function calling task. Introducing few-shot examples improves performance substantially, with Deepseek-Coder and Phi-3-Mini showing pronounced gains on metrics such as JSON parsability and task accuracy. Fine-tuned models improve further, correcting many of the errors seen in the zero-shot and few-shot settings, though minor degradation is still observed under prompt injection. On edge devices, performance varies across models, with Deepseek-Coder again achieving the best balance between function-call accuracy and computational demand.
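As a rough illustration of what "JSON parsability" and "task accuracy" mean as metrics, the sketch below scores a single model response. The exact scoring rules used in the paper are not reproduced; the strict exact-match comparison here is a simplifying assumption.

```python
import json

def score_output(raw_output: str, expected: dict) -> dict:
    """Score one model response on two axes: JSON parsability and (simplified) task accuracy."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        # Output is not valid JSON, so it cannot be executed at all.
        return {"parsable": False, "correct": False}
    # Simplified correctness check: function name and arguments must match exactly.
    correct = (
        parsed.get("name") == expected.get("name")
        and parsed.get("arguments") == expected.get("arguments")
    )
    return {"parsable": True, "correct": correct}

print(score_output(
    '{"name": "get_weather", "arguments": {"city": "Paris"}}',
    {"name": "get_weather", "arguments": {"city": "Paris"}},
))
```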
Discussion
The findings underline the potential of SLMs in environments that require low-latency, computationally efficient operation. Despite performance variability across the selected models, the clear gains from few-shot learning and fine-tuning suggest that SLMs can be developed into specialized tools, possibly surpassing general-purpose LLMs on targeted function calling tasks. However, robustness to prompt manipulation remains an open problem and poses a security challenge for practical deployments.
Future research directions include structured output generation methods and AI-driven adversarial defense mechanisms. There is also a need for industry-wide benchmarks that standardize SLM evaluation metrics, facilitating reliable integration into software engineering practice and capturing performance consistently across function calling paradigms.
Conclusion
This research provides a valuable foundation for further investigation of SLM efficacy on specialized tasks such as function calling, advocating a careful balance between adaptability and precision when adopting these models. It underscores the scope for refining SLM architectures and deployment strategies, paving the way for sustainable AI applications in real-world computational settings. Through this detailed inquiry, the authors contribute insights that can inform both theoretical advances and practical implementations at the intersection of LLMs and software engineering.