Small Models, Big Tasks: Evaluating Small LLMs for Function Calling
The paper "Small Models, Big Tasks: An Exploratory Empirical Study on Small LLMs for Function Calling" presents an empirical analysis of small LLMs (SLMs) and their potential in the domain of function calling. Authored by researchers from IIIT-Hyderabad, the paper elucidates the performance of SLMs across varied inference strategies and in comparison to the computationally intensive LLMs, highlighting their practicality in resource-constrained environments.
Context and Motivation
Function calling, a critical task closely related to code generation, depends on accurately translating natural-language user queries into structured outputs that can be executed. While LLMs have shown promise across diverse NLP tasks, their computational demands often limit their deployment in real-world, resource-constrained scenarios. This paper investigates whether SLMs can bridge this gap, offering efficient, locally operable solutions for function invocation tasks.
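To make the task concrete, the sketch below shows the kind of input/output contract a function-calling model is expected to satisfy. The tool specification, user query, and expected call are purely illustrative assumptions, not examples taken from the paper.

```python
import json

# Hypothetical tool description supplied to the model (illustrative only).
tool_spec = {
    "name": "get_weather",
    "description": "Fetch the current weather for a city.",
    "parameters": {
        "city": {"type": "string", "description": "City name"},
        "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
    },
}

user_query = "What's the weather in Hyderabad in celsius?"

# The model is expected to emit a machine-parsable call such as:
expected_output = {
    "name": "get_weather",
    "arguments": {"city": "Hyderabad", "unit": "celsius"},
}

# Downstream code can only execute the call if the output parses as valid JSON
# and matches the tool's schema.
print(json.dumps(expected_output, indent=2))
```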
Methodology
The authors select five SLMs from the EvalPlus leaderboard, focusing on models with up to 4B parameters. The selection is judicious, emphasizing models with demonstrated code synthesis capabilities, which suggests potential for reliable function call generation. The experiments use the Salesforce XLAM Function Calling dataset, consisting of 60,000 samples across diverse domains, split into test and fine-tuning subsets.
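A minimal sketch of this data preparation step is shown below, assuming the dataset in question is the publicly released Salesforce/xlam-function-calling-60k collection on Hugging Face; the split ratio and seed are placeholders, not the paper's exact configuration.

```python
from datasets import load_dataset

# Load the 60k-sample function-calling dataset (assumed to be the
# Salesforce/xlam-function-calling-60k release on Hugging Face).
raw = load_dataset("Salesforce/xlam-function-calling-60k", split="train")

# Hold out a test set and keep the remainder for fine-tuning.
splits = raw.train_test_split(test_size=0.1, seed=42)
finetune_set, test_set = splits["train"], splits["test"]

print(len(finetune_set), "fine-tuning samples,", len(test_set), "test samples")
```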
Three inference strategies are evaluated: zero-shot, few-shot, and fine-tuning. Zero-shot execution tests the pre-trained capabilities of SLMs directly on unseen prompts, whereas few-shot prompting supplies task-specific examples that give the model context to guide its outputs. Fine-tuning tailors the models to the task on the fine-tuning split to enhance domain-specific performance. The paper also probes robustness through prompt injection, appending random strings to the prompts, and converts the models to the GGUF format for deployment on edge devices, with a specific focus on latency and memory usage.
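The following sketch illustrates how a few-shot prompt might be assembled and how a random-string injection probe in the spirit of the paper could be implemented. The example queries, tool names, and prompt template are hypothetical assumptions rather than the paper's actual prompts.

```python
import random
import string

# Illustrative few-shot examples; the paper's real examples are not reproduced here.
FEW_SHOT_EXAMPLES = """\
Query: Convert 100 USD to EUR.
Call: {"name": "convert_currency", "arguments": {"amount": 100, "from": "USD", "to": "EUR"}}

Query: What is the weather in Paris?
Call: {"name": "get_weather", "arguments": {"city": "Paris"}}
"""

def build_prompt(query: str, few_shot: bool = True, inject: bool = False) -> str:
    """Assemble a prompt, optionally with few-shot examples and an injected random string."""
    if inject:
        # Robustness probe: append a random string to the query and later check
        # whether the model still emits a valid function call.
        noise = "".join(random.choices(string.ascii_letters + string.digits, k=16))
        query = f"{query} {noise}"
    prefix = FEW_SHOT_EXAMPLES if few_shot else ""
    return f"{prefix}Query: {query}\nCall:"

print(build_prompt("Book a table for two at 7pm.", few_shot=True, inject=True))
```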
Results
The results show that SLMs generally struggle with zero-shot prompting, often failing to produce outputs that adhere to the required JSON format. Deepseek-Coder emerges as the only model showing promise in the zero-shot setting, indicating at least a marginal grasp of the function calling task. Introducing few-shot examples improves performance substantially, with Deepseek-Coder and Phi-3-Mini showing pronounced gains on metrics such as JSON parsability and task accuracy. Fine-tuned models improve further, correcting many of the errors seen in the zero-shot and few-shot settings, though minor degradation is still observed under prompt injection. On edge devices, performance varies across models, with Deepseek-Coder again achieving the best balance between function-call accuracy and computational demand.
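As a rough illustration of what "JSON parsability" and "task accuracy" mean as metrics, the sketch below scores a single model response. The exact scoring rules used in the paper are not reproduced; the strict exact-match comparison here is a simplifying assumption.

```python
import json

def score_output(raw_output: str, expected: dict) -> dict:
    """Score one model response on two axes: JSON parsability and (simplified) task accuracy."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        # Output is not valid JSON, so it cannot be executed at all.
        return {"parsable": False, "correct": False}
    # Simplified correctness check: function name and arguments must match exactly.
    correct = (
        parsed.get("name") == expected.get("name")
        and parsed.get("arguments") == expected.get("arguments")
    )
    return {"parsable": True, "correct": correct}

print(score_output(
    '{"name": "get_weather", "arguments": {"city": "Paris"}}',
    {"name": "get_weather", "arguments": {"city": "Paris"}},
))
```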
Discussion
The findings underline the potential of SLMs in environments that require low-latency, computationally efficient operation. Despite performance variability across the selected models, the clear gains from few-shot learning and fine-tuning suggest that SLMs can be developed into specialized tools, possibly surpassing general-purpose LLMs on targeted function calling tasks. However, robustness to prompt manipulation remains an open problem and poses a security challenge for practical deployments.
Future research directions include structured output generation methods and AI-driven adversarial defense mechanisms. There is also a need for industry-wide benchmarks that standardize SLM evaluation metrics, facilitating reliable integration into software engineering practice and capturing performance consistently across function calling paradigms.
Conclusion
This research provides a valuable foundation for further investigation of SLM efficacy on specialized tasks such as function calling, advocating a careful balance between adaptability and precision when adopting these models. It underscores the scope for refining SLM architectures and deployment strategies, paving the way for sustainable AI applications in real-world computational settings. Through this detailed inquiry, the authors contribute insights that can inform both theoretical advances and practical implementations at the intersection of LLMs and software engineering.