
TinyAgent: Function Calling at the Edge (2409.00608v3)

Published 1 Sep 2024 in cs.CL and cs.LG

Abstract: Recent LLMs have enabled the development of advanced agentic systems that can integrate various tools and APIs to fulfill user queries through function calling. However, the deployment of these LLMs on the edge has not been explored since they typically require cloud-based infrastructure due to their substantial model size and computational demands. To this end, we present TinyAgent, an end-to-end framework for training and deploying task-specific small LLM agents capable of function calling for driving agentic systems at the edge. We first show how to enable accurate function calling for open-source models via the LLMCompiler framework. We then systematically curate a high-quality dataset for function calling, which we use to fine-tune two small LLMs, TinyAgent-1.1B and 7B. For efficient inference, we introduce a novel tool retrieval method to reduce the input prompt length and utilize quantization to further accelerate the inference speed. As a driving application, we demonstrate a local Siri-like system for Apple's MacBook that can execute user commands through text or voice input. Our results show that our models can achieve, and even surpass, the function-calling capabilities of larger models like GPT-4-Turbo, while being fully deployed at the edge. We open-source our dataset, models, and installable package and provide a demo video for our MacBook assistant agent.

Summary

  • The paper introduces TinyAgent, an end-to-end framework to train and deploy task-specific small language models for efficient on-device function calling.
  • The methodology synthesizes a high-quality function-calling dataset with GPT-4-Turbo and fine-tunes TinyAgent-1.1B and TinyAgent-7B, substantially boosting success rates.
  • Practical efficiency techniques are applied: a fine-tuned tool retriever shortens the input prompt, and 4-bit quantization reduces model size by roughly 4x while speeding up inference by more than 30%.

TinyAgent: Function Calling at the Edge

The paper "TinyAgent: Function Calling at the Edge" explores the deployment of small language models (SLMs) to perform function-calling tasks on edge devices. Unlike traditional large-scale LLMs, which depend heavily on cloud infrastructure due to their substantial size and computational demands, TinyAgent is designed to operate locally, addressing critical issues such as privacy, network connectivity, and latency.

Contributions and Framework

The authors introduce TinyAgent, an end-to-end framework to train and deploy task-specific SLMs. The objective is twofold: enabling accurate function calling for open-source models through the LLMCompiler framework, and curating a high-quality dataset for fine-tuning these models. TinyAgent comprises two models, TinyAgent-1.1B and TinyAgent-7B, which target efficient function calling and orchestration.
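
LLMCompiler-style orchestration plans a small DAG of function calls and executes independent calls in parallel. The sketch below is a simplified, hypothetical illustration of that idea; the tool names and plan format are invented for this example, not TinyAgent's or LLMCompiler's actual API.

```python
from concurrent.futures import ThreadPoolExecutor

def get_email_address(name):
    # Hypothetical tool: look up a contact's address.
    return f"{name.lower()}@example.com"

def compose_email(to, body):
    # Hypothetical tool: draft an email.
    return f"Draft to {to}: {body}"

# An LLMCompiler-style plan: each step names a tool, its arguments, and
# the steps it depends on; an argument "$N" means "the output of step N".
plan = {
    1: {"tool": get_email_address, "args": ["Alice"], "deps": []},
    2: {"tool": get_email_address, "args": ["Bob"], "deps": []},
    3: {"tool": compose_email, "args": ["$1", "Meeting at 3pm"], "deps": [1]},
}

def execute(plan):
    results = {}
    remaining = dict(plan)
    with ThreadPoolExecutor() as pool:
        while remaining:
            # Steps whose dependencies are all resolved can run in parallel.
            ready = [i for i, step in remaining.items()
                     if all(d in results for d in step["deps"])]
            futures = {}
            for i in ready:
                step = remaining.pop(i)
                args = [results[int(a[1:])]
                        if isinstance(a, str) and a.startswith("$") else a
                        for a in step["args"]]
                futures[i] = pool.submit(step["tool"], *args)
            for i, fut in futures.items():
                results[i] = fut.result()
    return results

print(execute(plan)[3])
```

Steps 1 and 2 have no dependencies and run concurrently; step 3 waits for step 1's output, mirroring how LLMCompiler exploits parallelism among tool calls.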

Challenges Addressed

LLMs such as GPT-4o and Gemini-1.5 inherently demand high computational resources and memory, making local deployment challenging. These models store comprehensive world knowledge in their parametric memory, much of which is unnecessary for specialized downstream applications. The focus of this research is therefore a smaller model that can perform function-calling tasks with high accuracy and efficiency.

Methodology

Dataset Generation

The researchers created a specialized dataset for function-calling tasks. Starting from a pre-defined set of 16 functions pertinent to a MacBook assistant agent, they synthesized the data with GPT-4-Turbo. This automated pipeline produced 80K training, 1K validation, and 1K test instances while keeping the data high quality and relevant to the target tasks.
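
The released dataset's exact schema is not reproduced in this summary. A plausible sketch of one synthesized training instance, pairing a user query with a ground-truth function-calling plan, might look like the following (field names and tool names are illustrative assumptions, not the actual schema):

```python
import json

# Hypothetical training instance; "$1" in an argument refers to the
# output of plan step 1, echoing the dependency structure of the plans.
instance = {
    "query": "Email Alice the agenda, then open my calendar.",
    "tools": ["get_email_address", "compose_email", "open_calendar"],
    "plan": [
        {"id": 1, "tool": "get_email_address",
         "args": {"name": "Alice"}, "deps": []},
        {"id": 2, "tool": "compose_email",
         "args": {"to": "$1", "body": "the agenda"}, "deps": [1]},
        {"id": 3, "tool": "open_calendar", "args": {}, "deps": []},
    ],
}

# Instances like this would be serialized one per line (JSONL) for training.
line = json.dumps(instance)
print(json.loads(line)["plan"][1]["tool"])  # compose_email
```

Representing the target as a structured plan rather than free text makes it straightforward to score model outputs by exact comparison.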

Fine-tuning SLMs

The base models were TinyLlama-1.1B and Wizard-2-7B. Fine-tuning on the curated dataset improved their success rates substantially: TinyAgent-1.1B rose from 12.71% to 78.89%, and TinyAgent-7B from 41.25% to 83.09%. These results even surpassed GPT-4-Turbo in specific contexts.
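
The summary does not spell out how success is scored; one plausible reading is exact-match evaluation of the predicted plan against the reference. A minimal sketch, assuming a prediction counts as correct when it calls the same tools with the same arguments:

```python
def plan_matches(pred, gold):
    # Canonical form: the multiset of (tool, sorted-args) pairs, so that
    # the ordering of independent calls does not affect the comparison.
    canon = lambda plan: sorted(
        (step["tool"], tuple(sorted(step["args"].items()))) for step in plan)
    return canon(pred) == canon(gold)

def success_rate(predictions, references):
    hits = sum(plan_matches(p, g) for p, g in zip(predictions, references))
    return 100.0 * hits / len(references)

gold = [[{"tool": "open_calendar", "args": {}}]]
pred = [[{"tool": "open_calendar", "args": {}}]]
print(success_rate(pred, gold))  # 100.0
```

This simplified checker ignores dependency edges; a stricter variant would compare those as well.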

Efficiency Improvements

A critical aspect of deploying AI models at the edge is minimizing resource consumption. The TinyAgent framework addresses this by:

  1. Tool Retrieval: Efficiently reducing the prompt length based on the relevance of tools required for the task using a fine-tuned DeBERTa-v3-small model.
  2. Quantization: Compressing the model weights to a 4-bit precision which results in reduced storage and faster inference times.

Tool Retrieval (Tool RAG)

Tool RAG frames tool selection as a multi-label classification problem, so the retriever can pick every tool a query needs with high accuracy. This significantly reduces the input prompt size, improving the model's overall efficiency without compromising performance.
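
In spirit, the retriever scores each available tool against the user query and keeps every tool whose score clears a threshold (multi-label, since one query may need several tools). The scores below are stand-ins for the fine-tuned DeBERTa-v3-small classifier's real outputs:

```python
def select_tools(scores, threshold=0.5):
    """Keep every tool whose relevance score clears the threshold."""
    return [tool for tool, s in scores.items() if s >= threshold]

# Stand-in relevance scores for a query like "email Bob the report".
scores = {
    "compose_email": 0.92,
    "get_email_address": 0.81,
    "open_calendar": 0.07,    # irrelevant to an email query
    "take_screenshot": 0.02,
}
print(select_tools(scores))  # ['compose_email', 'get_email_address']
```

With 16 tools available, passing only the two or three relevant tool descriptions into the prompt substantially shortens the input, which is the source of the efficiency gain.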

Quantization

Quantizing the models to 4-bit precision reduced model size by approximately 4x and improved inference speed by more than 30%, while maintaining, and sometimes even improving, accuracy. This makes TinyAgent highly suitable for practical on-device deployment.
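
Group-wise low-bit quantization stores one shared floating-point scale per group of weights plus a 4-bit integer per weight, which is where the roughly 4x size reduction over 16-bit weights comes from. The sketch below illustrates the idea in pure Python; the actual on-device quantization scheme is more involved than this simplification.

```python
def quantize(weights, group_size=32):
    """Map each group of weights to 4-bit ints plus one shared scale."""
    groups = []
    for i in range(0, len(weights), group_size):
        g = weights[i:i + group_size]
        scale = (max(abs(x) for x in g) / 7) or 1.0  # avoid a zero scale
        # Signed 4-bit range is [-8, 7].
        q = [max(-8, min(7, round(x / scale))) for x in g]
        groups.append((scale, q))
    return groups

def dequantize(groups):
    return [scale * q for scale, qs in groups for q in qs]

w = [0.01 * i for i in range(-50, 50)]
w_hat = dequantize(quantize(w))
err = max(abs(a - b) for a, b in zip(w, w_hat))
print(f"max reconstruction error: {err:.4f}")
```

The per-group rounding error is bounded by half the group's scale, which is why accuracy can survive such aggressive compression.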

Implications and Future Directions

The ability to deploy small, efficient models like TinyAgent on edge devices opens up vast opportunities for real-world applications where privacy, speed, and resource efficiency are paramount. This research demonstrates that SLMs can be fine-tuned to match and exceed the performance of larger models like GPT-4-Turbo for specific tasks.

Future developments may involve extending this framework to a broader range of devices and applications, refining the dataset curation processes, and exploring more advanced quantization techniques. Another potential avenue could be integrating more sophisticated tool retrieval mechanisms that further enhance inference efficiency.

Conclusion

The paper presents an innovative approach to deploying function-calling LLMs at the edge by introducing TinyAgent. Through rigorous methodology involving dataset curation, fine-tuning, efficient tool retrieval, and quantization, TinyAgent demonstrates significant advancements in the deployment of AI agents on resource-constrained devices. This work paves the way for secure, private, and efficient local AI applications, marking a substantial step forward in the domain of edge computing.