
Octopus v2: On-device language model for super agent (2404.01744v5)

Published 2 Apr 2024 in cs.CL

Abstract: LLMs have shown effectiveness in a variety of software applications, particularly in tasks related to automatic workflow. These models possess the crucial ability to call functions, which is essential in creating AI agents. Despite the high performance of large-scale LLMs in cloud environments, they are often associated with concerns over privacy and cost. Current on-device models for function calling face issues with latency and accuracy. Our research presents a new method that empowers an on-device model with 2 billion parameters to surpass the performance of GPT-4 in both accuracy and latency, and decrease the context length by 95%. When compared to Llama-7B with a RAG-based function calling mechanism, our method enhances latency by 35-fold. This method reduces the latency to levels deemed suitable for deployment across a variety of edge devices in production environments, aligning with the performance requisites for real-world applications.

Enhancing On-Device AI Agents with Octopus v2: A Leap in Function-Calling Efficiency

Introduction to On-Device AI Innovation

The introduction of Octopus v2 marks a significant milestone in the evolution of on-device LLMs tailored for sophisticated AI agent applications. Developed by Chen and Li at Stanford University, the model addresses the central challenge of deploying advanced function-calling capabilities directly on edge devices. Unlike its predecessors, Octopus v2 reduces dependence on cloud computing resources while mitigating the latency, cost, and privacy concerns associated with large-scale LLMs.

Addressing the Challenges of On-Device Deployment

Scalability and Performance

Octopus v2 introduces a methodological leap in on-device AI agents' functionality, particularly in executing complex function calls within software applications. The research delineates a novel approach that allows a 2-billion-parameter model to notably outperform GPT-4 in both accuracy and latency. In direct comparison with Llama-7B paired with a Retrieval-Augmented Generation (RAG) based function-calling mechanism, Octopus v2 exhibits a 35-fold improvement in latency. This advancement is critical: it brings on-device models a step closer to cloud-based giants in performance, while significantly reducing operational costs and exposure to privacy breaches.

Efficiency and Reduced Context Length

One of the cornerstone achievements of Octopus v2 is its efficient management of context length, which it reduces by 95%. This is achieved through an innovative encoding of functions into specialized tokens, allowing the model to recognize and execute function calls without processing large amounts of contextual data. Such efficiency not only increases the model's speed but also broadens its applicability across devices by lowering the computational load.
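
To make the saving concrete, here is a minimal, purely illustrative sketch contrasting the two prompting styles. The function names, docstrings, and prompt templates are hypothetical stand-ins, not the paper's actual dataset:

```python
# RAG-style function calling: every candidate function's full description is
# retrieved and pasted into the prompt, inflating the context.
FUNCTION_DOCS = {
    "take_a_photo": "take_a_photo(camera: str, resolution: str) -> str: "
                    "captures a photo with the given camera and resolution.",
    "set_timer": "set_timer(minutes: int) -> None: starts a countdown timer.",
    # ...in practice, dozens of API descriptions would be included here.
}

def rag_prompt(query: str) -> str:
    docs = "\n".join(FUNCTION_DOCS.values())
    return f"Available functions:\n{docs}\n\nQuery: {query}\nCall:"

# Functional-token style: each function is represented by a single learned
# token, so no descriptions need to appear in the context at inference time.
def functional_token_prompt(query: str) -> str:
    return f"Query: {query}\nResponse:"

query = "Take a selfie with the front camera"
print(len(rag_prompt(query)), "chars vs",
      len(functional_token_prompt(query)), "chars")
```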

The Octopus v2 Methodology

Beyond Traditional Deployments

The paper presents a detailed examination of deploying fine-tuned LLMs on edge devices, focusing on challenges like latency and the accuracy of function calls. The methodology capitalizes on the concept of transforming functions into unique functional tokens during the model's training phase. This approach simplifies the function calling process, essentially condensing it into a single-token prediction problem, thus substantially enhancing both the accuracy and speed of execution.
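
The mechanics can be sketched with standard Hugging Face APIs. The `<nexa_i>` token naming follows the paper; the base-model choice and the token count here are assumptions for illustration:

```python
# A minimal sketch of registering functional tokens, assuming a Gemma-2B base;
# the exact number of functional tokens is illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "google/gemma-2b"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# One new special token per callable API, plus a terminator token.
functional_tokens = [f"<nexa_{i}>" for i in range(20)] + ["<nexa_end>"]
tokenizer.add_special_tokens({"additional_special_tokens": functional_tokens})
model.resize_token_embeddings(len(tokenizer))  # new embedding rows are learned in fine-tuning

# After fine-tuning, choosing a function reduces to predicting one of the
# <nexa_i> tokens as the first generated token; the argument string follows,
# and generation stops at <nexa_end>.
```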

Comprehensive Dataset Collection and Training

The authors meticulously compiled a dataset encompassing a wide array of Android APIs, categorized by relevance and frequency of use. This dataset underpins the fine-tuning process, which employs both full-model and LoRA training to optimize performance. The nuanced training strategy not only improved the model's grasp of the functional tokens but also yielded a meaningful reduction in latency.
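
For the LoRA variant, a typical setup with the peft library might look like the following; the rank, dropout, and target modules are illustrative defaults, not the paper's reported configuration:

```python
# A hedged sketch of LoRA fine-tuning; hyperparameters are illustrative.
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    # The new functional-token embeddings must be trainable, so the embedding
    # and output layers are saved in full rather than adapted with LoRA.
    modules_to_save=["embed_tokens", "lm_head"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights train
```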

Experimentation and Results

Extensive benchmarking tests underscore the efficiency and accuracy of Octopus v2 in generating function calls. Notably, when benchmarked against current leading models such as GPT-4 and GPT-3.5, Octopus v2 exhibits superior performance, especially in latency metrics. This indicates a substantial leap forward in on-device AI capabilities, rendering Octopus v2 a formidable contender in the domain of AI-driven function calling.
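
While the paper's numbers come from its own evaluation harness, the basic shape of a per-call latency measurement is simple. This sketch reuses the illustrative model and prompt template from above; any numbers it prints would reflect local hardware, not the paper's reported benchmarks:

```python
# Timing a single function-call generation end to end.
import time
import torch

prompt = "Query: Take a selfie with the front camera\nResponse:"
inputs = tokenizer(prompt, return_tensors="pt")

start = time.perf_counter()
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=32)
elapsed = time.perf_counter() - start

print(tokenizer.decode(output[0], skip_special_tokens=False))
print(f"latency: {elapsed:.3f} s")
```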

Implications and Future Directions

Practical Applications

The practical applications of Octopus v2 are vast. Developers across the spectrum, from mobile applications to automotive software, can leverage this model to integrate sophisticated AI functionalities directly into their products, circumventing the high costs and privacy risks associated with cloud-based models. The efficiency and accuracy of Octopus v2 suggest it could become a foundational component in the next generation of on-device AI agents, potentially transforming user interactions with a wide range of technologies.

Future Developments

Looking ahead, the paper suggests avenues for future research, particularly alternative on-device reasoning approaches that could further improve efficiency and reduce operational costs. The aspiration is to develop models that can operate in both cloud and on-device environments, offering flexible deployment options that cater to privacy, cost, and speed preferences.

Conclusion

In conclusion, the Octopus v2 paper introduces a significant advancement in the field of on-device AI and function-calling LLMs. By addressing core issues of latency, accuracy, and context length, Octopus v2 sets a new benchmark for what is achievable with on-device AI agents. The implications of this research are profound, potentially enabling a more pervasive integration of AI functionalities across devices and platforms without the constraints currently imposed by cloud reliance. As AI continues to evolve, on-device models like Octopus v2 represent a critical step toward realizing the full potential of AI in everyday applications.

Authors (2)
  1. Wei Chen
  2. Zhiyuan Li
Citations (16)