Octopus v2: On-device language model for super agent

(2404.01744)
Published Apr 2, 2024 in cs.CL

Abstract

Language models have shown effectiveness in a variety of software applications, particularly in tasks related to automatic workflow. These models possess the crucial ability to call functions, which is essential in creating AI agents. Despite the high performance of large-scale language models in cloud environments, they are often associated with concerns over privacy and cost. Current on-device models for function calling face issues with latency and accuracy. Our research presents a new method that empowers an on-device model with 2 billion parameters to surpass the performance of GPT-4 in both accuracy and latency, and decrease the context length by 95%. When compared to Llama-7B with a RAG-based function calling mechanism, our method enhances latency by 35-fold. This method reduces the latency to levels deemed suitable for deployment across a variety of edge devices in production environments, aligning with the performance requisites for real-world applications.

Figure: Comparison of accuracy among Llama-7B, GPT-3.5/4, and various Octopus models using different training methods.

Overview

  • Octopus v2 is a groundbreaking on-device language model built to deliver function-calling capabilities with lower latency and fewer privacy concerns than cloud-based alternatives.

  • This model demonstrates remarkable improvements in scalability, performance, and efficiency over its predecessors and competitors like GPT-4 and Llama-7B.

  • It introduces an innovative approach of encoding functions into specialized tokens, significantly reducing context length and computational load.

  • Extensive experimentation shows Octopus v2's superior performance in accuracy and latency, suggesting wide practical applications and setting a new benchmark for on-device AI capabilities.

Enhancing On-Device AI Agents with Octopus v2: A Leap in Function-Calling Efficiency

Introduction to On-Device AI Innovation

The introduction of Octopus v2 represents a significant milestone in the evolution of on-device language models tailored for sophisticated AI agent applications. Developed by Chen and Li at Stanford University, the model addresses the core challenges of deploying advanced function-calling capabilities directly on edge devices. Unlike its predecessors, Octopus v2 not only reduces dependence on cloud computing resources but also significantly diminishes the latency, cost, and privacy concerns associated with large-scale language models.

Addressing the Challenges of On-Device Deployment

Scalability and Performance

Octopus v2 introduces a methodological leap in enhancing on-device AI agents' functionality, particularly in executing complex function calls within software applications. The research delineates a novel approach that allows a 2-billion-parameter model to notably outperform GPT-4 in both accuracy and latency. In a direct comparison with Llama-7B paired with a Retrieval-Augmented Generation (RAG) based function-calling mechanism, Octopus v2 exhibits a 35-fold latency improvement. This advancement is critical, as it brings on-device models a step closer to their cloud-based counterparts in performance while significantly reducing operational costs and exposure to privacy breaches.

Efficiency and Reduced Context Length

One of the cornerstone achievements of Octopus v2 is its efficient handling of context length, which it reduces by 95%. This is achieved through an innovative encoding of functions into specialized tokens, allowing the model to recognize and execute function calls without processing vast amounts of contextual data. Such efficiency not only increases the model's speed but also broadens its applicability across devices by lowering the computational load.
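
To make the context-length saving concrete, here is a minimal sketch (not from the paper) contrasting a conventional function-calling prompt, which must spell out every candidate API, with the functional-token setup, where those descriptions are absorbed into the model's weights during fine-tuning. The `take_a_photo`/`set_timer` signatures are hypothetical, and any tokenizer shows the same effect; Gemma-2B is used here because it is Octopus v2's base model.

```python
# Minimal sketch: why functional tokens shrink the prompt. The two API
# signatures below are hypothetical examples, not the paper's dataset.
from transformers import AutoTokenizer

# Octopus v2 builds on Gemma-2B; any tokenizer illustrates the point.
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")

# Conventional prompt: every candidate function is described in full text,
# so the prompt grows with the size of the API surface.
verbose_prompt = '''You can call the following functions:
def take_a_photo(camera: str, resolution: str) -> str:
    """Capture a photo with the given camera ("front"/"back") and resolution."""
def set_timer(duration_minutes: int) -> bool:
    """Start a countdown timer for the given number of minutes."""
Query: Take a selfie in 4K.'''

# Functional-token prompt: the descriptions live in the fine-tuned weights,
# so only the user query needs to be in context.
compact_prompt = "Query: Take a selfie in 4K."

print(len(tokenizer(verbose_prompt).input_ids))   # grows with every added API
print(len(tokenizer(compact_prompt).input_ids))   # stays small
```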

The Octopus v2 Methodology

Beyond Traditional Deployments

The paper presents a detailed examination of deploying fine-tuned language models on edge devices, focusing on challenges like latency and the accuracy of function calls. The methodology capitalizes on the concept of transforming functions into unique functional tokens during the model's training phase. This approach simplifies the function calling process, essentially condensing it into a single-token prediction problem, thus substantially enhancing both the accuracy and speed of execution.
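
A sketch of what this token-level encoding could look like with the Hugging Face transformers API is shown below. The `<nexa_i>`/`<nexa_end>` token names follow the paper's convention and the count of 20 Android APIs matches its described setup, but the code itself is an illustrative reconstruction, not the authors' training script.

```python
# Illustrative reconstruction of the functional-token setup: each API becomes
# one new vocabulary entry, so choosing a function is a single-token prediction.
from transformers import AutoModelForCausalLM, AutoTokenizer

N_FUNCTIONS = 20  # the paper maps a set of Android APIs to dedicated tokens

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b")

# One functional token per API, plus a marker that ends the call.
functional_tokens = [f"<nexa_{i}>" for i in range(N_FUNCTIONS)] + ["<nexa_end>"]
tokenizer.add_special_tokens({"additional_special_tokens": functional_tokens})
model.resize_token_embeddings(len(tokenizer))  # new rows are learned in fine-tuning

# A training target for "Take a selfie in 4K": the function name is a single
# token, followed by generated arguments and the end marker.
target = "<nexa_0>(camera='front', resolution='4K')<nexa_end>"
print(tokenizer.tokenize(target)[0])  # '<nexa_0>' -- one token, not many
```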

Comprehensive Dataset Collection and Training

The authors meticulously compiled a dataset encompassing a wide array of Android APIs, categorized by relevance and frequency of use. This dataset underpins the fine-tuning process, which employs both full-model training and LoRA (Low-Rank Adaptation) to optimize performance. The nuanced training strategy not only improved the model's understanding of the functional tokens but also yielded a meaningful reduction in latency.
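
As a rough illustration of the LoRA branch of that strategy, the following sketch uses the Hugging Face peft library; the rank, scaling factor, and target modules are placeholder hyperparameters, not the paper's reported configuration.

```python
# Rough LoRA fine-tuning sketch with peft; hyperparameters are illustrative.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("google/gemma-2b")

lora_config = LoraConfig(
    r=16,                                  # low-rank update dimension
    lora_alpha=32,                         # scaling factor for the update
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    modules_to_save=["embed_tokens", "lm_head"],  # also train the new functional-token rows
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of the 2B weights train
```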

Experimentation and Results

Extensive benchmarking tests underscore the efficiency and accuracy of Octopus v2 in generating function calls. Notably, when benchmarked against current leading models such as GPT-4 and GPT-3.5, Octopus v2 exhibits superior performance, especially in latency metrics. This indicates a substantial leap forward in on-device AI capabilities, rendering Octopus v2 a formidable contender in the domain of AI-driven function calling.
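
For readers who want to probe latency in the same spirit, a minimal timing harness might look like the sketch below. It assumes the released NexaAIDev/Octopus-v2 checkpoint on Hugging Face and a simplified prompt; the paper's benchmark conditions (hardware, prompt template) will differ.

```python
# Minimal latency probe: wall-clock time for one complete function call.
# Checkpoint name and prompt format are assumptions; adjust to the released model.
import time
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("NexaAIDev/Octopus-v2")
model = AutoModelForCausalLM.from_pretrained("NexaAIDev/Octopus-v2")

inputs = tokenizer("Take a selfie with the front camera in 4K.", return_tensors="pt")

start = time.perf_counter()
output = model.generate(**inputs, max_new_tokens=32)
elapsed = time.perf_counter() - start

# keep special tokens visible so the functional token shows up in the output
print(tokenizer.decode(output[0], skip_special_tokens=False))
print(f"latency: {elapsed:.2f}s")  # tiny prompts and short calls keep this low
```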

Implications and Future Directions

Practical Applications

The practical applications of Octopus v2 are vast. Developers across the spectrum, from mobile applications to automotive software, can leverage this model to integrate sophisticated AI functionalities directly into their products, circumventing the high costs and privacy risks associated with cloud-based models. The efficiency and accuracy of Octopus v2 suggest it could become a foundational component in the next generation of on-device AI agents, potentially transforming user interactions with a wide range of technologies.

Future Developments

Looking ahead, the paper suggests avenues for future research, particularly in the realm of on-device reasoning alternatives to further enhance performance efficiencies and reduce operational costs. The aspiration is to develop models that can operate both in cloud and on-device environments, offering flexible deployment options catering to privacy, cost, and speed preferences.

Conclusion

The Octopus v2 paper introduces a significant advancement in the field of on-device AI and function-calling language models. By addressing the core issues of latency, accuracy, and context length, Octopus v2 sets a new benchmark for what is achievable with on-device AI agents. The implications of this research are profound, potentially enabling a more pervasive integration of AI functionalities across devices and platforms without the constraints currently imposed by cloud reliance. As AI continues to evolve, on-device models like Octopus v2 represent a critical step toward realizing the full potential of AI in everyday applications.
