Octopus v3: Technical Report for On-device Sub-billion Multimodal AI Agent (2404.11459v2)
Abstract: A multimodal AI agent is characterized by its ability to process and learn from various types of data, including natural language, visual, and audio inputs, in order to inform its actions. Despite advances in large language models that incorporate visual data, such as GPT-4V, effectively translating image-based inputs into actionable outcomes for AI agents remains challenging. In this paper, we introduce a multimodal model that incorporates the concept of functional tokens, designed specifically for AI agent applications. To ensure compatibility with edge devices, our model is optimized to a compact size of less than 1B parameters. Like GPT-4, our model can process both English and Chinese. We demonstrate that this model runs efficiently on a wide range of edge devices, including ones as constrained as a Raspberry Pi.
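The functional-token mechanism carries over from Octopus v2: rather than spelling out a function name in free text, the model emits a single reserved token that maps to a registered action, followed by its arguments. Below is a minimal sketch of that decoding step, assuming a registry keyed by Octopus-style `<nexa_i>` tokens; the registry contents, function names, and decoder shown are illustrative assumptions, not code from the report.

```python
# Hypothetical sketch of functional-token decoding in the Octopus style:
# the model's output begins with one reserved token (e.g. "<nexa_0>")
# that names a device action, followed by argument tokens. The registry
# contents and function names here are illustrative assumptions.

FUNCTION_REGISTRY = {
    "<nexa_0>": "take_photo",
    "<nexa_1>": "send_message",
}

def decode_agent_output(tokens):
    """Split a decoded token sequence into (function_name, args)."""
    func_token, *arg_tokens = tokens
    if func_token not in FUNCTION_REGISTRY:
        raise ValueError(f"unknown functional token: {func_token!r}")
    return FUNCTION_REGISTRY[func_token], arg_tokens

# Example: a vision+text query decoded into an actionable call.
name, args = decode_agent_output(["<nexa_1>", "alice", "running late"])
print(name, args)  # send_message ['alice', 'running late']
```

Because the function name is a single vocabulary token rather than a multi-token string, a sub-billion-parameter model can select an action with one decoding step, which is part of what makes on-device operation practical.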