An LLM-Tool Compiler for Fused Parallel Function Calling (2405.17438v1)

Published 7 May 2024 in cs.PL, cs.AI, and cs.LG

Abstract: State-of-the-art sequential reasoning in LLMs has expanded the capabilities of Copilots beyond conversational tasks to complex function calling, managing thousands of API calls. However, the tendency of compositional prompting to segment tasks into multiple steps, each requiring a round-trip to the GPT APIs, leads to increased system latency and costs. Although recent advancements in parallel function calling have improved tool execution per API call, they may necessitate more detailed in-context instructions and task breakdown at the prompt level, resulting in higher engineering and production costs. Inspired by the hardware design principles of multiply-add (MAD) operations, which fuse multiple arithmetic operations into a single task from the compiler's perspective, we propose LLM-Tool Compiler, which selectively fuses similar types of tool operations under a single function at runtime, presenting them as a unified task to the LLM. This selective fusion inherently enhances parallelization and efficiency. Benchmarked on a large-scale Copilot platform, LLM-Tool Compiler achieves up to four times more parallel calls than existing methods, reducing token costs and latency by up to 40% and 12%, respectively.
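The core idea in the abstract — fusing multiple similar tool operations into a single task presented to the LLM — can be illustrated with a minimal sketch. The paper does not publish an implementation here, so the function and tool names below (`fuse_tool_calls`, `get_weather`, the `_batch` suffix convention) are hypothetical; the sketch only shows how grouping same-type calls collapses several round trips into one fused invocation:

```python
from collections import defaultdict

def fuse_tool_calls(calls):
    """Group planned tool calls by function name so that repeated
    invocations of the same tool become one fused, batched task
    (one round trip) instead of many sequential calls."""
    groups = defaultdict(list)
    for name, args in calls:
        groups[name].append(args)

    fused = []
    for name, arg_list in groups.items():
        if len(arg_list) > 1:
            # Several calls to the same tool: present them to the LLM
            # as a single batched operation (hypothetical convention).
            fused.append((f"{name}_batch", {"batch": arg_list}))
        else:
            fused.append((name, arg_list[0]))
    return fused

plan = [
    ("get_weather", {"city": "Athens"}),
    ("get_weather", {"city": "Chicago"}),
    ("get_weather", {"city": "Tokyo"}),
    ("send_email", {"to": "team@example.com"}),
]
fused_plan = fuse_tool_calls(plan)
# Four planned calls collapse to two fused tasks.
```

In this toy plan, three `get_weather` calls fuse into one batched task alongside the lone `send_email` call, which mirrors the abstract's claim that selective fusion of similar operations inherently increases parallelism per API call while cutting token overhead.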

Authors (5)
  1. Simranjit Singh (10 papers)
  2. Andreas Karatzas (10 papers)
  3. Michael Fore (7 papers)
  4. Iraklis Anagnostopoulos (18 papers)
  5. Dimitrios Stamoulis (23 papers)
Citations (4)