LLMProxy: Reducing Cost to Access Large Language Models (2410.11857v1)
Abstract: In this paper, we make the case for an LLM proxy with explicit support for cost-saving optimizations. We design LLMProxy, which supports three key optimizations: model selection, context management, and caching. These optimizations present trade-offs among cost, inference time, and response quality, which applications can navigate through our high-level, bidirectional interface. As a case study, we implement a WhatsApp-based Q&A service that uses LLMProxy to provide a rich set of features to its users. The service is deployed at a small scale (100+ users) in the cloud; it has been operational for 15+ weeks, and users have asked 1400+ questions so far. We report on our experience running this service and microbenchmark the specific benefits of the cost optimizations presented in this paper.
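To make the abstract's design concrete, the following is a minimal sketch of what an LLMProxy-style interface could look like, showing two of the three optimizations (caching and budget-driven model selection; context management is omitted for brevity). All names and parameters here (`LLMProxy`, `budget`, the model ladder and its per-request costs) are illustrative assumptions, not the paper's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class Response:
    text: str
    model: str
    cost: float

@dataclass
class LLMProxy:
    """Hypothetical proxy: caches answers and picks a model within budget."""
    cache: dict = field(default_factory=dict)
    # Model ladder as (name, assumed cost-per-request) pairs, cheapest first.
    models: tuple = (("small-model", 0.001), ("large-model", 0.02))

    def query(self, prompt: str, budget: float = 0.01) -> Response:
        # Caching: a repeated question is answered at zero marginal cost.
        if prompt in self.cache:
            return self.cache[prompt]
        # Model selection: the most capable (here: most expensive)
        # model whose cost fits the caller's budget.
        name, cost = max(
            (m for m in self.models if m[1] <= budget),
            key=lambda m: m[1],
        )
        resp = Response(text=f"[{name} answer to: {prompt}]",
                        model=name, cost=cost)
        self.cache[prompt] = resp
        return resp
```

The `budget` parameter stands in for the paper's bidirectional interface: the application states how it wants to trade cost against quality, and the proxy resolves that preference into a concrete model choice per request.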