MeanCache: User-Centric Semantic Cache for Large Language Model Based Web Services (2403.02694v3)
Abstract: LLMs like ChatGPT and Llama have revolutionized natural language processing and search engine dynamics. However, these models incur exceptionally high computational costs: GPT-3, for instance, has 175 billion parameters, and a single inference demands billions of floating-point operations. Caching is a natural solution for reducing LLM inference costs on repeated queries, which constitute about 31% of all queries. However, existing caching methods can neither recognize semantic similarity among LLM queries nor handle contextual queries, leading to unacceptable false hit and miss rates. This paper introduces MeanCache, a user-centric semantic cache for LLM-based services that identifies semantically similar queries to determine a cache hit or miss. With MeanCache, the response to a user's semantically similar query can be retrieved from a local cache rather than by re-querying the LLM, reducing costs, service-provider load, and environmental impact. MeanCache leverages Federated Learning (FL) to collaboratively train a query-similarity model without violating user privacy. By placing a local cache on each user's device and training with FL, MeanCache reduces latency and costs and improves model performance, resulting in lower false hit rates. MeanCache also encodes a context chain for every cached query, offering a simple yet highly effective mechanism to distinguish responses to contextual queries from responses to standalone ones. Our experiments, benchmarked against the state-of-the-art caching method, show that MeanCache attains approximately 17% higher F-score and 20% higher precision in semantic cache hit and miss decisions, while performing even better on contextual queries. It also reduces storage requirements by 83% and accelerates semantic cache hit and miss decisions by 11%.
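To make the lookup mechanism concrete, below is a minimal Python sketch of a client-side semantic cache with per-entry context chains. It is an illustrative reconstruction, not the paper's implementation: the class name `LocalSemanticCache`, the 0.8 similarity threshold, and the off-the-shelf `all-MiniLM-L6-v2` sentence-transformer encoder are all assumptions, and the federated fine-tuning of the embedding model described in the abstract is omitted.

```python
# A minimal sketch of a client-side semantic cache in the spirit of MeanCache.
# NOT the paper's implementation: the class name, the 0.8 threshold, and the
# all-MiniLM-L6-v2 encoder are illustrative assumptions. MeanCache additionally
# fine-tunes its embedding model with federated learning, omitted here.
import hashlib
import numpy as np
from sentence_transformers import SentenceTransformer


class LocalSemanticCache:
    def __init__(self, threshold: float = 0.8):
        self.model = SentenceTransformer("all-MiniLM-L6-v2")  # local embedding model
        self.threshold = threshold  # cosine-similarity cutoff for a cache hit
        self.entries = []           # list of (embedding, context_hash, response)

    @staticmethod
    def _context_hash(context: list[str]) -> str:
        # Encode the chain of preceding queries so a contextual query can only
        # match a cached entry that shares the same conversational history.
        return hashlib.sha256("\n".join(context).encode()).hexdigest()

    def lookup(self, query: str, context: list[str] = ()) -> str | None:
        q = self.model.encode(query, normalize_embeddings=True)
        ctx = self._context_hash(list(context))
        best, best_sim = None, self.threshold
        for emb, entry_ctx, response in self.entries:
            if entry_ctx != ctx:
                continue                 # different context chain: never a hit
            sim = float(np.dot(q, emb))  # cosine similarity (vectors normalized)
            if sim >= best_sim:
                best, best_sim = response, sim
        return best                      # None means a miss: query the LLM

    def insert(self, query: str, response: str, context: list[str] = ()) -> None:
        emb = self.model.encode(query, normalize_embeddings=True)
        self.entries.append((emb, self._context_hash(list(context)), response))


cache = LocalSemanticCache()
cache.insert("How do I list files in Linux?", "Use the `ls` command.")
print(cache.lookup("What command lists files on Linux?"))  # likely a hit
print(cache.lookup("What is the capital of France?"))      # miss -> None
```

Gating candidates on a matching context-chain hash mirrors the abstract's mechanism for separating contextual from standalone queries: two textually similar queries posed under different conversation histories can never collide in the cache.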
Authors: Waris Gill, Mohamed Elidrisi, Pallavi Kalapatapu, Ali Anwar, Muhammad Ali Gulzar, Ammar Ahmed