Cache me if you Can: an Online Cost-aware Teacher-Student framework to Reduce the Calls to Large Language Models (2310.13395v1)
Abstract: Prompting Large Language Models (LLMs) performs impressively in zero- and few-shot settings. Hence, small and medium-sized enterprises (SMEs) that can afford neither the cost of creating large task-specific training datasets nor the cost of pretraining their own LLMs are increasingly turning to third-party services that allow them to prompt LLMs. However, such services currently require a payment per call, which becomes a significant operating expense (OpEx). Furthermore, customer inputs are often very similar over time, so SMEs end up prompting LLMs with very similar instances. We propose a framework that reduces the number of calls to LLMs by caching previous LLM responses and using them to train an inexpensive local model on the SME side. The framework includes criteria for deciding when to trust the local model or call the LLM, and a methodology to tune the criteria and measure the tradeoff between performance and cost. For experimental purposes, we instantiate our framework with two LLMs (GPT-3.5 and GPT-4) and two inexpensive students (a k-NN classifier and a Multi-Layer Perceptron), using two common business tasks, intent recognition and sentiment analysis. Experimental results indicate that significant OpEx savings can be obtained with only slightly lower performance.
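To make the cache-and-defer idea concrete, here is a minimal sketch of the decision loop the abstract describes: a cheap local student is trained on cached LLM responses, and a confidence criterion decides whether to trust it or to pay for an LLM call. The names (`LocalStudent`, `call_llm`, `CONF_THRESHOLD`) and the 1-NN student with Jaccard similarity are illustrative assumptions, not the paper's actual implementation.

```python
# Sketch of a cost-aware teacher-student loop: serve locally when confident,
# otherwise call the (paid) LLM teacher and cache its response for training.


def call_llm(text: str) -> str:
    """Placeholder for a paid third-party LLM call (e.g., GPT-3.5/GPT-4).
    Returns a dummy label here so the sketch runs end to end."""
    return "positive"


class LocalStudent:
    """Toy 1-nearest-neighbour student over cached (input, LLM label) pairs."""

    def __init__(self) -> None:
        self.cache: list[tuple[set[str], str]] = []  # (token set, label)

    def add(self, text: str, label: str) -> None:
        """Add a new cached LLM response as a training example."""
        self.cache.append((set(text.lower().split()), label))

    def predict(self, text: str) -> tuple[str | None, float]:
        """Return (label, confidence), where confidence is the Jaccard
        similarity to the nearest cached example; (None, 0.0) if empty."""
        tokens = set(text.lower().split())
        best_label, best_sim = None, 0.0
        for cached_tokens, label in self.cache:
            union = tokens | cached_tokens
            sim = len(tokens & cached_tokens) / len(union) if union else 0.0
            if sim > best_sim:
                best_label, best_sim = label, sim
        return best_label, best_sim


CONF_THRESHOLD = 0.8  # tuned to trade OpEx savings against task performance


def classify(text: str, student: LocalStudent) -> str:
    label, confidence = student.predict(text)
    if label is not None and confidence >= CONF_THRESHOLD:
        return label              # trust the inexpensive local model
    llm_label = call_llm(text)    # otherwise pay for an LLM call...
    student.add(text, llm_label)  # ...and cache it as new training signal
    return llm_label


if __name__ == "__main__":
    student = LocalStudent()
    print(classify("great product, works perfectly", student))  # LLM call, cached
    print(classify("great product, works perfectly", student))  # served locally
```

In the paper's actual setup the students are a k-NN classifier or an MLP over sentence embeddings, and the trust criterion is tuned to balance accuracy against per-call cost; the sketch above only mirrors that structure.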