Cache me if you Can: an Online Cost-aware Teacher-Student framework to Reduce the Calls to Large Language Models (2310.13395v1)

Published 20 Oct 2023 in cs.CL

Abstract: Prompting LLMs performs impressively in zero- and few-shot settings. Hence, small and medium-sized enterprises (SMEs) that can afford neither the cost of creating large task-specific training datasets nor the cost of pretraining their own LLMs are increasingly turning to third-party services that allow them to prompt LLMs. However, such services currently require a payment per call, which becomes a significant operating expense (OpEx). Furthermore, customer inputs are often very similar over time, hence SMEs end up prompting LLMs with very similar instances. We propose a framework that reduces the calls to LLMs by caching previous LLM responses and using them to train a local inexpensive model on the SME side. The framework includes criteria for deciding when to trust the local model or call the LLM, and a methodology to tune the criteria and measure the tradeoff between performance and cost. For experimental purposes, we instantiate our framework with two LLMs, GPT-3.5 or GPT-4, and two inexpensive students, a k-NN classifier or a Multi-Layer Perceptron, using two common business tasks, intent recognition and sentiment analysis. Experimental results indicate that significant OpEx savings can be obtained with only slightly lower performance.
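
Below is a minimal sketch of the cache-and-student loop the abstract describes, assuming a fixed confidence threshold as the criterion for trusting the local model, a scikit-learn k-NN as the student, and precomputed feature vectors (e.g. sentence embeddings) as inputs. The `call_llm` stub, the threshold value, and the per-call cost are illustrative assumptions, not the authors' exact implementation.

```python
# Sketch of a cost-aware teacher-student router: cache LLM answers,
# train a cheap local student on them, and only pay for an LLM call
# when the student is not confident enough.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier


class CostAwareRouter:
    def __init__(self, student=None, confidence_threshold=0.8, cost_per_call=0.002):
        # Hypothetical defaults; the paper tunes such criteria against a cost budget.
        self.student = student or KNeighborsClassifier(n_neighbors=5)
        self.threshold = confidence_threshold
        self.cost_per_call = cost_per_call
        self.cache_X, self.cache_y = [], []  # cached (input embedding, LLM label) pairs
        self.total_cost = 0.0

    def call_llm(self, x):
        """Placeholder for a paid teacher LLM call (e.g. GPT-3.5 or GPT-4)."""
        raise NotImplementedError

    def predict(self, x):
        # Trust the local student only once it has seen enough cached
        # LLM responses and its top-class probability clears the threshold.
        if len(self.cache_y) >= 10:
            proba = self.student.predict_proba([x])[0]
            if proba.max() >= self.threshold:
                return self.student.classes_[proba.argmax()]
        # Otherwise pay for an LLM call, cache the answer, and retrain the student.
        label = self.call_llm(x)
        self.total_cost += self.cost_per_call
        self.cache_X.append(x)
        self.cache_y.append(label)
        self.student.fit(np.array(self.cache_X), np.array(self.cache_y))
        return label
```

In this simplified form the performance/cost tradeoff is governed by the confidence threshold: raising it routes more inputs to the LLM (higher cost, higher accuracy), lowering it relies more on the cached student.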
