Online Cascade Learning for Efficient Inference over Streams (2402.04513v3)
Abstract: Large language models (LLMs) have a natural role in answering complex queries about data streams, but the high computational cost of LLM inference makes them infeasible in many such tasks. We propose online cascade learning, the first approach to address this challenge. The objective is to learn a "cascade" of models, starting with lower-capacity models (such as logistic regression) and ending with a powerful LLM, along with a deferral policy that determines which model to use on a given input. We formulate the task of learning cascades online as an imitation-learning problem, in which the smaller models are updated over time by imitating the collected LLM demonstrations, and we give a no-regret algorithm for this problem. Experimental results across four benchmarks show that our method matches the accuracy of LLMs while cutting inference costs by as much as 90%, and that it is robust to input distribution shifts, underscoring its efficacy and adaptability in stream processing.
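To make the idea concrete, below is a minimal sketch of a two-level cascade over a text stream, written under stated assumptions rather than as the paper's implementation: a cheap online logistic-regression "student" handles confident inputs and defers the rest to an expensive LLM, whose answers are then used as demonstrations to update the student. The confidence-threshold deferral and the hashing-vectorizer/SGD student are simplifying assumptions (the paper learns the deferral policy itself), and `call_llm` is a hypothetical placeholder for a real LLM query.

```python
# Minimal sketch of online cascade learning (assumptions noted above):
# a cheap student model plus an expensive LLM, with confidence-based deferral
# standing in for the paper's learned deferral policy.
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

CLASSES = np.array([0, 1])     # e.g. binary labels over the stream
CONF_THRESHOLD = 0.9           # defer to the LLM below this confidence

vectorizer = HashingVectorizer(n_features=2**16, alternate_sign=False)
student = SGDClassifier(loss="log_loss")   # online logistic regression
student_initialized = False

def call_llm(text: str) -> int:
    """Hypothetical stand-in for an expensive LLM call; returns a label."""
    return int("good" in text.lower())     # toy placeholder logic

def process(text: str) -> int:
    """Answer one stream item, querying the LLM only when the student is unsure."""
    global student_initialized
    x = vectorizer.transform([text])
    if student_initialized:
        proba = student.predict_proba(x)[0]
        if proba.max() >= CONF_THRESHOLD:
            # Cheap path: the student is confident enough to answer directly.
            return int(student.classes_[np.argmax(proba)])
    # Expensive path: query the LLM, then imitate its answer online.
    label = call_llm(text)
    student.partial_fit(x, [label], classes=CLASSES)
    student_initialized = True
    return label

if __name__ == "__main__":
    stream = ["good product", "terrible service", "good value overall"]
    print([process(item) for item in stream])
```

In this sketch the fraction of LLM calls falls as the student accumulates demonstrations, which is the source of the cost savings; the paper replaces the fixed threshold with a learned policy and provides no-regret guarantees for the resulting imitation-learning procedure.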