Agreement-Based Cascading for Efficient Inference (2407.02348v2)
Abstract: Adaptive inference schemes reduce the cost of machine learning inference by assigning smaller models to easier examples, attempting to avoid invocation of larger models when possible. In this work we explore a simple, effective adaptive inference technique we term Agreement-Based Cascading (ABC). ABC builds a cascade of models of increasing size/complexity, and uses agreement between ensembles of models at each level of the cascade as a basis for data-dependent routing. Although ensemble execution introduces additional expense, we show that these costs can be easily offset in practice due to large expected differences in model sizes, parallel inference execution capabilities, and accuracy benefits of ensembling. We examine ABC theoretically and empirically in terms of these parameters, showing that the approach can reliably act as a drop-in replacement for existing models and surpass the best single model it aims to replace in terms of both efficiency and accuracy. Additionally, we explore the performance of ABC relative to existing cascading methods in three common scenarios: (1) edge-to-cloud inference, where ABC reduces communication costs by up to 14x; (2) cloud-based model serving, where it achieves a 3x reduction in rental costs; and (3) inference via model API services, where ABC achieves a 2-25x reduction in average price per token/request relative to state-of-the-art LLM cascades.
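The routing idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each cascade level is a small ensemble of classifiers, that agreement is measured as the fraction of votes for the majority label, and that an example is deferred to the next level whenever this fraction falls below a per-level threshold. The names `abc_predict`, `levels`, and `thresholds` are hypothetical and chosen only for illustration.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence, Tuple

# Assumption: a "model" is any callable mapping an input to a hashable label.
Model = Callable[[object], Hashable]


def abc_predict(
    x: object,
    levels: Sequence[Sequence[Model]],
    thresholds: Sequence[float],
) -> Tuple[Hashable, int]:
    """Return (prediction, index of the cascade level that answered).

    levels[i] is the ensemble at level i, ordered from cheapest to most
    expensive. thresholds[i] is the minimum majority-vote fraction needed
    to accept level i's answer; the last level always answers, so it has
    no threshold.
    """
    if not levels or any(len(ensemble) == 0 for ensemble in levels):
        raise ValueError("every cascade level needs at least one model")
    if len(thresholds) != len(levels) - 1:
        raise ValueError("need one threshold per non-final level")

    last = len(levels) - 1
    for i, ensemble in enumerate(levels):
        votes = Counter(model(x) for model in ensemble)
        label, count = votes.most_common(1)[0]
        agreement = count / len(ensemble)
        # Accept if the ensemble agrees strongly enough, or if there is
        # no larger level left to defer to.
        if i == last or agreement >= thresholds[i]:
            return label, i
    raise AssertionError("unreachable: the last level always answers")


# Toy usage: three cheap, slightly different classifiers, then one "large" model.
small = [lambda v: v > 0, lambda v: v > 1, lambda v: v > -1]
large = [lambda v: v > 0.75]

print(abc_predict(5, [small, large], thresholds=[1.0]))    # (True, 0): small models agree
print(abc_predict(0.5, [small, large], thresholds=[1.0]))  # (False, 1): disagreement, deferred
```

In this sketch the easy input is answered by the cheap ensemble alone, while the input on which the small models disagree is routed to the larger model. Lowering the threshold trades accuracy for fewer deferrals.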