CEEBERT: Cross-Domain Inference in Early Exit BERT (2405.15039v1)
Abstract: Pre-trained Language Models (PLMs), such as BERT, trained with self-supervision objectives exhibit remarkable performance and generalization across various tasks. However, they suffer from high inference latency due to their large size. To address this issue, side branches are attached at intermediate layers, enabling early inference on samples without requiring them to pass through all layers. The challenge, however, is to decide at which layer to infer and exit each sample so that accuracy and latency are balanced. Moreover, the distribution of the samples to be inferred may differ from that used for training, necessitating cross-domain adaptation. We propose an online learning algorithm named Cross-Domain Inference in Early Exit BERT (CeeBERT) that dynamically determines the early exit of a sample based on the level of confidence at each exit point. CeeBERT learns optimal thresholds from the domain-specific confidences observed at intermediate layers on the fly, eliminating the need for labeled data. Experimental results on five distinct datasets with BERT and ALBERT models demonstrate CeeBERT's ability to reduce latency by eliminating unnecessary computation with minimal loss in performance. By adapting its threshold values, CeeBERT can speed up BERT/ALBERT by $2\times$ to $3.5\times$ with a minimal drop in accuracy.
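The mechanism described above (per-exit confidence thresholds learned online from unlabeled confidence signals) can be illustrated as a multi-armed bandit over candidate thresholds. The sketch below is a minimal, hypothetical rendering, not the authors' implementation: `CANDIDATE_THRESHOLDS`, `COST_PER_LAYER`, and the synthetic `exit_confidences` stub are all assumptions, and UCB1 stands in for whichever bandit update CeeBERT actually uses.

```python
import math
import random

# Hypothetical setup: each bandit arm is a candidate confidence threshold.
# The unlabeled "reward" trades exit confidence against exit depth (latency).
CANDIDATE_THRESHOLDS = [0.5, 0.6, 0.7, 0.8, 0.9]  # assumed candidate arms
NUM_LAYERS = 12                                   # e.g. BERT-base exits
COST_PER_LAYER = 0.05                             # assumed latency penalty weight

counts = [0] * len(CANDIDATE_THRESHOLDS)
values = [0.0] * len(CANDIDATE_THRESHOLDS)
t = 0

def exit_confidences(sample):
    """Stand-in for per-exit confidences; in practice these would be the
    softmax maxima of each side-branch classifier on the backbone."""
    random.seed(sample)
    # Confidence tends to rise with depth for most samples.
    return [min(1.0, 0.3 + 0.06 * layer + random.random() * 0.2)
            for layer in range(1, NUM_LAYERS + 1)]

def select_arm():
    """UCB1: play each arm once, then maximize mean + exploration bonus."""
    for i, c in enumerate(counts):
        if c == 0:
            return i
    return max(range(len(counts)),
               key=lambda i: values[i] + math.sqrt(2 * math.log(t) / counts[i]))

for sample in range(2000):
    t += 1
    arm = select_arm()
    threshold = CANDIDATE_THRESHOLDS[arm]
    confs = exit_confidences(sample)
    # Exit at the first layer whose confidence clears the threshold;
    # otherwise fall through to the final layer.
    exit_layer = next((l for l, c in enumerate(confs, 1) if c >= threshold),
                      NUM_LAYERS)
    conf = confs[exit_layer - 1]
    # Unsupervised reward: high confidence is good, depth is costly.
    reward = conf - COST_PER_LAYER * exit_layer
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]

best = max(range(len(counts)), key=lambda i: values[i])
print(f"learned threshold ~ {CANDIDATE_THRESHOLDS[best]}")
```

Because the reward uses only the observed confidence and the exit depth, no labels are needed, which matches the paper's claim that thresholds adapt to a new domain on the fly.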
- Divya Jyoti Bajpai
- Manjesh Kumar Hanawal