RAEE: A Robust Retrieval-Augmented Early Exiting Framework for Efficient Inference (2405.15198v2)
Abstract: Deploying large language model (LLM) inference remains challenging due to its high computational overhead. Early exiting accelerates inference by adaptively reducing the number of layers executed per input. Existing methods typically train internal classifiers to decide whether to exit at intermediate layers. However, such classifier-based early exiting frameworks require significant effort to train the classifiers, yet achieve at best comparable performance. To address these limitations, this paper proposes RAEE, a robust Retrieval-Augmented Early Exiting framework for efficient inference. First, this paper shows that early exiting can be modeled as a distribution prediction problem, where the distribution is approximated from the exiting information of similar data. Second, it details how exiting information is collected to build the retrieval database. Finally, given the pre-built retrieval database, RAEE uses the retrieved neighbors' exiting information to guide the backbone model to exit at the layer predicted by the approximated distribution. Experimental results demonstrate that RAEE significantly accelerates inference while achieving robust zero-shot performance on eight downstream tasks.
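To make the retrieve-then-exit mechanism concrete, below is a minimal, self-contained sketch of the inference-time step. It is not the authors' implementation: the database layout (`db_embeddings`, `db_exit_layers`), the brute-force L2 search (standing in for a FAISS-style index), and the mode-of-neighbors rule for choosing the exit layer are all illustrative assumptions.

```python
import numpy as np

# Sketch of RAEE-style retrieval-augmented early exiting, assuming:
# `db_embeddings` holds encoder embeddings of training inputs, and
# `db_exit_layers[i]` records the shallowest layer at which example i
# could exit correctly (collected offline). Names are illustrative.

rng = np.random.default_rng(0)
num_layers = 24
db_embeddings = rng.normal(size=(1000, 128)).astype(np.float32)
db_exit_layers = rng.integers(4, num_layers, size=1000)

def predict_exit_layer(query_emb: np.ndarray, k: int = 8) -> int:
    """Approximate the exit-layer distribution from the k nearest
    neighbors' recorded exit layers and return its mode."""
    # Brute-force L2 nearest-neighbor search; at scale this would be
    # replaced by an approximate index such as FAISS.
    dists = np.linalg.norm(db_embeddings - query_emb, axis=1)
    neighbors = np.argsort(dists)[:k]
    # Empirical distribution over exit layers among the neighbors.
    layers, counts = np.unique(db_exit_layers[neighbors], return_counts=True)
    return int(layers[np.argmax(counts)])

query = rng.normal(size=128).astype(np.float32)
exit_layer = predict_exit_layer(query)
# The backbone would then run only layers 0..exit_layer and emit its
# prediction from that layer's attached output head.
print(f"exit at layer {exit_layer} of {num_layers}")
```

The mode is only one way to read off an exit layer from the approximated distribution; one could instead sample from it or take a conservative quantile to trade speed for accuracy. The paper's exact aggregation rule is not reproduced here.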