Dynamic Vocabulary Pruning in Early-Exit LLMs (2410.18952v2)
Abstract: Increasing the size of large language models (LLMs) has been shown to improve performance, but this comes at the cost of slower and more expensive inference. Early-exiting is a promising approach to improving the efficiency of LLM inference by enabling next-token prediction at intermediate layers. Yet the large vocabulary size of modern LLMs makes the confidence estimation required for exit decisions computationally expensive, diminishing the efficiency gains. To address this, we propose dynamically pruning the vocabulary at test time for each token: the vocabulary is pruned at one of the initial layers, and the smaller vocabulary is then used throughout the rest of the forward pass. Our experiments demonstrate that such post-hoc dynamic vocabulary pruning improves the efficiency of confidence estimation in early-exit LLMs while maintaining competitive performance.
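To make the pruning mechanism concrete, here is a minimal PyTorch sketch of the idea described above: score the full vocabulary once at an early layer, keep only the top-k candidate tokens, and reuse the resulting small unembedding head for every subsequent exit decision. All names (`prune_vocab`, `exit_confidence`), the value of `k`, the top-k pruning rule, and the top-1/top-2 margin used as the confidence measure are illustrative assumptions, not the paper's exact procedure.

```python
import torch

def prune_vocab(hidden, unembedding, k=100):
    """At an early layer, score the full vocabulary once and keep the
    top-k candidate token ids plus their rows of the unembedding matrix.
    (Assumed top-k rule; the paper may use a different criterion.)"""
    logits = hidden @ unembedding.T               # (vocab_size,) full-vocab matmul, done once
    pruned_ids = torch.topk(logits, k).indices    # token ids kept for later layers
    pruned_unembedding = unembedding[pruned_ids]  # (k, d_model) small head
    return pruned_ids, pruned_unembedding

def exit_confidence(hidden, pruned_unembedding):
    """Confidence estimate over the pruned vocabulary only: a (k,)-sized
    softmax instead of a full-vocabulary one."""
    logits = hidden @ pruned_unembedding.T        # (k,) instead of (vocab_size,)
    probs = torch.softmax(logits, dim=-1)
    top2 = torch.topk(probs, 2).values
    return (top2[0] - top2[1]).item()             # assumed top-1/top-2 margin confidence

# Toy usage: prune at an early layer, then reuse the small head at a deeper exit.
d_model, vocab_size = 16, 1000
unembedding = torch.randn(vocab_size, d_model)
early_hidden = torch.randn(d_model)               # hidden state at the pruning layer
ids, small_head = prune_vocab(early_hidden, unembedding, k=100)

later_hidden = torch.randn(d_model)               # hidden state at a later exit point
if exit_confidence(later_hidden, small_head) > 0.9:   # assumed exit threshold
    token_id = ids[torch.argmax(later_hidden @ small_head.T)].item()
    print("early exit with token id", token_id)
```

Pruning once and caching the small head amortizes the single full-vocabulary matmul across all later exit checks, which is where the savings in confidence estimation come from.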