$C^3$: Confidence Calibration Model Cascade for Inference-Efficient Cross-Lingual Natural Language Understanding (2402.15991v1)
Abstract: Cross-lingual natural language understanding (NLU) is a critical task in NLP. Recent advancements have seen multilingual pre-trained language models (mPLMs) significantly enhance the performance of these tasks. However, mPLMs demand substantial resources and incur high computational costs during inference, posing challenges for deployment in real-world and real-time systems. Existing model cascade methods seek to improve inference efficiency by greedily selecting, from a pool of models, the lightest one deemed capable of processing the current input based on its confidence score. However, deep models tend to be overconfident, and confidence distributions vary across languages; as a result, smaller models emit confident but incorrect predictions, which hinders the cascade's ability to generalize across test languages. In this study, we introduce a confidence calibration model cascade ($C^3$) method. This simple yet effective approach performs calibration prior to cascade inference, improving cascade accuracy through more reliable predictions. Extensive experiments on three cross-lingual benchmarks demonstrate that $C^3$ significantly outperforms all state-of-the-art baselines.
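To make the cascade-with-calibration idea concrete, below is a minimal sketch, not the authors' implementation: it uses grid-searched temperature scaling as a stand-in for the paper's calibration step and routes each input from the lightest to the heaviest model until a calibrated confidence clears a threshold. The `logits` method on each model, the threshold value, and the temperature grid are illustrative assumptions.

```python
# Illustrative sketch of a confidence-calibrated model cascade.
# Assumes each model exposes a `logits(x)` call returning class logits;
# temperatures and the confidence threshold are placeholders.
import numpy as np

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    p = np.exp(z)
    return p / p.sum(axis=-1, keepdims=True)

def fit_temperature(val_logits, val_labels, grid=np.linspace(0.5, 5.0, 46)):
    """Pick the temperature minimizing negative log-likelihood on held-out data
    (a simple grid-search stand-in for gradient-based temperature scaling)."""
    best_t, best_nll = 1.0, np.inf
    for t in grid:
        probs = softmax(val_logits, t)
        nll = -np.log(probs[np.arange(len(val_labels)), val_labels] + 1e-12).mean()
        if nll < best_nll:
            best_t, best_nll = t, nll
    return best_t

def cascade_predict(x, models, temperatures, threshold=0.9):
    """Route an input through models ordered from lightest to heaviest.
    Return the first prediction whose calibrated confidence clears the threshold;
    the final (largest) model always answers if no earlier model is confident."""
    for model, t in zip(models[:-1], temperatures[:-1]):
        probs = softmax(model.logits(x), t)
        if probs.max() >= threshold:
            return int(probs.argmax())
    probs = softmax(models[-1].logits(x), temperatures[-1])
    return int(probs.argmax())
```

The design intent mirrors the abstract: calibrating each model before cascading reduces the chance that a small, overconfident model answers incorrectly and short-circuits the route to a larger, more accurate model, which matters especially when confidence distributions shift across languages.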