Distill or Annotate? Cost-Efficient Fine-Tuning of Compact Models (2305.01645v3)
Abstract: Fine-tuning large models is highly effective; however, inference can be expensive and produce carbon emissions. Knowledge distillation has been shown to be a practical solution for reducing inference costs, but the distillation process itself requires significant computational resources. Rather than buying or renting GPUs to fine-tune and then distill a large model, an NLP practitioner might instead choose to allocate the available budget to hiring annotators who manually label additional fine-tuning data. In this paper, we investigate how to most efficiently use a fixed budget to build a compact model. Through extensive experiments on six diverse tasks, we show that distilling from T5-XXL (11B) to T5-Small (60M) is almost always a cost-efficient strategy compared to annotating more data to directly train a compact model (T5-Small). We further investigate how the optimal share of the budget allocated to computation varies across scenarios. We will make our code, datasets, annotation cost estimates, and baseline models available as a benchmark to support further work on cost-efficient training of compact models.
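The central question is one of budget allocation: a fixed amount of money can buy either human labels or GPU hours. As a rough illustration only, the sketch below compares the two strategies under entirely assumed unit costs (the per-label price, GPU rental rate, and distillation compute hours are placeholders, not figures from the paper).

```python
# Minimal budget-allocation sketch. All unit costs are illustrative assumptions,
# not estimates reported in the paper.

ANNOTATION_COST_PER_EXAMPLE = 0.10  # assumed $ paid per human-labeled example
GPU_COST_PER_HOUR = 2.50            # assumed $ per accelerator hour
DISTILLATION_GPU_HOURS = 200        # assumed hours to fine-tune T5-XXL and distill into T5-Small


def annotate_strategy(budget: float) -> dict:
    """Spend the entire budget on extra labels and fine-tune T5-Small directly."""
    return {"extra_labels": int(budget / ANNOTATION_COST_PER_EXAMPLE), "gpu_hours": 0}


def distill_strategy(budget: float) -> dict:
    """Spend on compute first (fine-tune T5-XXL, distill to T5-Small); any leftover buys labels."""
    compute_cost = DISTILLATION_GPU_HOURS * GPU_COST_PER_HOUR
    if budget < compute_cost:
        return {"extra_labels": int(budget / ANNOTATION_COST_PER_EXAMPLE), "gpu_hours": 0}
    leftover = budget - compute_cost
    return {"extra_labels": int(leftover / ANNOTATION_COST_PER_EXAMPLE),
            "gpu_hours": DISTILLATION_GPU_HOURS}


if __name__ == "__main__":
    for budget in (500.0, 2000.0, 10000.0):
        print(f"${budget:>8.0f}  annotate: {annotate_strategy(budget)}  distill: {distill_strategy(budget)}")
```

The paper's contribution is the empirical comparison of these two endpoints (and intermediate splits) across six tasks; the sketch only makes the trade-off concrete, it does not reproduce the paper's cost accounting.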