Layer-wise Regularized Dropout for Neural Language Models

Published 26 Feb 2024 in cs.CL and cs.AI (arXiv:2402.16361v1)

Abstract: Among the various pre-trained neural language models that are popular today, dropout is already an indispensable regularization technique. To resolve the inconsistency between training and inference caused by the randomness of dropout, some studies apply consistency training to regularize dropout at the output layer. In this paper, we propose a novel Layer-wise Regularized Dropout (LR-Drop), which is specially designed for Transformer-based language models. Specifically, LR-Drop regularizes each Transformer layer using a consistency training strategy. Each training sample passes through two siamese sub-models sampled by dropout, and LR-Drop then forces the hidden states, multi-head attention matrices, and output distributions of the two siamese sub-models to be consistent. The proposed LR-Drop can be regarded as a "self-distillation" framework, in which each sub-model generated by dropout serves as both the "teacher" and the "student" of the other. Through extensive experiments on 8 natural language understanding datasets, 6 neural machine translation datasets, and 1 abstractive summarization dataset (15 datasets in total), we show that LR-Drop achieves superior performance, including state-of-the-art results.
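As a concrete illustration of the objective described in the abstract, the snippet below is a minimal PyTorch-style sketch of a consistency loss over two dropout-sampled forward passes of the same model. The use of mean-squared error for the hidden-state and attention terms, the symmetric KL divergence for the output distributions, and the weights alpha, beta, and gamma are assumptions made for this sketch rather than the paper's exact formulation; the function and argument names are likewise hypothetical.

import torch.nn.functional as F

def lr_drop_loss(logits1, logits2, hiddens1, hiddens2, attns1, attns2,
                 labels, alpha=1.0, beta=1.0, gamma=1.0):
    """Consistency loss between two dropout-sampled sub-models (illustrative sketch).

    logits*:  (batch, num_classes) output logits of each forward pass
    hiddens*: per-layer hidden states, each (batch, seq_len, hidden_dim)
    attns*:   per-layer attention matrices, each (batch, heads, seq_len, seq_len)
    """
    # Task loss: average cross-entropy of the two sub-models.
    ce = 0.5 * (F.cross_entropy(logits1, labels) + F.cross_entropy(logits2, labels))

    # Output-level consistency: symmetric KL between the two output distributions.
    p = F.log_softmax(logits1, dim=-1)
    q = F.log_softmax(logits2, dim=-1)
    kl = 0.5 * (F.kl_div(p, q, reduction="batchmean", log_target=True)
                + F.kl_div(q, p, reduction="batchmean", log_target=True))

    # Layer-wise consistency: match hidden states and multi-head attention
    # matrices of every Transformer layer (MSE is an assumed distance here).
    hid = sum(F.mse_loss(h1, h2) for h1, h2 in zip(hiddens1, hiddens2)) / len(hiddens1)
    att = sum(F.mse_loss(a1, a2) for a1, a2 in zip(attns1, attns2)) / len(attns1)

    return ce + alpha * kl + beta * hid + gamma * att

In training, each batch would be fed through the same network twice with dropout active, so the two passes play the role of the siamese sub-models; their logits, per-layer hidden states, and per-layer attention matrices feed the loss above. At inference, a single deterministic pass with dropout disabled is used, which is exactly the train/inference gap the consistency terms aim to close.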

