Token-Level Fitting Issues of Seq2seq Models

Published 8 May 2023 in cs.CL (arXiv:2305.04493v2)

Abstract: Sequence-to-sequence (seq2seq) models have been widely used for natural language processing, computer vision, and other deep learning tasks. We find that seq2seq models trained with early-stopping suffer from issues at the token level. In particular, while some tokens in the vocabulary demonstrate overfitting, others underfit when training is stopped. Experiments show that the phenomena are pervasive in different models, even in fine-tuned large pretrained-models. We identify three major factors that influence token-level fitting, which include token frequency, parts-of-speech, and prediction discrepancy. Further, we find that external factors such as language, model size, domain, data scale, and pretraining can also influence the fitting of tokens.


Summary

  • The paper reveals token-level fitting issues where high-frequency tokens tend to overfit and low-frequency tokens underfit during training.
  • Extensive experiments on English–German translation show that token frequency, part-of-speech, and prediction discrepancy significantly affect token-level fitting.
  • The study suggests adaptive fine-tuning strategies to mitigate these issues, offering promising improvements for seq2seq model training.


Introduction

The paper "Token-Level Fitting Issues of Seq2seq Models" (arXiv:2305.04493) investigates the fitting challenges that sequence-to-sequence (seq2seq) models encounter at the token level. While seq2seq models are prevalent in NLP and broader AI fields, they face specific problems when training is halted by early stopping: when training stops, some tokens have already overfit while others still underfit, depending on factors such as token frequency, part-of-speech, and prediction discrepancy. Extensive experiments on English–German translation show that these issues are pervasive and are influenced by external factors such as language, model size, and training data scale (Figure 1).

Figure 1: Idealized training and validation loss curves, where the model is selected by early stopping.
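The checkpoint-selection rule illustrated by Figure 1 can be sketched in a few lines. This is a generic patience-based variant, not necessarily the paper's exact training setup; the `patience` value is an assumption for illustration.

```python
def early_stop_index(val_losses, patience=3):
    """Return the index of the checkpoint early stopping would select:
    the epoch with the lowest validation loss seen so far, stopping the
    scan after `patience` consecutive non-improving epochs."""
    best_idx, since_best = 0, 0
    for i in range(1, len(val_losses)):
        if val_losses[i] < val_losses[best_idx]:
            best_idx, since_best = i, 0  # new best checkpoint
        else:
            since_best += 1
            if since_best >= patience:
                break  # training would be halted here
    return best_idx
```

Against such a globally selected epoch, individual tokens can still be "early" or "late" in their own fitting, which is the paper's central observation.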

Token-Level Fitting Dynamics

Influence of Token Frequency

The research reveals that token frequency significantly affects fitting: low-frequency tokens commonly underfit, while high-frequency tokens tend to overfit. The study uses a fitting-offset measure, grouping tokens by frequency and computing each group's average offset from the early-stopping point. The experiments suggest that low-frequency tokens have a higher potential-gain, indicating that addressing their underfitting could boost model accuracy (Figure 2).

Figure 2: Fitting-offset of tokens grouped by token frequency.
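The grouping analysis above can be sketched as follows. Here the fitting-offset of a token is taken to be the gap between the epoch that minimizes that token's own validation loss and the global early-stopping epoch (positive means the token still underfits at the stopping point, negative means it has already overfit); this concrete definition, and the binning scheme, are assumptions made for illustration.

```python
from collections import defaultdict

def fitting_offset_by_frequency(token_losses, stop_epoch, token_freq, bins):
    """token_losses: dict token -> list of per-epoch validation losses.
    bins: dict bin name -> (low, high) frequency range.
    Returns the average fitting-offset per frequency bin."""
    offsets = defaultdict(list)
    for tok, losses in token_losses.items():
        best_epoch = min(range(len(losses)), key=losses.__getitem__)
        offset = best_epoch - stop_epoch  # >0: underfit, <0: overfit
        for name, (lo, hi) in bins.items():
            if lo <= token_freq[tok] < hi:
                offsets[name].append(offset)
                break
    return {name: sum(v) / len(v) for name, v in offsets.items() if v}
```

On toy data, a frequent token whose loss bottoms out before the stopping epoch lands in a negative (overfitting) bin, while a rare token whose loss is still falling lands in a positive (underfitting) bin, mirroring the trend in Figure 2.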

Linguistic Factors: Parts-of-Speech

An exploration of linguistic features, specifically part-of-speech (POS), reveals differential fitting behavior. Function words generally overfit owing to their frequent appearance, whereas nouns, being more content-rich and context-dependent, tend to underfit. By combining frequency with POS, the analysis provides a granular view of fitting issues, highlighting non-trivial patterns such as overfitting in function words even within low-frequency categories (Figure 3).

Figure 3: Fitting-offset of tokens grouped by parts-of-speech.
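The combined frequency-and-POS view described above amounts to a cross-tabulation of per-token offsets. A minimal sketch, assuming per-token offsets, POS tags, and counts are already available (the two-band frequency split and its threshold are illustrative choices, not the paper's exact scheme):

```python
from collections import defaultdict

def offsets_by_pos_and_band(offsets, pos_tags, token_freq, threshold=100):
    """offsets: dict token -> fitting-offset; pos_tags: dict token -> POS.
    Cross-tabulates average fitting-offset by (POS tag, frequency band)."""
    groups = defaultdict(list)
    for tok, off in offsets.items():
        band = "high-freq" if token_freq[tok] >= threshold else "low-freq"
        groups[(pos_tags[tok], band)].append(off)
    return {key: sum(v) / len(v) for key, v in groups.items()}
```

This kind of breakdown is what surfaces the non-trivial pattern noted above: a function-word cell can show a negative (overfitting) average even in the low-frequency band.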

Prediction Discrepancy

The introduction of prediction discrepancy, a concept novel to this study, adds another layer of analysis. This factor reflects how strongly a token's prediction depends on its context. Tokens with high discrepancy tend to overfit, while those with low discrepancy underfit, revealing intricacies of fitting not captured by frequency alone. Combined with frequency analysis, the metric also refines potential-gain estimates (Figure 4).

Figure 4: Fitting-offset of tokens grouped by prediction discrepancy.
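One way to operationalize "how much a token's prediction depends on its context" is to compare the model's probability for the token with and without the relevant context and average the log-probability gap over the token's occurrences. This is an illustrative proxy, not necessarily the paper's exact metric:

```python
import math

def mean_prediction_discrepancy(occurrences):
    """occurrences: list of (p_with_context, p_without_context) pairs for
    one vocabulary token, i.e. the model's probability for the token when
    conditioned on full context vs. an ablated context. Returns the mean
    absolute log-probability gap; a larger value indicates stronger
    context dependence (illustrative definition, assumed here)."""
    gaps = [abs(math.log(p_ctx) - math.log(p_no))
            for p_ctx, p_no in occurrences]
    return sum(gaps) / len(gaps)
```

Ranking tokens by such a score and then plotting average fitting-offset per rank bucket would reproduce the style of analysis shown in Figure 4.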

Pretrained Model Analysis

Using the pre-trained mBART25 model, the study conducts fine-tuning assessments that further validate the pervasive nature of token-level fitting issues. While pretraining generally mitigates underfitting, fine-tuned models still overfit, notably on function words. Despite the initial robustness of large pre-trained models, specific token categories remain vulnerable to fitting issues and require careful consideration during fine-tuning (Figure 5).

Figure 5: Fitting-offset of tokens grouped by frequency and parts-of-speech.

Broader Implications and Conclusion

Identifying this token-level diversity in fitting prompts a re-evaluation of training strategies and evaluation metrics across different linguistic and contextual settings. The study discusses potential improvements, such as adaptive learning strategies that dynamically target token-specific fitting challenges, offering a promising direction for mitigating these issues. The findings also extend beyond machine translation, suggesting broader implications for seq2seq applications in NLP and other domains.

In conclusion, the investigation into token-level fitting issues reveals complex interactions between linguistic and model-specific factors within seq2seq models, underscoring the sophisticated nature of training dynamics in NLP. Future research could further enhance model efficacy by tailoring methodologies that explicitly address the nuanced fitting concerns illuminated by this study.
