SpacTor-T5: Pre-training T5 Models with Span Corruption and Replaced Token Detection (2401.13160v1)

Published 24 Jan 2024 in cs.LG and cs.CL

Abstract: Pre-training LLMs is known to be extremely resource intensive and often times inefficient, under-utilizing the information encapsulated in the training text sequences. In this paper, we present SpacTor, a new training procedure consisting of (1) a hybrid objective combining span corruption (SC) and token replacement detection (RTD), and (2) a two-stage curriculum that optimizes the hybrid objective over the initial $\tau$ iterations, then transitions to standard SC loss. We show empirically that the effectiveness of the hybrid objective is tied to the two-stage pre-training schedule, and provide extensive analysis on why this is the case. In our experiments with encoder-decoder architectures (T5) on a variety of NLP tasks, SpacTor-T5 yields the same downstream performance as standard SC pre-training, while enabling a 50% reduction in pre-training iterations and 40% reduction in total FLOPs. Alternatively, given the same amount of computing budget, we find that SpacTor results in significantly improved downstream benchmark performance.


Summary

  • The paper presents a hybrid pre-training objective that combines span corruption with replaced token detection to maintain downstream task performance while reducing computation.
  • The method uses an auxiliary generator to produce token replacements and follows a two-stage curriculum, switching from the hybrid objective to span corruption alone after an initial number of iterations.
  • Empirical evaluation shows that SpacTor-T5 matches standard span-corruption pre-training with a 50% reduction in pre-training iterations and 40% fewer total FLOPs on downstream NLP tasks.

Introduction

The field of NLP has made significant strides with the introduction of LLMs leveraging self-supervised pre-training on extensive text corpora. While pre-training on such data is beneficial for a wide array of downstream tasks, it introduces a substantial computational overhead. Addressing these concerns, the present work introduces SpacTor, an efficient pre-training procedure for Text-to-Text Transfer Transformer (T5) models, which combines span corruption (SC) with a replaced token detection (RTD) objective. SpacTor not only maintains high task performance but also significantly reduces the computational resources required for pre-training T5 models.

Methodology

SpacTor's methodology hinges on a two-pronged approach built around a hybrid pre-training objective. The objective combines SC, the standard T5 denoising objective, with RTD, the discriminative objective introduced by ELECTRA. The RTD component adds a generative element: a small auxiliary generator proposes plausible replacements for some tokens in the input, while the T5 encoder acts as a discriminator that detects which tokens were replaced. SpacTor trains with this combined objective for an initial number of iterations (denoted by τ) and then transitions to SC alone. This two-stage schedule responds to the observation that RTD is beneficial in the early stages of pre-training, but its benefit diminishes and eventually becomes counterproductive as training continues.
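To make the two-stage curriculum concrete, below is a minimal, runnable Python sketch of how a training step might assemble its inputs: before iteration τ, span-corrupted inputs additionally pass through a token-replacement stage and carry labels for the detection head; afterwards, only standard span corruption is applied. The constants, the single-span corruption, and the random-replacement "generator" are illustrative assumptions, not the paper's implementation (real SC masks multiple spans with sentinel tokens, and the generator is a small masked language model as in ELECTRA).

```python
"""Toy sketch of SpacTor's two-stage curriculum over integer token ids."""
import random

VOCAB_SIZE = 100
SENTINEL = -1           # stand-in for a T5 sentinel token
NOISE_DENSITY = 0.15    # fraction of tokens removed by span corruption


def span_corrupt(tokens):
    """Mask one contiguous span; return (corrupted input, span targets)."""
    n_mask = max(1, int(len(tokens) * NOISE_DENSITY))
    start = random.randrange(0, max(1, len(tokens) - n_mask))
    corrupted = tokens[:start] + [SENTINEL] + tokens[start + n_mask:]
    targets = tokens[start:start + n_mask]
    return corrupted, targets


def rtd_replace(corrupted, replace_prob=0.1):
    """Toy generator: swap some visible tokens for random ids, label them."""
    replaced, labels = [], []
    for tok in corrupted:
        if tok != SENTINEL and random.random() < replace_prob:
            replaced.append(random.randrange(VOCAB_SIZE))
            labels.append(1)    # token was replaced -> discriminator target 1
        else:
            replaced.append(tok)
            labels.append(0)
    return replaced, labels


def training_step(tokens, step, tau):
    """Stage 1 (step < tau): hybrid SC + RTD inputs. Stage 2: SC only."""
    corrupted, targets = span_corrupt(tokens)
    if step < tau:
        inputs, rtd_labels = rtd_replace(corrupted)
        # The T5 encoder would score rtd_labels (RTD loss) while the decoder
        # reconstructs targets (SC loss); here we only assemble the inputs.
        return {"objective": "SC+RTD", "inputs": inputs,
                "targets": targets, "rtd_labels": rtd_labels}
    return {"objective": "SC", "inputs": corrupted, "targets": targets}


if __name__ == "__main__":
    toks = list(range(20))
    print(training_step(toks, step=0, tau=1000)["objective"])     # SC+RTD
    print(training_step(toks, step=2000, tau=1000)["objective"])  # SC
```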

Empirical Evaluation

The empirical evaluation of SpacTor focused on comparing its performance with standard SC pre-training. The results indicate that SpacTor achieves comparable downstream task performance with a 50% reduction in the number of pre-training iterations and a 40% reduction in overall computational cost, as measured by floating-point operations (FLOPs). Furthermore, when allocated the same computational budget as standard SC pre-training, SpacTor delivers superior downstream benchmark performance, indicating a significant boost in pre-training efficiency.
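A back-of-the-envelope calculation helps explain why halving the iteration count yields a smaller (40%) FLOPs saving: each hybrid-stage step pays extra for the auxiliary generator and the RTD head. The step counts and the 20% per-step overhead below are assumed purely for illustration and are not figures from the paper.

```python
# Hypothetical numbers chosen only to make the arithmetic concrete.
baseline_steps = 1_000_000      # standard SC pre-training iterations
spactor_steps = 500_000         # 50% fewer iterations
tau = 250_000                   # assumed length of the hybrid SC+RTD stage
hybrid_step_cost = 1.2          # assumed relative FLOPs of an SC+RTD step

baseline_flops = baseline_steps * 1.0
spactor_flops = tau * hybrid_step_cost + (spactor_steps - tau) * 1.0

print(f"FLOPs saving: {1 - spactor_flops / baseline_flops:.0%}")  # ~45% here
```

With a longer hybrid stage or a costlier generator the saving shrinks further, which is consistent with the reported FLOPs reduction (40%) being smaller than the iteration reduction (50%).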

Conclusion and Potential Extensions

SpacTor stands out as a novel method to enhance the efficiency of pre-training T5 models. It showcases the potential of hybrid objectives in achieving high performance with reduced computational demands. Going forward, possible extensions could explore a smoother transition from the hybrid SC+RTD objective to the SC-only objective during pre-training, and apply the technique to other architectural variants. Future work might also examine SpacTor's behavior on even larger models and test its scalability across a broader range of NLP tasks.
