SpacTor-T5: Pre-training T5 Models with Span Corruption and Replaced Token Detection (2401.13160v1)
Abstract: Pre-training LLMs is known to be extremely resource intensive and often inefficient, under-utilizing the information encapsulated in the training text sequences. In this paper, we present SpacTor, a new training procedure consisting of (1) a hybrid objective combining span corruption (SC) and replaced token detection (RTD), and (2) a two-stage curriculum that optimizes the hybrid objective over the initial $\tau$ iterations, then transitions to the standard SC loss. We show empirically that the effectiveness of the hybrid objective is tied to the two-stage pre-training schedule, and provide extensive analysis on why this is the case. In our experiments with encoder-decoder architectures (T5) on a variety of NLP tasks, SpacTor-T5 yields the same downstream performance as standard SC pre-training while enabling a 50% reduction in pre-training iterations and a 40% reduction in total FLOPs. Alternatively, given the same compute budget, we find that SpacTor results in significantly improved downstream benchmark performance.
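Below is a minimal sketch of the two-stage curriculum described in the abstract: a hybrid SC + RTD objective for the first $\tau$ iterations, followed by the standard SC loss alone. The helper functions `span_corruption_loss` and `rtd_loss`, and the mixing weight `lambda_rtd`, are hypothetical placeholders for illustration; they are not specified by the abstract and are not the paper's actual implementation.

```python
def spactor_training_loss(batch, step, tau, lambda_rtd,
                          span_corruption_loss, rtd_loss):
    """Illustrative sketch of a SpacTor-style two-stage curriculum.

    Stage 1 (step < tau): hybrid objective = SC loss + weighted RTD loss.
    Stage 2 (step >= tau): standard span-corruption (SC) loss only.
    The loss callables and lambda_rtd are assumed to be supplied by the
    training setup; they are placeholders, not the paper's code.
    """
    sc = span_corruption_loss(batch)              # encoder-decoder SC objective
    if step < tau:
        return sc + lambda_rtd * rtd_loss(batch)  # hybrid stage
    return sc                                     # standard SC stage
```

In this reading, the switch at step $\tau$ is a hard transition rather than a gradual anneal, matching the abstract's description of optimizing the hybrid objective only over the initial $\tau$ iterations.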