SpacTor-T5: Pre-training T5 Models with Span Corruption and Replaced Token Detection (2401.13160v1)

Published 24 Jan 2024 in cs.LG and cs.CL

Abstract: Pre-training LLMs is known to be extremely resource intensive and often times inefficient, under-utilizing the information encapsulated in the training text sequences. In this paper, we present SpacTor, a new training procedure consisting of (1) a hybrid objective combining span corruption (SC) and token replacement detection (RTD), and (2) a two-stage curriculum that optimizes the hybrid objective over the initial $\tau$ iterations, then transitions to standard SC loss. We show empirically that the effectiveness of the hybrid objective is tied to the two-stage pre-training schedule, and provide extensive analysis on why this is the case. In our experiments with encoder-decoder architectures (T5) on a variety of NLP tasks, SpacTor-T5 yields the same downstream performance as standard SC pre-training, while enabling a 50% reduction in pre-training iterations and 40% reduction in total FLOPs. Alternatively, given the same amount of computing budget, we find that SpacTor results in significantly improved downstream benchmark performance.


Summary

  • The paper presents a hybrid pre-training objective that combines span corruption with replaced token detection to maintain downstream task performance while reducing computation.
  • The method uses an auxiliary generator to produce token replacements and follows a two-stage curriculum, switching from the hybrid objective to span corruption alone after an initial number of iterations.
  • Empirical evaluation shows that SpacTor-T5 matches standard span-corruption pre-training with a 50% reduction in pre-training iterations and 40% fewer total FLOPs on downstream NLP tasks.

Introduction

The field of NLP has made significant strides with the introduction of LLMs leveraging self-supervised pre-training on extensive text corpora. While pre-training on such data is beneficial for a wide array of downstream tasks, it introduces a substantial computational overhead. Addressing these concerns, the present work introduces SpacTor, an efficient pre-training procedure for Text-to-Text Transfer Transformer (T5) models, which combines span corruption (SC) with a replaced token detection (RTD) objective. SpacTor not only maintains high task performance but also significantly reduces the computational resources required for pre-training T5 models.

Methodology

SpacTor's methodology hinges on a two-pronged approach built around a hybrid pre-training objective. The objective combines SC, the standard T5 denoising objective, with RTD, the discriminative objective introduced by ELECTRA. The RTD component adds a generative element: a small auxiliary generator proposes plausible replacements for some tokens in the input, while the T5 encoder acts as a discriminator that detects which tokens were replaced. SpacTor trains with this combined objective for an initial number of iterations (denoted by τ) and then transitions to SC alone. This two-stage schedule responds to the observation that RTD is beneficial in the early stages of pre-training, but its benefit diminishes and eventually becomes counterproductive as training continues.
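To make the two-stage curriculum concrete, below is a minimal, runnable Python sketch of how a training step might assemble its inputs: before iteration τ, span-corrupted inputs additionally pass through a token-replacement stage and carry labels for the detection head; afterwards, only standard span corruption is applied. The constants, the single-span corruption, and the random-replacement "generator" are illustrative assumptions, not the paper's implementation (real SC masks multiple spans with sentinel tokens, and the generator is a small masked language model as in ELECTRA).

```python
"""Toy sketch of SpacTor's two-stage curriculum over integer token ids."""
import random

VOCAB_SIZE = 100
SENTINEL = -1           # stand-in for a T5 sentinel token
NOISE_DENSITY = 0.15    # fraction of tokens removed by span corruption


def span_corrupt(tokens):
    """Mask one contiguous span; return (corrupted input, span targets)."""
    n_mask = max(1, int(len(tokens) * NOISE_DENSITY))
    start = random.randrange(0, max(1, len(tokens) - n_mask))
    corrupted = tokens[:start] + [SENTINEL] + tokens[start + n_mask:]
    targets = tokens[start:start + n_mask]
    return corrupted, targets


def rtd_replace(corrupted, replace_prob=0.1):
    """Toy generator: swap some visible tokens for random ids, label them."""
    replaced, labels = [], []
    for tok in corrupted:
        if tok != SENTINEL and random.random() < replace_prob:
            replaced.append(random.randrange(VOCAB_SIZE))
            labels.append(1)    # token was replaced -> discriminator target 1
        else:
            replaced.append(tok)
            labels.append(0)
    return replaced, labels


def training_step(tokens, step, tau):
    """Stage 1 (step < tau): hybrid SC + RTD inputs. Stage 2: SC only."""
    corrupted, targets = span_corrupt(tokens)
    if step < tau:
        inputs, rtd_labels = rtd_replace(corrupted)
        # The T5 encoder would score rtd_labels (RTD loss) while the decoder
        # reconstructs targets (SC loss); here we only assemble the inputs.
        return {"objective": "SC+RTD", "inputs": inputs,
                "targets": targets, "rtd_labels": rtd_labels}
    return {"objective": "SC", "inputs": corrupted, "targets": targets}


if __name__ == "__main__":
    toks = list(range(20))
    print(training_step(toks, step=0, tau=1000)["objective"])     # SC+RTD
    print(training_step(toks, step=2000, tau=1000)["objective"])  # SC
```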

Empirical Evaluation

The empirical evaluation of SpacTor focused on comparing its performance with standard SC pre-training. The results indicate that SpacTor achieves comparable downstream task performance with a 50% reduction in the number of pre-training iterations and a 40% reduction in overall computational cost, as measured by floating-point operations (FLOPs). Furthermore, when allocated the same computational budget as standard SC pre-training, SpacTor delivers superior downstream benchmark performance, indicating a significant boost in pre-training efficiency.
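A back-of-the-envelope calculation helps explain why halving the iteration count yields a smaller (40%) FLOPs saving: each hybrid-stage step pays extra for the auxiliary generator and the RTD head. The step counts and the 20% per-step overhead below are assumed purely for illustration and are not figures from the paper.

```python
# Hypothetical numbers chosen only to make the arithmetic concrete.
baseline_steps = 1_000_000      # standard SC pre-training iterations
spactor_steps = 500_000         # 50% fewer iterations
tau = 250_000                   # assumed length of the hybrid SC+RTD stage
hybrid_step_cost = 1.2          # assumed relative FLOPs of an SC+RTD step

baseline_flops = baseline_steps * 1.0
spactor_flops = tau * hybrid_step_cost + (spactor_steps - tau) * 1.0

print(f"FLOPs saving: {1 - spactor_flops / baseline_flops:.0%}")  # ~45% here
```

With a longer hybrid stage or a costlier generator the saving shrinks further, which is consistent with the reported FLOPs reduction (40%) being smaller than the iteration reduction (50%).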

Conclusion and Potential Extensions

SpacTor stands out as a novel method to enhance the efficiency of pre-training T5 models. It showcases the potential of hybrid objectives in achieving high performance with reduced computational demands. Going forward, possible extensions could explore a smoother transition from the hybrid SC+RTD objective to the SC-only objective during pre-training, and apply the technique to other architectural variants. Future work might also examine SpacTor's behavior on even larger models and test its scalability across a broader range of NLP tasks.
