Sentence Encoders on STILTs: Supplementary Training on Intermediate Labeled-data Tasks
This paper examines an approach to improving sentence encoders through Supplementary Training on Intermediate Labeled-data Tasks (STILTs). The methodology builds on the prevalent paradigm of pretraining models on large-scale unsupervised tasks and then fine-tuning them on specific target tasks. The central hypothesis is that inserting an additional stage of training on data-rich supervised tasks can significantly improve target-task performance, particularly in data-constrained settings.
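The three-stage schedule (unsupervised pretraining, supplementary intermediate-task training, target-task fine-tuning) can be sketched in toy form. The function names and the "training" update below are illustrative placeholders, not the paper's actual fine-tuning code, which updates the full parameters of BERT, GPT, or ELMo:

```python
def train_stage(params, task_labels, lr=0.1):
    """Toy stand-in for one supervised training stage: nudge each shared
    parameter toward the task's mean label. Real STILTs fine-tunes the
    whole encoder with a fresh classifier head per task."""
    target = sum(task_labels) / len(task_labels)
    return [p + lr * (target - p) for p in params]

def stilts_pipeline(pretrained_params, intermediate_labels, target_labels):
    # Stage 1 (unsupervised pretraining) is assumed to have produced
    # `pretrained_params` already, as with off-the-shelf BERT/GPT/ELMo.
    # Stage 2: supplementary training on a data-rich intermediate task
    # such as MNLI or SNLI.
    params = train_stage(pretrained_params, intermediate_labels)
    # Stage 3: final fine-tuning on the (possibly small) target task.
    params = train_stage(params, target_labels)
    return params
```

The point of the sketch is only the ordering: the same shared parameters pass through the intermediate task before the target task ever sees them.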
Main Contributions
- Introduction of STILTs: The authors propose STILTs, a supplementary training phase on intermediate tasks. These tasks provide additional labeled data which can bridge the gap between initial unsupervised pretraining and final supervised fine-tuning. The result is a potentially more robust and effective target task model.
- Application to Existing Models: STILTs is applied to three well-known pretrained sentence encoders: BERT, GPT, and ELMo. The authors use a selection of intermediate tasks such as MNLI, SNLI, QQP, and a custom fake-sentence-detection task to empirically validate the efficacy of STILTs.
- Performance on GLUE Benchmark: STILTs demonstrates substantial performance gains on the GLUE benchmark, particularly in tasks with limited training data. Notably, BERT fine-tuned with STILTs achieves a GLUE score of 81.8, setting a new state of the art at the time.
- Reduced Variance in Training: Additionally, STILTs reduces variance in model performance across random restarts, which is particularly beneficial for tasks with small datasets where fine-tuning can be unstable.
- Data-Constrained Regimes: The paper includes experiments that simulate data-constrained scenarios by limiting the training set to 1k and 5k examples. In such regimes, STILTs provides even more pronounced performance gains, reinforcing its utility in real-world situations where labeled data is often scarce.
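The data-constrained experiments above can be simulated by subsampling an existing training set to a fixed size with a fixed seed (the helper name is illustrative; the paper does not specify this exact code):

```python
import random

def subsample(dataset, n, seed=0):
    """Simulate a data-constrained regime by keeping only n labeled
    examples, drawn reproducibly with a fixed seed."""
    rng = random.Random(seed)
    return rng.sample(dataset, min(n, len(dataset)))

# e.g. restrict a large training set to the 1k-example regime
small_train = subsample(list(range(100000)), 1000)
```

Fixing the seed matters here: the paper reports variance across random restarts, so comparisons between training recipes should hold the subsample constant.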
Implications and Future Directions
The findings suggest that STILTs can be a valuable extension to the current pretraining and fine-tuning paradigm, offering improvements in both accuracy and stability. The approach makes more effective use of intermediate tasks that either share structural similarity with the target task or are simply data-rich.
Future work could explore a broader selection of intermediate tasks and study the interactions between different intermediate and target tasks to fully exploit STILTs' potential. Additionally, this approach could be valuable beyond natural language tasks, extending to other domains involving structured prediction problems.
Overall, STILTs represents a strategic enhancement to model training processes, advocating for a more nuanced view of transfer learning with pretrained language models.