Sentence Encoders on STILTs: Supplementary Training on Intermediate Labeled-data Tasks (1811.01088v2)

Published 2 Nov 2018 in cs.CL

Abstract: Pretraining sentence encoders with language modeling and related unsupervised tasks has recently been shown to be very effective for language understanding tasks. By supplementing language model-style pretraining with further training on data-rich supervised tasks, such as natural language inference, we obtain additional performance improvements on the GLUE benchmark. Applying supplementary training on BERT (Devlin et al., 2018), we attain a GLUE score of 81.8---the state of the art (as of 02/24/2019) and a 1.4 point improvement over BERT. We also observe reduced variance across random restarts in this setting. Our approach yields similar improvements when applied to ELMo (Peters et al., 2018a) and Radford et al. (2018)'s model. In addition, the benefits of supplementary training are particularly pronounced in data-constrained regimes, as we show in experiments with artificially limited training data.

Sentence Encoders on STILTs: Supplementary Training on Intermediate Labeled-data Tasks

This paper examines an innovative approach to enhancing the performance of sentence encoders through Supplementary Training on Intermediate Labeled-data Tasks (STILTs). The methodology builds upon the prevalent paradigm of pretraining models on large-scale unsupervised tasks followed by fine-tuning on specific target tasks. The central hypothesis is that incorporating an additional stage of training on data-rich supervised tasks can significantly improve model performance, particularly in data-constrained settings.
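
The recipe is thus a three-stage pipeline: unsupervised pretraining, supplementary training on a data-rich intermediate labeled task, and fine-tuning on the target task. The sketch below shows one way to reproduce that flow with the HuggingFace transformers and datasets libraries; the checkpoint, the MNLI-to-RTE task pairing, and the hyperparameters are illustrative assumptions, not the authors' exact configuration or codebase.

```python
# Illustrative STILTs pipeline (assumed setup, not the paper's original code).
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def encode(batch, keys):
    # Tokenize a sentence-pair batch; padding/length settings are illustrative.
    return tokenizer(*[batch[k] for k in keys],
                     truncation=True, padding="max_length", max_length=128)

# Stage 1: start from an unsupervised pretrained checkpoint (here, BERT-base).
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=3)

# Stage 2 (STILTs): supplementary training on a data-rich intermediate task (MNLI).
mnli = load_dataset("glue", "mnli").map(
    lambda b: encode(b, ("premise", "hypothesis")), batched=True)
Trainer(model=model,
        args=TrainingArguments("stilts-mnli", num_train_epochs=3,
                               per_device_train_batch_size=32),
        train_dataset=mnli["train"]).train()
model.save_pretrained("stilts-mnli-encoder")

# Stage 3: fine-tune on the target task (RTE), reusing the supplementarily
# trained encoder but a freshly sized head (3 MNLI labels -> 2 RTE labels).
target = AutoModelForSequenceClassification.from_pretrained(
    "stilts-mnli-encoder", num_labels=2, ignore_mismatched_sizes=True)
rte = load_dataset("glue", "rte").map(
    lambda b: encode(b, ("sentence1", "sentence2")), batched=True)
Trainer(model=target,
        args=TrainingArguments("stilts-rte", num_train_epochs=3,
                               per_device_train_batch_size=16),
        train_dataset=rte["train"]).train()
```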

Main Contributions

  1. Introduction of STILTs: The authors propose STILTs, a supplementary training phase on intermediate tasks. These tasks provide additional labeled data which can bridge the gap between initial unsupervised pretraining and final supervised fine-tuning. The result is a potentially more robust and effective target task model.
  2. Application to Existing Models: STILTs is applied to three well-known pretrained sentence encoders: BERT, GPT, and ELMo. The authors use a selection of intermediate tasks such as MNLI, SNLI, QQP, and a custom fake-sentence-detection task to empirically validate the efficacy of STILTs.
  3. Performance on GLUE Benchmark: STILTs demonstrates substantial performance gains on the GLUE benchmark, particularly in tasks with limited training data. Notably, BERT fine-tuned with STILTs achieves a GLUE score of 81.8, setting a new state of the art at the time.
  4. Reduced Variance in Training: Additionally, STILTs reduces variance in model performance across random restarts, which is particularly beneficial for tasks with small datasets where fine-tuning can be unstable.
  5. Data-Constrained Regimes: The paper includes experiments that simulate data-constrained scenarios by limiting the training set to 1k and 5k examples, as sketched after this list. In such regimes, STILTs provides even more pronounced performance gains, reinforcing its utility in real-world situations where labeled data is often scarce.
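
A minimal sketch of how such a regime can be simulated, assuming the HuggingFace datasets library and an arbitrary target task and seed (the paper's exact subsampling protocol may differ): cap the target-task training split before fine-tuning, while leaving the intermediate-task stage untouched.

```python
from datasets import load_dataset

# Simulate the 1k-example regime by subsampling the target-task training split;
# task choice and seed are illustrative assumptions.
rte_train = load_dataset("glue", "rte", split="train")
limited_train = rte_train.shuffle(seed=0).select(range(1000))
assert len(limited_train) == 1000  # fine-tune on this reduced split only
```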

Implications and Future Directions

The findings suggest that STILTs can be a valuable extension to the current pretraining and fine-tuning paradigm, offering improvements in both accuracy and stability. The approach makes more effective use of intermediate tasks that share structural similarity with the target tasks or are simply data-rich.

Future work could explore a broader selection of intermediate tasks and study the interactions between different intermediate and target tasks to fully exploit STILTs' potential. Additionally, this approach could be valuable beyond natural language tasks, extending to other domains involving structured prediction problems.

Overall, STILTs represents a strategic enhancement to model training processes, advocating for a more nuanced view of transfer learning in the context of deep language models.

Authors (3)
  1. Jason Phang (40 papers)
  2. Samuel R. Bowman (103 papers)
  3. Thibault Févry (8 papers)
Citations (457)