Supervised Contrastive Learning for Pre-trained LLM Fine-tuning
In this paper, the authors propose an approach to fine-tuning pre-trained LLMs that integrates a supervised contrastive learning (SCL) objective alongside the conventional cross-entropy (CE) loss. The approach addresses several known shortcomings of CE, such as sub-optimal generalization and instability across runs, which are particularly pronounced in few-shot learning settings.
Approach and Methodology
The proposed objective adds a novel SCL term designed to refine representation learning by pulling samples of the same class closer together in embedding space while pushing samples of different classes apart. The idea draws on successful self-supervised contrastive strategies from other domains but applies them in a supervised setting for NLP tasks. The SCL term is combined with the CE loss as a weighted sum, so tuning the hyperparameters, notably the temperature τ and the weighting coefficient λ, plays a crucial role in performance.
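To make the objective concrete, the sketch below gives a minimal PyTorch version of the combined loss, assuming `features` are ℓ2-normalizable sentence embeddings from the encoder and `logits` come from the classification head; the function names and the default values for λ and τ are illustrative placeholders rather than the paper's tuned settings.

```python
import torch
import torch.nn.functional as F

def scl_loss(features, labels, tau=0.3):
    """Supervised contrastive term: for each anchor, treat same-class
    examples in the batch as positives and all other examples as negatives."""
    features = F.normalize(features, dim=1)              # [batch, dim]
    sim = features @ features.T / tau                    # pairwise scaled similarities
    self_mask = torch.eye(len(labels), dtype=torch.bool, device=features.device)
    sim = sim.masked_fill(self_mask, -1e9)               # drop self-similarity from the softmax
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    pos_counts = pos_mask.sum(dim=1).clamp(min=1)        # guard against anchors with no positives
    return (-(log_prob * pos_mask).sum(dim=1) / pos_counts).mean()

def combined_loss(logits, features, labels, lam=0.9, tau=0.3):
    """(1 - lam) * cross-entropy + lam * supervised contrastive term."""
    return (1 - lam) * F.cross_entropy(logits, labels) + lam * scl_loss(features, labels, tau)
```

Setting λ = 0 in this formulation recovers plain CE fine-tuning, which makes the contrastive term easy to ablate.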
Experimental Evaluation and Results
The experiments primarily use the GLUE benchmark, covering tasks ranging from sentiment analysis (SST-2) to natural language inference (QNLI, MNLI). The findings are compelling:
- Few-shot Learning Improvements:
- For 20 training examples, the SCL-enhanced model improved QNLI results by 10.7 points compared to baseline CE, indicating robust performance with minimal data.
- As the training size increases (N=100, 1000), improvements persist but with diminishing returns, highlighting the approach's strength in data-scarce scenarios.
- Robustness to Noise:
- Robustness was evaluated on augmented noisy datasets constructed via back-translation at varying noise levels (see the back-translation sketch after this list). SCL significantly improved performance under noisy conditions, especially for inference tasks like MNLI, achieving up to a 7-point gain at higher noise levels.
- Full Dataset Performance:
- Gains in the fully supervised setting were less pronounced than in few-shot settings but still notable, including a 3.5-point increase in QNLI accuracy, suggesting that SCL also benefits conventional data-rich scenarios.
- Generalization to Related Tasks:
- Transfer learning experiments showed gains when models fine-tuned with the SCL objective were applied to related datasets such as Amazon-2 and Yelp-2, indicating improved generalization.
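For the noise-robustness evaluation referenced above, the paper builds its noisy sets with back-translation. As a rough illustration of that augmentation step, the sketch below round-trips English sentences through German with sampled decoding, using the sampling temperature as the noise-level knob; the MarianMT checkpoints, the pivot language, and the temperature value are assumptions for illustration, not necessarily the models or settings used in the paper.

```python
from transformers import MarianMTModel, MarianTokenizer

def _load(src, tgt):
    # Hypothetical choice of translation models; any en<->pivot pair would do.
    name = f"Helsinki-NLP/opus-mt-{src}-{tgt}"
    return MarianTokenizer.from_pretrained(name), MarianMTModel.from_pretrained(name)

def _translate(sentences, tokenizer, model, temperature):
    batch = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)
    # Sampled decoding (rather than greedy) injects paraphrase noise;
    # higher temperatures yield noisier augmented examples.
    out = model.generate(**batch, do_sample=True, temperature=temperature, max_length=128)
    return tokenizer.batch_decode(out, skip_special_tokens=True)

def back_translate(sentences, temperature=0.9):
    """English -> German -> English round trip as a noisy augmentation."""
    en_de_tok, en_de_model = _load("en", "de")
    de_en_tok, de_en_model = _load("de", "en")
    pivot = _translate(sentences, en_de_tok, en_de_model, temperature)
    return _translate(pivot, de_en_tok, de_en_model, temperature)
```

Varying the temperature produces progressively noisier copies of the data, mirroring the "varying noise levels" described in the robustness experiments.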
Implications and Future Prospects
Incorporating SCL into the fine-tuning pipeline of pre-trained LLMs not only improves convergence stability and consistency across runs but also promises more robust and generalizable models, particularly in few-shot applications. It also opens avenues for further exploring contrastive learning mechanisms, potentially benefiting semi-supervised and unsupervised NLP applications.
Future research might focus on scaling this approach, perhaps by incorporating automated data augmentation or optimizing batch sizes for contrastive learning, to further improve its effectiveness across a broader range of natural language understanding tasks and datasets.