InfoCSE: A New Direction in Contrastive Sentence Embeddings
The paper presents InfoCSE, a contrastive learning framework for unsupervised sentence representation learning. It addresses limitations of prior models such as SimCSE by adding a sentence reconstruction task on top of the base contrastive objective. The reconstruction task is implemented through an auxiliary network designed to avoid the pitfalls observed when earlier methods combined these objectives, yielding richer semantic representations.
Core Contributions and Methodology
InfoCSE introduces an auxiliary network to tackle the over-update problem caused by directly optimizing the masked language modeling (MLM) objective within a contrastive learning framework. Whereas SimCSE suffered performance drops on semantic textual similarity (STS) tasks when an MLM objective was added, InfoCSE limits the gradient updates propagated back to the primary BERT encoder: the auxiliary network back-propagates to the encoder only through the [CLS] embedding. This disentangles the reconstructive and contrastive objectives, stabilizes the encoder's parameter updates, and makes the [CLS] embedding useful for both tasks.
The auxiliary network is an eight-layer transformer that combines outputs from a frozen six-layer subset of the BERT encoder with the [CLS] vector from its twelfth layer. This design lets the sentence representation absorb dense semantic information without exposing the primary model to the variance introduced by MLM optimization. Joint training relies on gradient detachment at selected layers, so the encoder receives precise updates without being destabilized by the reconstruction loss.
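The gradient-routing idea can be illustrated with a minimal PyTorch sketch. This is not the authors' code, and the tiny modules below (`TinyEncoder`, `AuxMLMHead`) are invented stand-ins; the point is only the `.detach()` call, which cuts the gradient path through the token states so the auxiliary head reaches the encoder solely via [CLS].

```python
import torch
import torch.nn as nn

HIDDEN = 16  # toy hidden size, not the paper's

class TinyEncoder(nn.Module):
    """Stand-in for the BERT encoder."""
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(HIDDEN, HIDDEN)

    def forward(self, x):            # x: (batch, seq, hidden)
        h = torch.tanh(self.layer(x))
        cls = h[:, 0]                # [CLS] vector from the last layer
        return h, cls

class AuxMLMHead(nn.Module):
    """Stand-in for the auxiliary reconstruction network."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(HIDDEN, HIDDEN)

    def forward(self, token_states, cls):
        # Detach token states: no encoder update flows through them.
        fused = self.proj(token_states.detach() + cls.unsqueeze(1))
        return fused

enc, head = TinyEncoder(), AuxMLMHead()
x = torch.randn(2, 5, HIDDEN)
h, cls = enc(x)
head(h, cls).sum().backward()

# The encoder still receives gradients, but only along the [CLS] path.
print(enc.layer.weight.grad is not None)  # True
```

If `cls` were detached as well, the encoder's gradient would be `None`; routing it only through [CLS] is what keeps the reconstruction signal from over-updating the encoder.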
Experimental Validation
The experimental evaluation demonstrates that InfoCSE achieves state-of-the-art performance in unsupervised sentence representation tasks. In semantic textual similarity evaluations, InfoCSE surpassed SimCSE, achieving an average Spearman correlation improvement of 2.60% on the BERT-base model and 1.77% on the BERT-large model across seven STS datasets. Moreover, InfoCSE outperformed competing models on the BEIR benchmark, indicating superior generalization to diverse retrieval scenarios.
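For context on how such STS numbers are computed (a general illustration, not code from the paper): each benchmark scores a model by the Spearman correlation between its predicted sentence similarities and human ratings, i.e. the Pearson correlation of the ranks. A small NumPy sketch, assuming no tied scores:

```python
import numpy as np

def spearman(a, b):
    """Spearman correlation = Pearson correlation of the ranks (no ties)."""
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    ra -= ra.mean()
    rb -= rb.mean()
    return float((ra @ rb) / (np.linalg.norm(ra) * np.linalg.norm(rb)))

# Toy data: model cosine similarities vs. gold human judgments.
pred = np.array([0.91, 0.32, 0.75, 0.10])
gold = np.array([4.8, 1.5, 3.9, 0.7])
print(round(spearman(pred, gold), 2))  # 1.0 -- the rankings agree exactly
```

Because only the ranking matters, Spearman rewards embeddings whose similarity ordering matches human judgment, regardless of the absolute cosine values.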
Extensive ablation studies confirm the auxiliary network's crucial role, and show that pre-training this component is important for initializing the system and ensuring robust joint learning. The ablations also validate the MLM mask rate and the use of gradient detachment as impactful design choices, each tuning the balance between the reconstructive and contrastive objectives.
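To make the mask-rate knob concrete, here is a hedged sketch of standard MLM masking (the rate and token id below are illustrative assumptions, not the paper's settings): a higher mask rate forces the [CLS] embedding to carry more of the sentence's content, since more tokens must be reconstructed from it.

```python
import random

MASK_ID = 103  # BERT's [MASK] token id

def mask_tokens(token_ids, mask_rate, seed=0):
    """Replace each token with [MASK] at the given rate; return inputs and labels."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tid in token_ids:
        if rng.random() < mask_rate:
            masked.append(MASK_ID)
            labels.append(tid)    # the MLM head must predict the original token
        else:
            masked.append(tid)
            labels.append(-100)   # conventionally ignored by the MLM loss
    return masked, labels

ids = [2023, 2003, 1037, 7099, 6251, 1012]
masked, labels = mask_tokens(ids, mask_rate=0.3)
print(masked, labels)
```

Sweeping `mask_rate` trades off reconstruction difficulty against input corruption, which is exactly the balance the ablations probe.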
Implications and Future Directions
The InfoCSE framework marks a significant step for contrastive sentence representation learning by harmonizing reconstructive and discriminative objectives within a unified semantic space. This design not only improves the quality of sentence embeddings but also opens avenues for integrating further auxiliary objectives. Its demonstrated compatibility with diverse objectives, such as MLM and Replaced Token Detection (RTD), suggests that jointly optimized auxiliary tasks can strengthen semantic encoding in contrastive settings.
Future research may extend the framework by exploring different auxiliary objectives and refining pre-training techniques to improve transfer learning. The principles introduced by InfoCSE could also guide supervised sentence embedding frameworks toward leveraging nuanced semantic relations without compromising their existing training objectives.
Thus, InfoCSE offers a robust blueprint for improving unsupervised sentence embeddings through careful architectural design and training procedures that yield semantically rich, task-agnostic sentence representations.