InfoCSE: Enhanced Self-Supervised Sentence Embeddings
- InfoCSE is a self-supervised learning framework that refines sentence embeddings using mutual reinforcement or auxiliary masked language modeling.
- It employs intra-document clustering to dynamically mine positive sentence pairs, thereby enhancing the quality of contrastive learning.
- Empirical results demonstrate significant gains in retrieval and semantic similarity, outperforming traditional heuristic-based approaches.
InfoCSE encompasses two distinct but related frameworks for self-supervised learning of sentence embeddings: (1) "reInforced self-supervised Contrastive learning of Sentence Embeddings" (Yang et al., 2022), which emphasizes mutual reinforcement between model and annotation, and (2) "Information-aggregated Contrastive Learning of Sentence Embeddings" (Wu et al., 2022), which augments a contrastive objective with an auxiliary masked language modeling (MLM) module to improve the information density of the sentence representation. Both approaches aim to overcome the limitations of prior heuristic or weakly-constrained contrastive learning schemes, yielding significant improvements in representation quality as measured by retrieval and semantic similarity tasks.
1. Motivation and High-Level Concepts
Both InfoCSE frameworks address deficiencies in existing contrastive sentence embedding methods, which typically generate positive pairs via heuristics (e.g., data augmentation or local context) or enforce only that different stochastically encoded views of the same sentence map to similar embeddings. These strategies can produce trivial, coarse, or false positive pairs and may not sufficiently encourage embeddings to capture rich semantics.
The mutual reinforcement approach (Yang et al., 2022) iteratively leverages the current encoder to generate plausible positive pairs through intra-document clustering, thereby refining both the annotation and the representation in a feedback loop. The information-aggregated approach (Wu et al., 2022) tackles the weak constraint of contrastive learning by forcing the embedding to also enable reconstructive capability, introducing an auxiliary MLM head operating on the [CLS] representation without harming the contrastive geometry.
2. Core Algorithms and Model Architectures
InfoCSE (Mutual Reinforcement; Yang et al., 2022)
- Initialization: An off-the-shelf self-supervised contrastive learning approach (CONPONO, SimCSE, or ICT) provides the initial encoder f_0.
- Intra-Document Clustering (IDC): For each document D with sentences s_1, …, s_n, compute embeddings h_i = f(s_i), construct a fully connected graph with edge weights w_ij = cos(h_i, h_j), prune each node to its top-1 neighbor, and take the connected components of the pruned graph as clusters.
- Positive Pair Mining: Every unordered sentence pair within a cluster is treated as a positive.
- Contrastive Update: The InfoNCE loss is minimized over all positive pairs (s_i, s_j) mined from D:
  L = -log [ exp(cos(h_i, h_j)/τ) / Σ_{k≠i} exp(cos(h_i, h_k)/τ) ],
  where τ is a temperature and the denominator sums over in-batch negatives.
- Iteration: Clustering and update steps repeat until validation Recall@5 (R@5) converges (2–3 iterations suffice).
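The IDC and positive-mining steps above can be sketched in a few lines. This is a minimal illustration assuming precomputed sentence embeddings; `idc_positive_pairs` is a hypothetical helper name, and a small union-find stands in for a full connected-components routine:

```python
# Sketch of Intra-Document Clustering (IDC) positive mining over one
# document's sentence embeddings; names are illustrative, not from the paper.
import numpy as np

def idc_positive_pairs(embeddings: np.ndarray):
    """Mine positive sentence pairs inside one document.

    Builds a fully connected cosine-similarity graph, keeps only each
    node's top-1 neighbor edge, and treats every unordered pair inside a
    connected component as a positive pair.
    """
    n = len(embeddings)
    # Cosine similarity matrix over L2-normalized embeddings.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = normed @ normed.T
    np.fill_diagonal(sim, -np.inf)  # exclude self-edges

    # Top-1 pruning: union each sentence with its nearest neighbor.
    parent = list(range(n))
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a
    for i in range(n):
        j = int(np.argmax(sim[i]))
        parent[find(i)] = find(j)

    # Connected components -> clusters -> unordered positive pairs.
    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return [(a, b) for c in clusters.values()
            for ai, a in enumerate(c) for b in c[ai + 1:]]
```

With two near-duplicate pairs in a four-sentence "document", the top-1 graph splits into two components, so exactly those two pairs are mined as positives.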
InfoCSE (Information-Aggregated; Wu et al., 2022)
- Base Encoder: A pre-trained BERT encoder (BERT-base, 12 layers, or BERT-large, 24 layers).
- Contrastive Objective: Two dropout-augmented views of each sentence x_i (embeddings h_i and h_i') are pulled together, with the other in-batch views as negatives, using cosine similarity with temperature τ. For a batch of size N:
  L_cl = -(1/N) Σ_i log [ exp(cos(h_i, h_i')/τ) / Σ_{j=1}^{N} exp(cos(h_i, h_j')/τ) ]
- Auxiliary MLM Network: An 8-layer Transformer; the first 6 layers are initialized from BERT and frozen, and the last 2 are trainable. The input is the [CLS] vector concatenated with the detached 6th-layer token embeddings, fed through the final two Transformer layers.
- Auxiliary Task: A cross-entropy loss over the MLM predictions for masked tokens:
  L_mlm = -Σ_{m ∈ masked} log softmax(W h_m)[x_m],
  where W is the BERT output projection matrix and h_m is the auxiliary network's hidden state at masked position m.
- Total Loss:
  L = L_cl + λ · L_mlm,
  with λ kept small, since the MLM gradient magnitude is larger than the contrastive one.
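The joint objective can be sketched with numpy. The values of τ and λ and the function names here are illustrative defaults, not the papers' exact settings:

```python
# Hedged sketch of the information-aggregated objective
# L = L_cl + lambda * L_mlm over a batch of dropout-view embeddings.
import numpy as np

def info_nce(h, h_prime, tau=0.05):
    """SimCSE-style in-batch InfoNCE over two dropout views, shape (N, d)."""
    h = h / np.linalg.norm(h, axis=1, keepdims=True)
    hp = h_prime / np.linalg.norm(h_prime, axis=1, keepdims=True)
    logits = (h @ hp.T) / tau                       # (N, N) scaled cosines
    logits -= logits.max(axis=1, keepdims=True)     # numerical stabilization
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))              # positives on the diagonal

def mlm_cross_entropy(logits, targets):
    """Cross-entropy over masked-token logits (M, V) and target ids (M,)."""
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(log_prob[np.arange(len(targets)), targets])

def total_loss(h, h_prime, mlm_logits, mlm_targets, lam=1e-3):
    # A small lam keeps the larger-magnitude MLM gradient from
    # dominating the contrastive geometry.
    return info_nce(h, h_prime) + lam * mlm_cross_entropy(mlm_logits, mlm_targets)
```

In a real implementation both terms would be computed inside one autograd graph; the numpy version only shows how the scalar objective is assembled.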
3. Training Regimes and Implementation Details
Mutual Reinforcement (Yang et al., 2022)
- Backbone: BERT-base, 12 layers, 768 hidden size, 12 heads.
- Input Processing: Sentences truncated/padded to 32 tokens; [CLS] or mean-pooling for representation; no additional projection head.
- Optimization: Adam optimizer, batch size 3200.
- IDC: Partition-based (top-1 neighbor) clustering.
- Termination: Stop after R@5 on validation set plateaus.
Information-Aggregated (Wu et al., 2022)
- Auxiliary MLM Net Pre-training: On BookCorpus + Wikipedia, 8 epochs, Adam optimizer, batch size 1024 (8×V100 GPUs).
- Contrastive+MLM Joint Training: On 1M Wikipedia sentences, Adam optimizer, batch size 64 (1×RTX 3090 GPU); evaluated on the STS-B dev set every 125 steps.
- Mask Rate: MLM masking optimal at ~40%.
- Pooling: Both raw [CLS] and BERT pooler variants tested; raw [CLS] marginally superior.
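The ~40% mask rate can be applied with a simple helper. `mask_tokens` and its parameters are illustrative, not the papers' tokenizer-level masking (which would also handle special tokens):

```python
# Illustrative token-masking helper for an MLM mask rate of ~40%;
# the mask-token id and rate are parameters, not the papers' exact setup.
import random

def mask_tokens(token_ids, mask_id, rate=0.4, seed=None):
    """Replace each position with mask_id with probability `rate`.

    Returns (masked_ids, positions), where positions lists the masked
    indices whose original ids become the MLM prediction targets.
    """
    rng = random.Random(seed)
    masked, positions = [], []
    for i, t in enumerate(token_ids):
        if rng.random() < rate:
            masked.append(mask_id)
            positions.append(i)
        else:
            masked.append(t)
    return masked, positions
```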
4. Empirical Performance and Ablation Studies
Mutual Reinforcement (Yang et al., 2022)
| Scenario | InfoCSE R@20 | Best Baseline | Absolute Gain |
|---|---|---|---|
| News (Zero-Shot) | 0.786 | DeCLUTR 0.650 | +0.136 |
| Web Doc (Zero-Shot) | 0.554 | DeCLUTR 0.510 | +0.044 |
| Web Browsing (Zero-Shot) | 0.136 | DeCLUTR 0.126 | +0.010 |
- Few-Shot (1k labels, R@5):
- News: 0.648 vs baseline 0.582 (+11%)
- Web Doc: 0.464 vs 0.437 (+6%)
- Web Browsing: 0.089 vs 0.080 (+11%)
Information-Aggregated (Wu et al., 2022)
| Model | Avg. STS (BERT-base) | Avg. STS (BERT-large) |
|---|---|---|
| SimCSE | 76.25 | 78.41 |
| InfoCSE | 78.85 | 80.18 |
| Δ vs. SimCSE | +2.60 | +1.77 |
- STS-B (BERT-base): 82.00 (vs SimCSE 76.85, +5.15 pts)
- Ablations:
- Pre-training the auxiliary MLM network is crucial (STS-B drops from 85.49 to 83.73 if omitted).
- Removing the MLM head reduces the model to SimCSE; removing the contrastive loss collapses performance entirely.
- Detaching gradients through frozen layers is beneficial.
- Mask rate 40% optimal.
- Small λ essential; large λ degrades contrastive learning.
- Raw [CLS] pooling is marginally better than the post-pooler output.
- Co-training with DiffCSE’s replaced-token-detection head yields further improvements.
5. Comparative Analyses and Baseline Context
Both InfoCSE frameworks outperform view-based (SimCSE, ConSERT, Mirror-BERT), context-based (ICT, CPC, DeCLUTR), and neighbor-based (CONPONO) contrastive methods across retrieval and semantic similarity tasks, both in zero-shot and few-shot regimes (Yang et al., 2022, Wu et al., 2022). Mutual reinforcement and information-aggregation strategies address the core weaknesses of prior positive sampling and information sparsity.
6. Limitations and Prospects for Future Research
- Implementation Overhead: IDC clustering introduces modest computational cost, mitigated by graph sparsification and efficient connected component analysis.
- Model Architecture: Framework agnostic; extension to alternative backbones (RoBERTa, ELECTRA, DeBERTa), adaptive projection heads, or multilingual/multimodal encoders is feasible.
- Auxiliary Objectives: Integration of span-prediction, sentence-order prediction, or combinations (as with replaced-token-detection) may further enhance embedding quality.
- Supervised Fine-Tuning: Exploration of NLI fine-tuning and semi-supervised pretrain↔fine-tune loops (InfoCSE++) shows additional gains in data-scarce settings.
- Clustering Refinement: A dynamic neighbor count in the pruning step, graph neural networks, or improved InfoNCE variants (temperature scaling, momentum encoders) are proposed for future work.
7. Synthesis and Impact
InfoCSE frameworks demonstrate that tightly integrated self-supervision cycles (via mutual reinforcement or auxiliary MLM aggregation) enable the learning of richer, more transferable sentence representations. Both mutual reinforcement and information-aggregation approaches produce state-of-the-art performance in unsupervised and semi-supervised benchmarks, affirming the efficacy of model-driven positive mining and information injection mechanisms. The results substantiate the view that careful melding of information aggregation and contrastive objectives is a robust pathway toward high-quality sentence embeddings (Yang et al., 2022, Wu et al., 2022).