
InfoCSE: Enhanced Self-Supervised Sentence Embeddings

Updated 14 January 2026
  • InfoCSE is a self-supervised learning framework that refines sentence embeddings using mutual reinforcement or auxiliary masked language modeling.
  • It employs intra-document clustering to dynamically mine positive sentence pairs, thereby enhancing the quality of contrastive learning.
  • Empirical results demonstrate significant gains in retrieval and semantic similarity, outperforming traditional heuristic-based approaches.

InfoCSE encompasses two distinct but related frameworks for self-supervised learning of sentence embeddings: (1) "reInforced self-supervised Contrastive learning of Sentence Embeddings" (Yang et al., 2022), which emphasizes mutual reinforcement between model and annotation, and (2) "Information-aggregated Contrastive Learning of Sentence Embeddings" (Wu et al., 2022), which augments a contrastive objective with an auxiliary masked language modeling (MLM) module to improve the information density of the sentence representation. Both approaches aim to overcome the limitations of prior heuristic or weakly-constrained contrastive learning schemes, yielding significant improvements in representation quality as measured by retrieval and semantic similarity tasks.

1. Motivation and High-Level Concepts

Both InfoCSE frameworks address deficiencies in existing contrastive sentence embedding methods, which typically generate positive pairs via heuristics (e.g., data augmentation or local context) or enforce only that different stochastically encoded views of the same sentence map to similar embeddings. These strategies can produce trivial, coarse, or false positive pairs and may not sufficiently encourage embeddings to capture rich semantics.

The mutual reinforcement approach (Yang et al., 2022) iteratively leverages the current encoder to generate plausible positive pairs through intra-document clustering, thereby refining both the annotation and the representation in a feedback loop. The information-aggregated approach (Wu et al., 2022) tackles the weak constraint of contrastive learning by forcing the embedding to also enable reconstructive capability, introducing an auxiliary MLM head operating on the [CLS] representation without harming the contrastive geometry.

2. Core Algorithms and Model Architectures

  • Initialization: An off-the-shelf self-supervised contrastive learning approach (CONPONO, SimCSE, or ICT) provides the initial encoder f^(0).
  • Intra-Document Clustering (IDC): For each document d with sentences s_1, …, s_m, compute embeddings, construct a fully connected graph with edge weights w_ij = ⟨f^(t)(s_i), f^(t)(s_j)⟩, prune each node to its top-1 neighbor, and take the connected components as clusters.
  • Positive Pair Mining: Every unordered sentence pair within a cluster is treated as a positive.
  • Contrastive Update: InfoNCE loss is minimized:

\ell(a, p) = -\log \frac{\exp(\langle f^{(t)}(a), f^{(t)}(p) \rangle)}{\sum_{n \neq a} \exp(\langle f^{(t)}(a), f^{(t)}(n) \rangle)}

for all positive pairs (a, p) in the mined set \mathcal{P}.

  • Iteration: Clustering and update steps repeat until validation Recall@5 (R@5) converges (2–3 iterations suffice).
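The IDC and positive-mining steps above can be sketched in minimal pure Python. This is an illustration under our own assumptions (function names, union-find bookkeeping, and raw inner-product similarity are ours, not the paper's code); a real implementation would operate on encoder outputs over full documents.

```python
from itertools import combinations


def intra_document_clusters(embeddings):
    """Cluster one document's sentences: prune the similarity graph to
    each node's top-1 neighbor, then take connected components."""
    n = len(embeddings)
    parent = list(range(n))  # union-find over sentence indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    for i in range(n):
        # keep only the edge to the most similar other sentence
        best = max((j for j in range(n) if j != i),
                   key=lambda j: dot(embeddings[i], embeddings[j]))
        parent[find(i)] = find(best)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())


def mine_positive_pairs(clusters):
    """Every unordered sentence pair inside a cluster is a positive."""
    return [pair for c in clusters for pair in combinations(sorted(c), 2)]
```

On a toy document whose first two and last two sentence embeddings point in similar directions, the top-1 pruning yields two components, and each component contributes its unordered pairs as positives.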
  • Base Encoder: A pre-trained 12-layer BERT (base or large).
  • Contrastive Objective: Two dropout-augmented views of sentence x (outputs h, h^+) are pushed together, with the other in-batch elements serving as negatives; similarity is cosine with temperature scaling. The loss for a batch of size N is:

\mathcal{L}_{\mathrm{contrastive}} = -\sum_{i=1}^{N} \log \frac{\exp(\mathrm{sim}(h_i, h_i^+)/\tau)}{\sum_{j=1}^{N} \exp(\mathrm{sim}(h_i, h_j^+)/\tau)}

  • Auxiliary MLM Network: An 8-layer Transformer; the first 6 layers are initialized from BERT and frozen, and the last 2 are trainable. The input is the [CLS] vector concatenated with the detached 6th-layer token embeddings, fed through the final two Transformer layers.
  • Auxiliary Task: A cross-entropy loss over MLM predictions from masked tokens:

\mathcal{L}_{\mathrm{MLM}} = \sum_{j \in \mathrm{masked}} \mathrm{CE}(\widetilde{H}^{j} W, x^{j})

where W is the BERT projection matrix.

  • Total Loss:

\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{contrastive}} + \lambda \, \mathcal{L}_{\mathrm{MLM}}

with small λ (default 5×10⁻³), since the MLM gradients have larger magnitude.
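The combination of the two objectives above can be sketched in pure Python. This is a minimal illustration (function names are our own; a real implementation would use batched tensor operations and backpropagation), showing the in-batch InfoNCE term and the λ-weighted total loss:

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)


def info_nce(h, h_pos, tau=0.05):
    """In-batch InfoNCE: h[i] is pulled toward h_pos[i] and pushed away
    from every other h_pos[j] in the batch."""
    n = len(h)
    loss = 0.0
    for i in range(n):
        logits = [cosine(h[i], h_pos[j]) / tau for j in range(n)]
        m = max(logits)  # log-sum-exp with max-shift for stability
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += -(logits[i] - log_denom)
    return loss / n


def total_loss(contrastive, mlm, lam=5e-3):
    """InfoCSE total objective: contrastive term plus a down-weighted
    MLM term (small lam because MLM gradients are larger)."""
    return contrastive + lam * mlm
```

With orthogonal batch embeddings whose positives equal the anchors, each anchor's own positive dominates the softmax and the contrastive loss is near zero, as expected.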

3. Training Regimes and Implementation Details

  • Backbone: BERT-base, 12 layers, 768 hidden size, 12 heads.
  • Input Processing: Sentences truncated/padded to 32 tokens; [CLS] or mean-pooling for representation; no additional projection head.
  • Optimization: Adam, learning rate 2×10⁻⁵, batch size 3200.
  • IDC: Partition-based (top-1 neighbor) clustering.
  • Termination: Stop after R@5 on validation set plateaus.
  • Auxiliary MLM Net Pre-training: On BookCorpus+Wikipedia, 8 epochs, Adam (1×10⁻⁴), batch size 1024 (8×V100 GPUs).
  • Contrastive+MLM Joint Training: On 1M Wikipedia sentences, Adam (3×10⁻⁵), batch size 64 (1×3090 GPU), evaluated on the STS-B dev set every 125 steps.
  • Mask Rate: MLM masking optimal at ~40%.
  • Pooling: Both raw [CLS] and BERT pooler variants tested; raw [CLS] marginally superior.
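The ~40% mask-rate detail can be illustrated with a small masking helper. This is a sketch under our own assumptions (the function name and token handling are ours, not the papers'); real BERT-style masking also mixes in random and kept tokens, which is omitted here:

```python
import random


def mask_tokens(tokens, mask_rate=0.4, mask_token="[MASK]", seed=0):
    """Replace a fraction of tokens with [MASK]; return the masked
    sequence and the positions whose originals must be predicted.
    InfoCSE reports ~40% masking as optimal, well above BERT's 15%."""
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    for p in positions:
        masked[p] = mask_token
    return masked, positions
```

The MLM cross-entropy loss is then computed only at the returned positions, against the original tokens.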

4. Empirical Performance and Ablation Studies

| Scenario | InfoCSE R@20 | Best Baseline | Absolute Gain |
|---|---|---|---|
| News (Zero-Shot) | 0.786 | DeCLUTR 0.650 | +0.136 |
| Web Doc (Zero-Shot) | 0.554 | DeCLUTR 0.510 | +0.044 |
| Web Browsing (Zero-Shot) | 0.136 | DeCLUTR 0.126 | +0.010 |
  • Few-Shot (1k labels, R@5):
    • News: 0.648 vs baseline 0.582 (+11%)
    • Web Doc: 0.464 vs 0.437 (+6%)
    • Web Browsing: 0.089 vs 0.080 (+11%)
  • Ablations:
    • IDC (top-1) outperforms k-means by >10 pts R@5.
    • Most gains in iteration 2; iteration 3 yields diminishing returns.
    • Initializing from SimCSE or ICT yields robust convergence.
    • Seeding with SBERT boosts zero-shot by 1–2 pts.
    • InfoCSE++ (pretrain ↔ fine-tune loop) yields additional ~4 pts in few-shot.
| Model | BERT-base | BERT-large |
|---|---|---|
| SimCSE | 76.25 | 78.41 |
| InfoCSE | 78.85 | 80.18 |
| Δ to SimCSE | +2.60 | +1.77 |
  • STS-B (BERT-base): 82.00 (vs SimCSE 76.85, +5.15 pts)
  • Ablations:
    • Pretraining MLM auxiliary crucial (drops from 85.49 to 83.73 STS-B if omitted).
    • Removing MLM degenerates to SimCSE; removing contrastive loss, performance collapses.
    • Detaching gradients through frozen layers is beneficial.
    • Mask rate 40% optimal.
    • Small λ essential; large λ degrades contrastive learning.
    • Raw [CLS] marginally benefits over post-pooler.
    • Co-training with DiffCSE’s replaced-token-detection head yields further improvements.

5. Comparative Analyses and Baseline Context

Both InfoCSE frameworks outperform view-based (SimCSE, ConSERT, Mirror-BERT), context-based (ICT, CPC, DeCLUTR), and neighbor-based (CONPONO) contrastive methods across retrieval and semantic similarity tasks, both in zero-shot and few-shot regimes (Yang et al., 2022, Wu et al., 2022). Mutual reinforcement and information-aggregation strategies address the core weaknesses of prior positive sampling and information sparsity.

6. Limitations and Prospects for Future Research

  • Implementation Overhead: IDC clustering introduces modest computational cost, mitigated by graph sparsification and efficient connected component analysis.
  • Model Architecture: Framework agnostic; extension to alternative backbones (RoBERTa, ELECTRA, DeBERTa), adaptive projection heads, or multilingual/multimodal encoders is feasible.
  • Auxiliary Objectives: Integration of span-prediction, sentence-order prediction, or combinations (as with replaced-token-detection) may further enhance embedding quality.
  • Supervised Fine-Tuning: Exploration of NLI fine-tuning and semi-supervised pretrain↔fine-tune loops (InfoCSE++) shows additional gains in data-scarce settings.
  • Clustering Refinement: Dynamic cluster counts K, graph neural networks, or improved InfoNCE variants (temperature scaling, momentum encoders) are proposed for future work.

7. Synthesis and Impact

InfoCSE frameworks demonstrate that tightly integrated self-supervision cycles (via mutual reinforcement or auxiliary MLM aggregation) enable the learning of richer, more transferable sentence representations. Both mutual reinforcement and information-aggregation approaches produce state-of-the-art performance in unsupervised and semi-supervised benchmarks, affirming the efficacy of model-driven positive mining and information injection mechanisms. The results substantiate the view that careful melding of information aggregation and contrastive objectives is a robust pathway toward high-quality sentence embeddings (Yang et al., 2022, Wu et al., 2022).
