
Batch-Softmax Contrastive Loss (BSC)

Updated 12 April 2026
  • Batch-Softmax Contrastive Loss (BSC) is a loss formulation that optimizes transformer-based sentence embeddings by using in-batch negatives and softmax normalization.
  • It supports both symmetric and asymmetric variants, leading to empirical improvements across ranking, classification, and regression tasks.
  • Key factors such as batch composition, temperature tuning, and embedding normalization are critical for achieving robust performance in pairwise sentence scoring.

Batch-Softmax Contrastive Loss (BSC) is a loss formulation designed to optimize fine-tuned transformer-based sentence embedding models for pairwise sentence scoring tasks, including ranking, classification, and regression. BSC systematically exploits in-batch negatives via a softmax over similarity scores between query–answer pairs, and it supports both symmetric (bi-directional) and asymmetric variants. BSC demonstrates sizable empirical improvements over standard pointwise and triplet-based losses across diverse datasets when combined with appropriate data shuffling, normalization, and temperature tuning (Chernyavskiy et al., 2021).

1. Formal Definition

Let a batch consist of $M$ positive pairs $X = \{ (q_i, a_i) \}_{i=1}^M$ with $q_i, a_i \in \mathbb{R}^d$ representing embeddings for queries and answers, respectively. Typically, these embeddings are produced by either distinct or shared-weight encoders. The core similarity function is the (optionally normalized) dot product, $s_{ij} = q_i^\top a_j$, which equals cosine similarity if the vectors are L2-normalized.

Let $\tau > 0$ denote the temperature parameter that modulates the softness of the softmax.

Define two complementary softmax-based losses:

  • $L_0$: Contrasts each query against all batch answers (matrix row-wise).
  • $L_1$: Contrasts each answer against all batch queries (matrix column-wise).

Explicitly:

$$L_0 = -\frac{1}{M}\sum_{i=1}^M \log \frac{\exp(s_{ii}/\tau)}{\sum_{j=1}^M \exp(s_{ij}/\tau)}$$

$$L_1 = -\frac{1}{M}\sum_{i=1}^M \log \frac{\exp(s_{ii}/\tau)}{\sum_{j=1}^M \exp(s_{ji}/\tau)}$$

The total Batch-Softmax Contrastive loss is

$$L_{\mathrm{BSC}}(X) = L_0(X) + L_1(X)$$

Equivalently: $L_{\mathrm{BSC}}(X) = L_0(X) + L_0(X')$ with $X'$ arising by swapping $q_i$ and $a_i$.

The diagonal entries $s_{ii}$ serve as positives; off-diagonal entries serve as negatives for each query or answer. Lower $\tau$ increases the loss's sensitivity to hard negatives.
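The definition above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function names `log_softmax` and `bsc_loss`, the default temperature, and the `symmetric` flag are assumptions made for the example, and a real training setup would use an autodiff framework instead of plain NumPy.

```python
import numpy as np

def log_softmax(scores, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def bsc_loss(q, a, tau=0.1, symmetric=True):
    """Batch-softmax contrastive loss for M paired embeddings q[i] <-> a[i].

    q, a: arrays of shape (M, d). Names and defaults are illustrative.
    """
    s = q @ a.T / tau                                # s[i, j] = q_i . a_j / tau
    # L_0: each query i against all answers j (softmax over row i).
    l0 = -np.mean(np.diag(log_softmax(s, axis=1)))
    if not symmetric:
        return l0                                    # asymmetric variant
    # L_1: each answer i against all queries j (softmax over column i).
    l1 = -np.mean(np.diag(log_softmax(s, axis=0)))
    return l0 + l1

# Usage: random unit-norm embeddings for a batch of 4 pairs.
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
a = rng.normal(size=(4, 8))
q /= np.linalg.norm(q, axis=1, keepdims=True)
a /= np.linalg.norm(a, axis=1, keepdims=True)
print(bsc_loss(q, a, tau=0.1))
```

Two sanity checks follow directly from the formulas: with all-zero embeddings every softmax is uniform, so each term equals $\log M$; and swapping `q` and `a` transposes the score matrix, which exchanges $L_0$ and $L_1$ and leaves the symmetric loss unchanged.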

2. Core Variants of BSC

Several key variations refine BSC to accommodate data structure and task requirements:

  • Symmetrization: Both $L_0$ and $L_1$ are included for bi-directional alignment. Omitting $L_1$ yields an asymmetric variant.
  • Labeled Negatives (Supervised Contrastive Loss): For batches containing non-paired (label 0) and paired (label 1) examples, the summation and averaging restrict to true positives, with $y_i \in \{0, 1\}$ denoting the label of pair $i$:

$$L_0 = -\frac{1}{|\{i : y_i = 1\}|}\sum_{i \,:\, y_i = 1} \log \frac{\exp(s_{ii}/\tau)}{\sum_{j=1}^M \exp(s_{ij}/\tau)}$$

An analogous modification applies to $L_1$.

  • Combo Loss (Pairwise + Pointwise): BSC is linearly combined with pointwise MSE or classification loss:

$$L_{\mathrm{combo}} = \lambda L_{\mathrm{BSC}} + (1 - \lambda) L_{\mathrm{MSE}}$$

with $\lambda \in [0, 1]$ and $L_{\mathrm{MSE}} = \frac{1}{M}\sum_{i=1}^M (s_{ii} - y_i)^2$ for regression targets $y_i$.

  • Embedding Normalization: Embeddings may be L2-normalized (unit sphere: $s_{ij}$ as cosine similarity), normalized per-batch-axis, or with per-dimension min–max scaling as additional regularization (Chernyavskiy et al., 2021).
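The labeled-negatives and combo variants can be illustrated together with a short NumPy sketch. Several details here are assumptions made for the example rather than facts from the source: negative-labeled pairs are kept in the softmax denominators and only dropped from the sum and the average, the combo weight `lam` multiplies the BSC term while `1 - lam` multiplies the MSE term, and all function names are hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Project each row onto the unit sphere so dot products are cosines."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def supervised_bsc_loss(q, a, labels, tau=0.1):
    """Symmetric BSC where only pairs with label 1 count as positives.

    Assumption: label-0 pairs still appear in the denominators as negatives.
    """
    s = l2_normalize(q) @ l2_normalize(a).T / tau
    pos = np.asarray(labels) == 1
    if not pos.any():
        return 0.0                                   # no positives in this batch

    def one_direction(scores):
        # Stable log-sum-exp over each row, then log-probability of the diagonal.
        m = scores.max(axis=1, keepdims=True)
        lse = m[:, 0] + np.log(np.exp(scores - m).sum(axis=1))
        return -np.mean((np.diag(scores) - lse)[pos])

    # Rows of s give L_0; rows of s.T (columns of s) give L_1.
    return one_direction(s) + one_direction(s.T)

def combo_loss(bsc_value, predictions, targets, lam=0.9):
    """Weighted sum of a BSC value and a pointwise MSE term (assumed weighting)."""
    mse = np.mean((np.asarray(predictions) - np.asarray(targets)) ** 2)
    return lam * bsc_value + (1.0 - lam) * mse

# Usage: three pairs, the last one labeled as a non-match.
q = np.eye(3)
a = np.eye(3)
bsc = supervised_bsc_loss(q, a, labels=[1, 1, 0], tau=1.0)
print(combo_loss(bsc, predictions=[1.0, 1.0, 0.2], targets=[1.0, 1.0, 0.0]))
```

Keeping label-0 pairs in the denominators means they still act as in-batch negatives for the true positives, which is one plausible reading of "the summation and averaging restrict to true positives".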

3. Training Procedure: Batch Construction and Shuffling

Batch composition is critical, particularly in NLP, where batches are small (typically 30–50 examples) and a randomly composed batch therefore rarely contains hard negatives. BSC benefits from carefully structured batching strategies:

  • Example-Based Shuffling: Each epoch, batches are constructed via k-NN grouping over the dataset, ensuring that each batch contains examples that are near neighbors on one side of the pair (e.g., queries). The k-NN search uses FAISS in a two-stage process, recomputing embeddings with the current model at each epoch so that the hard negatives adapt as training progresses.
  • Algorithmic Overview:
    • Shuffle the dataset randomly.
    • For each unused example, select the top $k-1$ nearest neighbors (by embedding similarity), forming a group of size $k$.
    • Continue until all examples are assigned.
    • Reverse the constructed list to start with simpler groups.
  • Fast Shuffling Alternatives: For scalability with large datasets, cluster-based or “shingle-based” (hashing by random $n$-word sequences) grouping substitutes k-NN, enabling efficient group assignment and subsequent batching.
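The greedy grouping steps listed above can be sketched as follows. This is an illustration under stated assumptions, not the procedure's reference code: it uses a brute-force cosine search in NumPy in place of the FAISS two-stage search, takes a group to be a seed example plus its nearest unused neighbors, and the function name `knn_grouped_order` and its parameters are hypothetical.

```python
import numpy as np

def knn_grouped_order(emb, group_size, seed=0):
    """Greedily partition examples into groups of mutually similar items.

    emb: (N, d) embeddings of one side of each pair (e.g., the queries).
    Returns a list of index groups; consecutive groups can then be
    concatenated to fill training batches.
    """
    rng = np.random.default_rng(seed)
    n = len(emb)
    unit = emb / np.maximum(np.linalg.norm(emb, axis=1, keepdims=True), 1e-12)
    unused = np.ones(n, dtype=bool)
    groups = []
    for i in rng.permutation(n):                     # random visiting order
        if not unused[i]:
            continue
        unused[i] = False
        candidates = np.flatnonzero(unused)
        sims = unit[candidates] @ unit[i]            # cosine similarity to the seed
        nearest = candidates[np.argsort(-sims)[: group_size - 1]]
        unused[nearest] = False
        groups.append([int(i), *nearest.tolist()])
    # Late groups are built from leftovers and tend to be less similar,
    # so reversing puts those simpler groups first.
    groups.reverse()
    return groups

# Usage: 10 random examples, groups of up to 4.
emb = np.random.default_rng(1).normal(size=(10, 16))
for group in knn_grouped_order(emb, group_size=4):
    print(group)
```

The brute-force search costs O(N) per seed and is only meant to show the control flow; at dataset scale an approximate index such as FAISS, or the cluster- and shingle-based alternatives, would replace the inner similarity computation.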

4. Hyperparameter Choices and Ablation Study

Key hyperparameters and empirical ablations strongly influence BSC effectiveness:

  • Batch size: 30 (most tasks), 50 (Antique, QQP)
  • Learning rates: typically 2e-5 or 3e-5
  • Optimizer: AdamW, with bias correction on CQA
  • Warm-up: 10% total steps
  • Sequence length: 90 subtokens
  • Epochs: 5–7 depending on dataset
  • Temperature $\tau$: 0.05–0.1 for L2 normalization, 1.0–3.0 for coordinate normalization. Typical values: 0.055 (CQA-A), 0.07 (CQA-B), 0.1 (others), 1.2 (PFCC-S & QQP).
  • Example group size ($k$): 4 (MRPC), 5 (CQA-B), 8 (others)
  • Combo weight ($\lambda$): 0.9 (ranking), 0.1 (classification, regression)
  • Seed selection: 5 runs per configuration, selecting best dev seed

Ablations reveal:

  • Removing diagonal positives drops MRR by ≈0.02 (Antique)
  • Shuffling policy can alter MAP/MRR by up to 6 pp on CQA
  • Omitting normalization reduces PFCC-S HP@1 from .673 to .608
  • Improper $\tau$ may degrade performance by >10%
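A small numerical example helps explain why the temperature has to match the normalization. With L2-normalized embeddings the scores are cosines in $[-1, 1]$, so a temperature near 1 leaves the softmax almost flat, while a small temperature concentrates the probability mass on the positive and its hardest negative. The similarity values below are hypothetical and chosen only for illustration.

```python
import numpy as np

def positive_probability(sims, tau):
    """Softmax probability assigned to entry 0 (the positive) at temperature tau."""
    z = np.asarray(sims, dtype=float) / tau
    z = z - z.max()                                  # stabilize the exponentials
    p = np.exp(z) / np.exp(z).sum()
    return float(p[0])

# Hypothetical cosine scores: positive, one hard negative, two easy negatives.
sims = [0.80, 0.75, 0.20, 0.10]
for tau in (0.05, 0.1, 1.0):
    print(f"tau={tau}: p(positive)={positive_probability(sims, tau):.3f}")
```

At `tau=1.0` the easy negatives receive nearly as much mass as the positive, so the gradient is spread thinly over all of them; at `tau=0.05` the easy negatives contribute almost nothing and the loss is dominated by the hard negative at 0.75.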

5. Empirical Performance on Sentence Scoring Tasks

A variety of pairwise sentence tasks demonstrate BSC's comparative strengths. The following summarizes key metrics:

| Task | Baseline (MSE) | BSC Only | BSC+MSE Combo | Best Prior |
|---|---|---|---|---|
| Antique (MRR) | 0.781 | 0.804 | 0.822 | 0.797 (Triplet) |
| CQA-A (MAP) | 0.869 | 0.801 | 0.872 | 0.884 (Supplied) |
| CQA-B (MAP) | 0.471 | 0.495 | 0.496 | 0.475 (Triplet) |
| PFCC-S (HP@1) | 0.362 | 0.673 | n/a | 0.668 (Triplet) |
| MRPC (F1) | 89.08 | 86.73 | 89.46 | 87.89–88.55 (SBERT) |
| QQP (F1) | 74.29 | 73.13 | 75.07 | 74.97–79.77 (SBERT) |
| STS-b (ρ×100) | 84.80 | 83.26 | 84.59 | 85.71 (BSC→MSE) |

For ranking tasks (Antique, CQA-B, PFCC-S), BSC or BSC+MSE achieves or exceeds state-of-the-art results, especially in HP@1 for PFCC-S. On regression (STS-b), fine-tuning order impacts maximum attainable correlation. For MRPC and QQP, BSC+MSE matches or slightly outperforms augmented SBERT MSE baselines.
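For readers less familiar with the ranking metrics reported here, the sketch below computes the per-query quantities behind MRR and MAP using their standard definitions. The source does not specify the evaluation scripts, so the function names and the toy data are assumptions for illustration only.

```python
import numpy as np

def reciprocal_rank(scores, relevant):
    """1 / rank of the first relevant candidate, or 0 if none is relevant."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    ranked = np.asarray(relevant, dtype=bool)[order]
    hits = np.flatnonzero(ranked)
    return 0.0 if hits.size == 0 else 1.0 / (hits[0] + 1)

def average_precision(scores, relevant):
    """Mean of precision@k taken at the ranks of the relevant candidates."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    ranked = np.asarray(relevant, dtype=bool)[order]
    if not ranked.any():
        return 0.0
    precision_at_k = np.cumsum(ranked) / np.arange(1, len(ranked) + 1)
    return float(precision_at_k[ranked].mean())

# Toy query: three candidates, the second and third are relevant.
scores = [0.9, 0.8, 0.1]
relevant = [0, 1, 1]
print(reciprocal_rank(scores, relevant), average_precision(scores, relevant))
```

MRR and MAP are the means of these two quantities over all queries in the evaluation set.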

6. Analytical Insights and Recommendations

Empirical and analytical examination produces several practical recommendations:

  • Combo Training: BSC combined with MSE (or cross-entropy) typically attains optimal or near-optimal results, most notably in ranking tasks.
  • Batch Construction: Both composition and ordering are crucial; example-based or cluster shuffling outperforms random shuffling by several MAP/MRR points.
  • Labeled Negatives: Incorporation of supervised negatives (when available) offers a 1–2 pp absolute improvement.
  • Normalization: Embedding normalization (at least L2) is vital for performance; coordinate normalization may be preferred for fact-checking.
  • Temperature: $\tau$ should be jointly tuned with normalization; values misaligned with normalization can degrade performance by >10%.
  • Robustness: Failing to employ appropriate negatives, batch construction, temperature, or normalization can impair performance by double-digit percentage points.
  • Operational Guidance: For ranking, BSC by itself often surpasses pointwise objectives; for classification or regression, BSC is effective as pre-training or in conjunction with pointwise loss.

BSC loss leverages in-batch negatives via softmax normalization, functioning as implicit data augmentation and obviating the need for cross-encoder labeling, thus offering a unified and robust framework for sentence embedding fine-tuning in diverse pairwise scoring settings (Chernyavskiy et al., 2021).
