Sigmoid-Based Contrastive Loss

Updated 6 May 2026

Sigmoid-based contrastive loss is a novel loss function that replaces softmax with independent logistic regression to handle multi-label and set-similarity tasks.
Its pairwise logistic formulation with a trainable temperature and bias enables robust multi-modal learning, as demonstrated in applications like ECG classification.
Empirical studies show that careful tuning of temperature and bias leads to improved precision and recall, effectively addressing limitations of traditional contrastive losses.

A sigmoid-based contrastive loss is a loss function for representation learning that generalizes classic softmax-normalized contrastive objectives by replacing the softmax with a pairwise logistic (sigmoid) loss. This construction removes the constraint of assigning exactly one positive match per anchor, enabling direct accommodation of multi-label and set-similarity relationships in contrastive learning pipelines. It has particular utility for multi-label scenarios such as medical ECG classification with co-occurring pathologies, where strict one-to-one positive-negative partitioning fails to capture the problem structure (Takahashi et al., 11 Feb 2026).

1. From Softmax-Based to Sigmoid-Based Contrastive Loss

The traditional CLIP contrastive loss and similar InfoNCE objectives use a softmax and cross-entropy formulation, which is well-suited for one-to-one correspondence tasks (e.g., a single image–caption pair per sample). Specifically, for a batch of $n$ image–text pairs, embeddings $(z_i^{img}, z_j^{txt})$ are compared using the temperature-scaled logits $L_{ij} = (1/\tau)\langle z_i^{img}, z_j^{txt}\rangle$. Every anchor is compared with all candidates, and all off-diagonal entries are penalized as strict negatives via a softmax, yielding:

$L_{\mathrm{CLIP}} = \frac{1}{2n}\left[ \sum_i \mathrm{CE}(\mathrm{softmax}(L_{i\cdot}), i) + \sum_i \mathrm{CE}(\mathrm{softmax}(L_{\cdot i}), i) \right]$

The resulting formulation enforces that only the matched pair is positive, with all other examples treated as negatives. This paradigm is not naturally compatible with multi-label data or scenarios with ambiguous semantic overlap (Zhai et al., 2023, Çağatan, 2024).
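As a concrete reference point, the symmetric softmax objective can be sketched in a few lines of dependency-free Python. This is a minimal illustration under stated assumptions, not code from any cited work: the function names `log_softmax` and `clip_loss` are hypothetical, the embeddings are assumed to be already $\ell_2$-normalized, and the default `tau` is an illustrative value only.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def clip_loss(img, txt, tau=0.07):
    """Symmetric softmax (CLIP-style) loss for n matched pairs (img[i], txt[i]).

    img, txt: lists of L2-normalized embedding vectors (lists of floats).
    tau: temperature; the default is illustrative, not taken from the text.
    """
    n = len(img)
    # logits[i][j] = <img_i, txt_j> / tau
    logits = [[sum(a * b for a, b in zip(img[i], txt[j])) / tau
               for j in range(n)] for i in range(n)]
    # Image-to-text: cross-entropy of row i against target index i.
    i2t = -sum(log_softmax(logits[i])[i] for i in range(n))
    # Text-to-image: cross-entropy of column i against target index i.
    t2i = -sum(log_softmax([logits[r][i] for r in range(n)])[i]
               for i in range(n))
    return (i2t + t2i) / (2 * n)
```

Because each row and column is normalized by a softmax, exactly one target index per anchor can receive probability mass, which is the one-positive-per-anchor constraint described above.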

Sigmoid-based contrastive loss, as introduced in SigLIP and generalizations such as SigCLR, replaces categorical softmax targets with independent logistic regression for each pair:

$L_{\mathrm{sigmoid}} = - \frac{1}{n^2} \sum_{i=1}^n \sum_{j=1}^n \log\sigma(\mathrm{Label}_{ij}\cdot \mathrm{logits}_{ij}),$

where $\sigma(x)=1/(1+e^{-x})$ and the label matrix $\mathrm{Label}_{ij}\in\{+1,-1\}$ encodes positive and negative associations. This decoupled structure allows the model to learn from arbitrary numbers of positives per anchor, with each pair scored independently (Zhai et al., 2023, Lee et al., 2024).
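A minimal sketch of the hard-label pairwise loss follows, using the SigLIP form in which each pair contributes $-\log\sigma(\mathrm{Label}_{ij}\cdot\mathrm{logits}_{ij})$ and the result is averaged over all $n^2$ pairs. The helper names (`log_sigmoid`, `identity_labels`, `sigmoid_pairwise_loss`) are hypothetical, and the nested loops are for readability; a practical implementation would vectorize this in a tensor library.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigma(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def identity_labels(n):
    """+1 on the diagonal (matched pairs), -1 elsewhere."""
    return [[1 if i == j else -1 for j in range(n)] for i in range(n)]


def sigmoid_pairwise_loss(logits, labels):
    """Mean of -log sigma(label_ij * logit_ij) over all n*n pairs.

    logits: n x n list of floats; labels: n x n list of +1 / -1.
    Each pair is scored independently, so a row may hold several +1 entries.
    """
    n = len(logits)
    total = 0.0
    for i in range(n):
        for j in range(n):
            total -= log_sigmoid(labels[i][j] * logits[i][j])
    return total / (n * n)
```

Since no normalization couples the entries of a row, marking additional entries as $+1$ changes only those pairs' terms, which is what allows several positives per anchor.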

2. Mathematical Details and Extensions

Sigmoid-based contrastive loss relies on:

$z_i^{img}, z_j^{txt}$: $\ell_2$-normalized embeddings in $\mathbb{R}^d$
$\mathrm{logits}_{ij} = t\,\langle z_i^{img}, z_j^{txt}\rangle + b$, with scale $t = 1/\tau > 0$ (trainable temperature) and bias $b$ (often trainable)
$\mathrm{Label}_{ij}$: typically $+1$ for matched pairs and $-1$ otherwise, but extendable to soft labels

For multi-label data, the strict identity matrix for positives is replaced by a similarity matrix $S$ encoding pairwise label overlap. For instance, in ECG applications, the Jaccard index $J(Y_i, Y_j)$ can be used:

$S_{ij} = J(Y_i, Y_j) = \frac{|Y_i \cap Y_j|}{|Y_i \cup Y_j|}$

where $Y_i, Y_j$ are sets of labels attached to samples $i$ and $j$. The loss then naturally interpolates between strict negatives and full positives based on partial label overlap, and loss minimization explicitly encourages alignment of representations for pairs that share pathologies or tags (Takahashi et al., 11 Feb 2026).
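The pairwise label-overlap matrix can be illustrated with a short Python sketch. The function names and the example label codes (`"AF"`, `"LBBB"`, `"NORM"`) are hypothetical, and treating two empty label sets as fully similar is an assumption made here, not something the source specifies.

```python
def jaccard(a, b):
    """Jaccard index |a & b| / |a | b| of two label collections."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Assumption: two empty label sets count as identical.
        return 1.0
    return len(a & b) / len(union)


def jaccard_matrix(label_sets):
    """n x n matrix of pairwise Jaccard similarities for a batch."""
    return [[jaccard(a, b) for b in label_sets] for a in label_sets]


# Hypothetical batch of three records with illustrative label codes.
batch = [{"AF", "LBBB"}, {"AF"}, {"NORM"}]
similarity = jaccard_matrix(batch)
# similarity == [[1.0, 0.5, 0.0], [0.5, 1.0, 0.0], [0.0, 0.0, 1.0]]
```

The first two records share one of two distinct labels and therefore receive a partial similarity of 0.5, rather than being treated as a strict negative pair.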

3. Theoretical Properties and Embedding Geometry

The behavior of the sigmoid-based contrastive loss is closely tied to the geometry of the learned embedding space and to the temperature parameter. The optimal embedding under this loss can be characterized using the double-Constant Embedding Model (CCEM). In this framework, all positive pairs share one fixed inner product (alignment), while all negative pairs share another (repulsion). As temperature increases, the optimal configuration approaches a simplex equiangular tight frame (ETF), where embeddings are maximally spread; as it decreases, the embedding collapses toward an antipodal structure (positives are diametrically opposed) (Lee et al., 2024, Bangachev et al., 23 Sep 2025).

Key results include:

For large temperature, the minimum lies in the ETF regime, yielding uniformly separated embeddings.
For small temperature, the minimum lies in the antipodal (collapse) regime.
Intermediate temperature values interpolate between these regimes.

These properties ensure that, with appropriate temperature scaling and bias training, the model can achieve robust alignment of multimodal pairs and scale to large numbers of negatives without global softmax normalization (Bangachev et al., 23 Sep 2025). The theory of $(m, b_{\mathrm{rel}})$-Constellations formalizes zero-loss solutions, providing practical guidance for margin and bias parameterization in large-scale multi-modal contrastive learning.

4. Practical Implementation and Training Considerations

Implementations use the following steps:

Embed and normalize both modalities (e.g., images/text, ECG/written findings).
Compute pairwise logits: $\mathrm{logits}_{ij} = t\,\langle z_i^{img}, z_j^{txt}\rangle + b$, with scale $t = 1/\tau$ and bias $b$.
Construct the label (or similarity) matrix (identity for single-label; Jaccard or another set-similarity for multi-label tasks).
Compute loss as the (mean) negative log likelihood under the sigmoid, weighted by the (possibly soft) label matrix.

Pseudocode for the Jaccard-weighted sigmoid loss (for multi-label ECG classification) is given in (Takahashi et al., 11 Feb 2026).
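The steps above can be sketched end to end in dependency-free Python. This is an illustrative reading, not the pseudocode of the cited work: it assumes the soft label $s_{ij}$ (here the Jaccard index of the two label sets) enters as a binary cross-entropy target, so each pair contributes $-[s_{ij}\log\sigma(\ell_{ij}) + (1-s_{ij})\log\sigma(-\ell_{ij})]$. All function and parameter names are hypothetical, and the defaults `t=10.0`, `b=-10.0` mirror the initial values reported for SigLIP (Zhai et al., 2023).

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigma(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def jaccard(a, b):
    """Jaccard index of two label sets; two empty sets count as identical (assumption)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def jaccard_sigmoid_loss(ecg_emb, txt_emb, label_sets, t=10.0, b=-10.0):
    """Soft-label sigmoid loss, averaged over all n*n pairs.

    ecg_emb, txt_emb: lists of n embedding vectors (one per modality).
    label_sets: list of n label sets, one per sample.
    t, b: logit scale and bias (trainable in practice, fixed here).
    """
    n = len(ecg_emb)
    ecg = [l2_normalize(v) for v in ecg_emb]   # step 1: embed and normalize
    txt = [l2_normalize(v) for v in txt_emb]
    total = 0.0
    for i in range(n):
        for j in range(n):
            # Step 2: pairwise logit.
            logit = t * sum(x * y for x, y in zip(ecg[i], txt[j])) + b
            # Step 3: soft target in [0, 1] from label overlap.
            s = jaccard(label_sets[i], label_sets[j])
            # Step 4: negative log likelihood under the sigmoid.
            total -= s * log_sigmoid(logit) + (1.0 - s) * log_sigmoid(-logit)
    return total / (n * n)
```

With disjoint label sets the targets reduce to 1 on the diagonal and 0 elsewhere, so this sketch recovers the hard-label sigmoid loss as a special case.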

Hyperparameters include the Adam optimizer settings, the temperature initialization, an optional bias and its initial value, and the batch size, typically 64 or 128. Training schedule, embedding dimensionality, and augmentation (e.g., random waveform cropping for ECGs) have significant effects on downstream performance. A learnable bias is critical to offset the imbalance between the number of negatives and positives (Zhai et al., 2023, Çağatan, 2024).
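The role of the bias can be made concrete with a small calculation. The sketch below assumes one positive per anchor (so $n$ positives and $n^2 - n$ negatives per batch) and untrained embeddings whose inner products are near zero, so that every logit is approximately equal to the bias $b$. Under those assumptions it reports the fraction of the total gradient magnitude with respect to the logits that comes from negative pairs. The function name is hypothetical; $b = -10$ is the initial value reported for SigLIP (Zhai et al., 2023).

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def negative_gradient_share(n, b):
    """Share of total |dLoss/dlogit| due to negatives when all logits equal b.

    Assumes n positives (diagonal) and n*n - n negatives (off-diagonal).
    """
    n_pos = n
    n_neg = n * n - n
    # |d/dl (-log sigma(l))|  = 1 - sigma(l)  for a positive pair
    # |d/dl (-log sigma(-l))| = sigma(l)      for a negative pair
    pos_grad = n_pos * (1.0 - sigmoid(b))
    neg_grad = n_neg * sigmoid(b)
    return neg_grad / (pos_grad + neg_grad)


share_no_bias = negative_gradient_share(64, 0.0)      # (n - 1) / n at b = 0
share_with_bias = negative_gradient_share(64, -10.0)  # well below 1%
```

With a zero bias, negatives account for $(n-1)/n$ of the initial gradient signal, whereas a strongly negative initial bias lets the few positive pairs dominate early updates.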

5. Empirical Results and Comparative Analysis

Sigmoid-based contrastive loss (including variants such as SigLIP and SigCLR) displays competitive or superior performance across a range of settings:

In multi-label ECG classification, the Jaccard-weighted extension improves F1 and Jaccard index over the baseline SigLIP, trading lower micro-averaged precision for much higher recall, and substantially outperforms a pure ResNet-1D multilabel classifier (Takahashi et al., 11 Feb 2026).
In visual SSL settings, SigCLR matches or exceeds SimCLR, especially at smaller batch sizes, demonstrating high data efficiency and batch size decoupling (Çağatan, 2024).
Learnable bias prevents negative dominance, which otherwise collapses training at higher temperature settings.
For multi-modal tasks, training both temperature and relative bias unlocks the full family of nearly zero-loss solutions, facilitating robust, margin-rich retrieval and minimizing modality gap artifacts (Bangachev et al., 23 Sep 2025).

Performance table (excerpted from Takahashi et al., 11 Feb 2026):

Metric	Standard SigLIP	+Jaccard loss
Hamming Loss	0.0665	0.0451
Precision (μ)	0.5067	0.3147
Recall (μ)	0.0365	0.3020
F1 Score (μ)	0.0681	0.3082
Jaccard Index	0.0373	0.0858
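For readers who want to reproduce metrics of this kind, the following sketch computes micro-averaged precision, recall, F1, and Hamming loss from multi-hot label matrices using the standard definitions. The function names are hypothetical and the evaluation protocol of the cited work is not specified here. As a sanity check, the F1 values in the table are consistent with the reported precision and recall under $F_1 = 2PR/(P+R)$.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    total = precision + recall
    return 2.0 * precision * recall / total if total else 0.0


def micro_metrics(y_true, y_pred):
    """Micro-averaged metrics for multi-label predictions.

    y_true, y_pred: lists of equal-length 0/1 lists (samples x labels).
    """
    tp = fp = fn = tn = 0
    for t_row, p_row in zip(y_true, y_pred):
        for t, p in zip(t_row, p_row):
            if t and p:
                tp += 1
            elif p:
                fp += 1
            elif t:
                fn += 1
            else:
                tn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1_score(precision, recall),
        # Fraction of individual label decisions that are wrong.
        "hamming_loss": (fp + fn) / (tp + fp + fn + tn),
    }


# Consistency check against the table rows (values copied from the table).
f1_baseline = f1_score(0.5067, 0.0365)
f1_jaccard = f1_score(0.3147, 0.3020)
```

The check shows why the baseline F1 is so low despite its higher precision: with recall near 0.04, the harmonic mean is dominated by the missed positives.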

6. Broader Applicability and Theoretical Insights

Sigmoid-based contrastive loss generalizes InfoNCE to arbitrary numbers of positive associations per example and can support nuanced soft similarity weighting based on any available label- or ontology-driven metric (e.g., cosine similarity, hierarchical ontologies). This flexibility makes it suitable for clinical data with co-morbidities, images with multiple tags, or any context where positive/negative dichotomies are insufficient (Takahashi et al., 11 Feb 2026).

Key theoretical properties also inform best practices:

Batch construction influences embedding collapse and multi-label utilization.
Temperature warmup and careful bias initialization are essential for stable optimization.
Out-of-domain robustness to data drift has been demonstrated, with only minor F1 drop on distribution shift for ECGs (Takahashi et al., 11 Feb 2026).

The combinatorial geometry of $(m, b_{\mathrm{rel}})$-Constellations enables explicit control over the margin and bias needed to ensure uniqueness and robustness of retrieval, especially when synchronizing multiple modalities or pre-trained encoders (Bangachev et al., 23 Sep 2025).

7. Limitations and Open Directions

While sigmoid-based contrastive objectives offer substantial practical and theoretical advantages, certain limitations remain:

Sensitivity to temperature and bias scaling requires careful tuning or implicit schedule design.
In extremely large-scale settings, the behavior with fixed temperature has yet to be fully established (e.g., scaling to ImageNet-1k) (Çağatan, 2024).
In low-data regimes or high imbalance cases (many more negatives than positives), insufficient biasing can hinder positive learning.
Complex set-similarity or ontology-driven label matrices introduce additional hyperparameters and computational costs in the pairwise label computation.

In summary, sigmoid-based contrastive loss (in both hard- and soft-label formulations) supplies a conceptually simple and computationally efficient foundation for multi-label, multi-modal, and scalable contrastive learning. Its label matrix parameterization, decoupling from batch-size constraints, and compatibility with set-based similarity metrics distinguish it from classic softmax-based InfoNCE frameworks and enable it to address emerging challenges in clinical and general representation learning (Takahashi et al., 11 Feb 2026, Lee et al., 2024, Bangachev et al., 23 Sep 2025, Çağatan, 2024, Zhai et al., 2023).