Sigmoid Cross-Entropy Loss

Updated 31 March 2026

Sigmoid cross-entropy loss is a binary logistic loss that, in contrastive learning, classifies each pair of examples independently as positive or negative.
Combined with a fixed temperature and a learnable bias, it trains stably and performs well, especially in small-batch regimes.
It is simpler to implement than softmax-based objectives and reaches competitive accuracy on vision benchmarks.

The sigmoid cross-entropy loss, often termed logistic loss, is a key objective in modern contrastive representation learning. Unlike softmax-based losses, which normalize globally over all negatives, it treats each pair of examples as an independent binary classification problem. As instantiated in recent frameworks such as SigCLR, this approach achieves performance competitive with or superior to InfoNCE-based methods, especially in regimes constrained by batch size or distributed memory, and it introduces specific architectural and optimization considerations (Çağatan, 2024).

1. Mathematical Formulation

Let $\mathcal{B}$ denote a minibatch of $N$ samples, each augmented independently twice to produce $2N$ views. Embedding these via a shared encoder and projector yields vectors $z_1, \ldots, z_{2N} \in \mathbb{R}^d$. Pairwise similarity is computed via dot product or cosine similarity: $\operatorname{sim}(z_i, z_j) = z_i \cdot z_j$ (with $z_i$ and $z_j$ L2-normalized when cosine similarity is used). Each pair receives a binary label: $z_{ij}=+1$ if $(i,j)$ forms a positive pair (two views of the same image), $z_{ij}=-1$ otherwise; trivial diagonal pairs ($i=j$) are masked out. Note that the two-index symbol $z_{ij}$ denotes a label and is distinct from the single-index embedding $z_i$.

The loss for each pair $(i,j)$ is defined by introducing a temperature $t > 0$ and a learnable bias $b$ :

$a_{ij} = t \cdot \operatorname{sim}(z_i, z_j) + b$

The logistic loss for the pair is then:

$L_{ij} = \log\left(1 + \exp(-z_{ij} \cdot a_{ij})\right) = -\log \sigma(z_{ij} \cdot a_{ij})$

where $\sigma(u)=1/(1+\exp(-u))$ denotes the sigmoid function. The batch-averaged loss over all valid, off-diagonal pairs is:

$\mathcal{L} = \frac{1}{\sum_{i,j} k_{ij}} \sum_{i=1}^{2N} \sum_{j=1}^{2N} k_{ij} \log\left(1 + \exp(-z_{ij}(t z_i \cdot z_j + b))\right)$

where $k_{ij}=0$ when $i=j$ (diagonal), $k_{ij}=1$ otherwise.
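As a minimal sketch of the definition above, the following NumPy code transcribes the double sum literally. The names `pair_loss` and `batch_loss`, the toy embeddings, and the default values $t=5$, $b=-10$ are illustrative choices, not SigCLR's actual code; `np.logaddexp(0, x)` is used as a numerically stable way to evaluate $\log(1+\exp(x))$.

```python
import numpy as np

def pair_loss(sim, label, t=5.0, b=-10.0):
    """Logistic loss for one pair: log(1 + exp(-label * (t * sim + b)))."""
    a = t * sim + b
    # logaddexp(0, x) = log(1 + exp(x)), evaluated without overflow
    return np.logaddexp(0.0, -label * a)

def batch_loss(Z, labels, t=5.0, b=-10.0):
    """Literal double sum over off-diagonal pairs, normalized by the pair count."""
    n = Z.shape[0]
    total, count = 0.0, 0
    for i in range(n):
        for j in range(n):
            if i == j:              # k_ij = 0 on the diagonal
                continue
            total += pair_loss(Z[i] @ Z[j], labels[i, j], t, b)
            count += 1
    return total / count

# Two images, two views each; views 0/2 and 1/3 are twins (assumed layout).
Z = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
labels = -np.ones((4, 4))
labels[0, 2] = labels[2, 0] = labels[1, 3] = labels[3, 1] = 1.0
loss = batch_loss(Z, labels)
```

The explicit loops are only meant to mirror the formula term by term; a vectorized version is preferable in practice.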

2. Temperature and Bias: Roles and Optimization

Temperature $t$ modulates the steepness of the sigmoid, akin to the $1/\tau$ scaling of logits in softmax-based InfoNCE. The learnable bias $b$ is initialized to a large negative value (e.g., $-10$), so that at initialization nearly every pair is predicted as negative. Because negative pairs vastly outnumber positive ones ($\#\text{neg} \gg \#\text{pos}$), most pairwise logits then start with near-zero loss and gradient, which neutralizes the initial class imbalance and leaves the positive pairs to drive the first updates. As training proceeds, $b$ is updated via backpropagation.
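The effect of the bias initialization can be checked numerically. The sketch below uses the derivative of the per-pair loss with respect to the logit, $\partial L_{ij}/\partial a_{ij} = -z_{ij}\,\sigma(-z_{ij} a_{ij})$, and compares $b=0$ with $b=-10$ for a pair with zero similarity. The helper names and the choice $\operatorname{sim}=0$, $t=5$ are assumptions made for illustration.

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def grad_wrt_logit(label, a):
    """d/da of log(1 + exp(-label * a)), which equals -label * sigmoid(-label * a)."""
    return -label * sigmoid(-label * a)

t, sim = 5.0, 0.0                      # an uncorrelated pair at initialization (assumed)
for b in (0.0, -10.0):
    a = t * sim + b
    g_neg = grad_wrt_logit(-1, a)      # gradient from one negative pair
    g_pos = grad_wrt_logit(+1, a)      # gradient from one positive pair
    print(f"b={b:6.1f}  negative: {g_neg:.2e}  positive: {g_pos:.2e}")
```

With $b=0$ each negative pair contributes a gradient of magnitude 0.5, the same as a positive pair, so the far more numerous negatives dominate. With $b=-10$ the per-negative gradient drops to roughly $4.5 \times 10^{-5}$ while the per-positive gradient stays close to 1 in magnitude.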

Empirical ablations on CIFAR-10 demonstrate that both a fixed temperature and a learnable bias are essential for stable and high-performing training. Allowing the temperature to be freely learnable degrades accuracy, and omitting the bias leads to catastrophic collapse, especially at higher temperature values (Çağatan, 2024).

3. Comparison with Softmax/InfoNCE Objectives

Softmax-based (InfoNCE) losses compute, for each anchor $i$, a global normalization over all other samples via the partition sum $\sum_{k\ne i} \exp(\operatorname{sim}(z_i, z_k)/\tau)$. This coupling requires large batch sizes so that the denominator contains sufficiently diverse negatives, and it requires global operations across the batch in the implementation.

In contrast, the sigmoid cross-entropy loss decomposes into independent binary classification subproblems for each pair: no partition function, no global trade-off between positives and a joint negative pool. Advantages include:

Simplicity: Code is more straightforward; no need to compute the global normalization.
Small-Batch Robustness: Each negative pair provides a learning signal even in small batches.
Gradient Stability: The learnable bias compensates for the skewed negative-to-positive ratio, so the many negative pairs do not swamp the gradient signal early in training.
Empirical Performance: As shown in SigLIP for language–image and SigCLR for vision, sigmoid losses with learnable bias can match or outperform softmax for contrastive objectives (Çağatan, 2024).
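The difference in coupling can be illustrated with a toy comparison. In the sketch below, `info_nce_anchor` and `sigmoid_pair` are hypothetical helpers, and the similarity values, $\tau=0.5$, $t=5$, and $b=-10$ are arbitrary illustrative choices. The softmax term for a given positive changes whenever negatives are added to the batch, whereas the sigmoid term for that same pair depends on the pair alone.

```python
import numpy as np

def info_nce_anchor(sim_pos, sim_negs, tau=0.5):
    """Softmax loss for one anchor: needs the partition sum over all candidates."""
    logits = np.concatenate(([sim_pos], sim_negs)) / tau
    return np.logaddexp.reduce(logits) - logits[0]

def sigmoid_pair(sim, label, t=5.0, b=-10.0):
    """Sigmoid loss for one pair: depends on that pair alone."""
    return np.logaddexp(0.0, -label * (t * sim + b))

sim_pos = 0.8
few_negs = np.array([0.1, -0.2])
many_negs = np.array([0.1, -0.2, 0.3, 0.0, 0.5])

# The softmax term for the positive changes when negatives are added ...
nce_few = info_nce_anchor(sim_pos, few_negs)
nce_many = info_nce_anchor(sim_pos, many_negs)
# ... while the sigmoid term for the same positive pair is unaffected.
sig_pos = sigmoid_pair(sim_pos, +1)
```

In the sigmoid formulation, extra negatives only add further independent terms to the sum; they never alter the terms that are already there.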

4. Practical Implementation and Computational Considerations

Data Construction and Sampling

Each training iteration forms $2N$ views via independent augmentations of $N$ images. After encoding, embeddings $z_1,\ldots,z_{2N}$ are pair-indexed as follows: for each view $i$ , its positive $j$ is the twin augmentation of the same image ( $z_{ij}=+1$ ); all other off-diagonal pairs receive $z_{ij}=-1$ .
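One way to build the pair labels is to carry an image id for every view and compare ids pairwise. The sketch below assumes the views are stacked so that view $i$ and view $i+N$ come from the same image; this layout and the helper name `pair_labels` are assumptions for illustration, not something the source specifies.

```python
import numpy as np

def pair_labels(image_ids):
    """Return (+1/-1 label matrix, off-diagonal mask) for a list of view -> image ids."""
    ids = np.asarray(image_ids)
    same_image = ids[:, None] == ids[None, :]
    labels = np.where(same_image, 1, -1)
    off_diag = ~np.eye(len(ids), dtype=bool)
    return labels, off_diag

N = 4
image_ids = np.concatenate([np.arange(N), np.arange(N)])   # view i and view i+N are twins
labels, off_diag = pair_labels(image_ids)

n_pos = int(((labels == 1) & off_diag).sum())    # ordered positive pairs
n_neg = int(((labels == -1) & off_diag).sum())   # ordered negative pairs
```

With $2N$ views there are $2N$ ordered positive pairs and $2N(2N-2)$ ordered negative pairs, so negatives outnumber positives by a factor of $2N-2$.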

Efficient Computation

Vectorized implementation employs two $2N \times 2N$ masks:

sim_mask: Pair labels $\in \{+1, -1\}$
loss_mask: Off-diagonal indicator $\in\{1, 0\}$

For all pairs, compute:

$S = t \cdot \operatorname{cos\_sim}(Z, Z) + b$ (scaled+shifted similarities)
$L_{ij} = -\log \sigma(\text{sim\_mask} \times S)$ (element-wise)
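$\mathcal{L} = \text{sum}(L_{ij} \cdot \text{loss\_mask}) / \text{sum}(\text{loss\_mask})$ (average over off-diagonal pairs only, matching the normalization in Section 1)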
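The three steps can be sketched in NumPy as follows. This is a forward-pass illustration under stated assumptions, not the SigCLR implementation: the function name `sigclr_style_loss`, the random stand-in embeddings, and the layout in which views $i$ and $i+N$ are twins are all hypothetical. The average is taken over off-diagonal entries only, so it agrees with the normalization by $\sum_{i,j} k_{ij}$ in Section 1.

```python
import numpy as np

def sigclr_style_loss(Z, sim_mask, loss_mask, t=5.0, b=-10.0):
    """Masked sigmoid loss over all pairs of rows of Z (hypothetical helper)."""
    Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)   # unit rows -> cosine similarity
    S = t * (Zn @ Zn.T) + b                              # scaled and shifted similarities
    L = np.logaddexp(0.0, -sim_mask * S)                 # -log sigmoid(sim_mask * S)
    return (L * loss_mask).sum() / loss_mask.sum()       # average over off-diagonal pairs

rng = np.random.default_rng(0)
N, d = 4, 16
Z = rng.normal(size=(2 * N, d))                          # stand-in for encoder outputs

idx = np.arange(2 * N)
twin = (idx + N) % (2 * N)                               # assumed layout: i and i+N are twins
sim_mask = -np.ones((2 * N, 2 * N))
sim_mask[idx, twin] = 1.0
loss_mask = 1.0 - np.eye(2 * N)

loss = sigclr_style_loss(Z, sim_mask, loss_mask)
```

In a training setup, `t` would stay a constant and `b` would be a trainable scalar updated by the optimizer alongside the encoder weights.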

Multi-Device Training

SigCLR supports distributed training in which each of $D$ devices holds a local batch of $m=2N/D$ views (written $m$ here to avoid confusion with the bias $b$) and needs only local computations and $O(m^2)$ memory. Cross-device alignment is achieved by rotating encoded embeddings among devices, so that all possible negative pairs are covered over $D$ rounds without ever requiring a full-batch all-gather. The resulting gradients are mathematically equivalent to those of the full-batch loss, but memory and communication requirements are reduced (Çağatan, 2024).
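The rotation scheme can be simulated on a single machine to check that chunked accumulation reproduces the full-batch loss value. The sketch below is a plain NumPy simulation, not real multi-device code: `pair_loss_sum`, the ring order `(dev + r) % D`, and the sizes are illustrative assumptions, and `m` denotes the per-device batch size $2N/D$. It verifies equality of the loss value only; equality of gradients follows because the loss is a sum of independent per-pair terms.

```python
import numpy as np

def pair_loss_sum(Za, Zb, labels, mask, t=5.0, b=-10.0):
    """Sum of masked sigmoid losses between two chunks of embeddings."""
    S = t * (Za @ Zb.T) + b
    return (np.logaddexp(0.0, -labels * S) * mask).sum()

rng = np.random.default_rng(0)
D, m, d = 4, 2, 8                        # devices, views per device, embedding dim
n = D * m                                # total views (2N)
Z = rng.normal(size=(n, d))
Z /= np.linalg.norm(Z, axis=1, keepdims=True)
ids = np.arange(n) % (n // 2)            # assumed layout: views i and i + n/2 are twins

# Full-batch reference
labels = np.where(ids[:, None] == ids[None, :], 1.0, -1.0)
mask = 1.0 - np.eye(n)
full = pair_loss_sum(Z, Z, labels, mask) / mask.sum()

# Ring simulation: each device keeps its own chunk and sees one visiting chunk per round.
chunks = [np.arange(k * m, (k + 1) * m) for k in range(D)]
total = 0.0
for r in range(D):                       # D rounds cover every (device, chunk) combination
    for dev in range(D):
        own, visiting = chunks[dev], chunks[(dev + r) % D]
        lab = np.where(ids[own][:, None] == ids[visiting][None, :], 1.0, -1.0)
        msk = (own[:, None] != visiting[None, :]).astype(float)   # drop i == j
        total += pair_loss_sum(Z[own], Z[visiting], lab, msk)
ring = total / (n * (n - 1))
```

Each inner step touches only an $m \times m$ block of similarities, which is where the $O(m^2)$ per-device memory bound comes from.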

Hyperparameters

Temperature: Fixed $t \in \{1,2,5,10\}$ ; $t=5$ empirically optimal for CIFAR.
Bias: Learnable $b$ , initialized to $-10$ .
Architecture: ResNet-18 backbone; 3-layer MLP projector (1024–1024–128 dims).
Training Regime: 1000 epochs, LARS optimizer with cosine warmup, learning rate $0.3$, batch sizes $64$–$1024$.

5. Empirical Evaluation and Comparative Results

Linear Probe Performance (accuracy, %)

Dataset	SimCLR* (Repro)	SigCLR
CIFAR-10	91.69	91.77
CIFAR-100	65.49	66.98
Tiny-ImageNet	48.16	48.94

Batch Size Robustness (CIFAR-10/100)

Batch Size	SimCLR* (C10)	SigCLR (C10)	SimCLR* (C100)	SigCLR (C100)
64	90.56	91.26	62.85	66.52
128	91.69	91.77	65.49	66.98
256	92.23	92.11	66.67	67.86
512	92.42	92.59	67.26	68.57
1024	92.26	92.62	66.49	68.58

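These results show that SigCLR matches or slightly surpasses the reproduced SimCLR baseline (softmax-based InfoNCE) in all but one setting (CIFAR-10 at batch size 256, 92.11 vs. 92.23). SigCLR is also more robust to small batches: at batch size 64 it reaches 91.26% on CIFAR-10 and 66.52% on CIFAR-100, versus 90.56% and 62.85% for SimCLR.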
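To make the size of these differences concrete, the short script below recomputes the SigCLR minus SimCLR gap for each batch size. The numbers are copied from the batch-size table above; the variable names are illustrative.

```python
# Linear-probe accuracies (%) copied from the batch-size table above.
batch_sizes = [64, 128, 256, 512, 1024]
simclr_c10  = [90.56, 91.69, 92.23, 92.42, 92.26]
sigclr_c10  = [91.26, 91.77, 92.11, 92.59, 92.62]
simclr_c100 = [62.85, 65.49, 66.67, 67.26, 66.49]
sigclr_c100 = [66.52, 66.98, 67.86, 68.57, 68.58]

# Positive values mean SigCLR is ahead.
delta_c10  = [round(s - b, 2) for s, b in zip(sigclr_c10, simclr_c10)]
delta_c100 = [round(s - b, 2) for s, b in zip(sigclr_c100, simclr_c100)]

for bs, d10, d100 in zip(batch_sizes, delta_c10, delta_c100):
    print(f"batch {bs:4d}: CIFAR-10 {d10:+.2f}, CIFAR-100 {d100:+.2f}")
```

The largest gap is on CIFAR-100 at batch size 64 (+3.67 points), while the CIFAR-10 differences all stay within about 0.7 points.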

Ablation: Temperature and Bias

Setup	$t=1$	$t=2$	$t=5$	$t=10$
Fixed $t$ + learnable $b$	90.76	91.25	91.53	89.80
Learnable $t$ + $b$	84.16	84.11	84.07	84.67
Fixed $t$ , no bias	87.17	84.22	27.15	17.37

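Making the temperature learnable lowers accuracy by roughly 5 to 7.5 points (to about 84% in every column), and removing the bias collapses accuracy to 27.15% at $t=5$ and 17.37% at $t=10$. This suggests that the fixed temperature and the learnable bias are both essential for reliable optimization and for avoiding collapse.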

6. Key Properties and Applications

Sigmoid cross-entropy loss is particularly suited for contrastive representation learning settings with:

Limited hardware resources: The lack of global normalization enables effective learning even with small batch sizes or limited-device scenarios.
Scalability requirements: Efficient multi-device implementation reduces communication overhead and memory usage, facilitating large-scale training.
Class imbalance sensitivity: The learnable bias counteracts the vast excess of negative pairs from initialization onward, promoting more stable, effective representation learning.

Domains include visual representation learning (SigCLR) and, as demonstrated by SigLIP, language–image pretraining. A plausible implication is that broader adoption could occur in any context where per-pair scalability and imbalance mitigation outweigh the potential benefits of global negative normalization (Çağatan, 2024).

7. Summary and Outlook

The sigmoid cross-entropy loss, as instantiated by SigCLR, offers an alternative to global softmax-based contrastive objectives by formulating representation alignment as a dense set of independent binary classification tasks. Combined with a learnable bias and a fixed temperature, this objective matches or exceeds the performance of InfoNCE-based methods on small- to medium-scale vision benchmarks and offers improved robustness and scalability in hardware-constrained or distributed environments. Future work will likely extend the paradigm to larger-scale and multimodal domains, further exploring efficient distributed inference and training regimes (Çağatan, 2024).


References

Çağatan (2024). SigCLR: Sigmoid Contrastive Learning of Visual Representations.
