Sigmoid Cross-Entropy Loss
- Sigmoid cross-entropy loss is defined as a binary logistic loss function that computes independent pairwise classifications in contrastive learning.
- Its formulation, with a fixed temperature and learnable bias, ensures gradient stability and performance especially in small batch regimes.
- The method simplifies implementations compared to softmax-based objectives and demonstrates competitive accuracy on vision benchmarks.
The sigmoid cross-entropy loss, often termed logistic loss, is a key objective in modern contrastive representation learning. Departing from softmax-based losses that globally normalize over all negatives, the sigmoid cross-entropy loss treats each pair of examples as an independent binary classification problem. This approach, as instantiated in recent frameworks such as SigCLR, achieves performance competitive with or superior to InfoNCE-based methods, especially in regimes constrained by batch size or distributed memory, and introduces specific architectural and optimization considerations (Çağatan, 2024).
1. Mathematical Formulation
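Let $\{x_k\}_{k=1}^{N}$ denote a minibatch of $N$ samples, each augmented independently twice to produce $2N$ views. Embedding these via a shared encoder and projector yields vectors $z_1, \dots, z_{2N}$. Pairwise similarity is computed via dot product or cosine similarity: $s_{ij} = z_i^\top z_j$ (with $z_i$ normalized if desired). Each pair $(i, j)$ receives a binary label: $y_{ij} = 1$ if $(i, j)$ forms a positive pair (two views of the same image), $y_{ij} = 0$ otherwise; trivial diagonal pairs ($i = j$) are masked out.
The loss for each pair is defined by introducing a temperature $\tau$ and a learnable bias $b$:
$$\ell_{ij} = \frac{s_{ij}}{\tau} + b$$
The logistic loss for the pair is then:
$$\mathcal{L}_{ij} = -\left[ y_{ij} \log \sigma(\ell_{ij}) + (1 - y_{ij}) \log\left(1 - \sigma(\ell_{ij})\right) \right]$$
where $\sigma$ denotes the sigmoid function. The batch-averaged loss over all valid, off-diagonal pairs is:
$$\mathcal{L} = \frac{1}{\sum_{i,j} m_{ij}} \sum_{i=1}^{2N} \sum_{j=1}^{2N} m_{ij}\, \mathcal{L}_{ij}$$
where $m_{ij} = 0$ when $i = j$ (diagonal), $m_{ij} = 1$ otherwise.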
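As a minimal sketch of the per-pair loss, the snippet below assumes labels in $\{0, 1\}$ and a logit of the form similarity divided by temperature plus bias; the function names and the numeric values of `tau` and `b` are illustrative choices, not taken from SigCLR. It uses the identity $-[y \log\sigma(\ell) + (1-y)\log(1-\sigma(\ell))] = \mathrm{softplus}(\ell) - y\,\ell$, which avoids overflow for large logits.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def pair_loss(s: float, y: int, tau: float, b: float) -> float:
    """Binary cross-entropy for one pair with similarity s and label y in {0, 1}."""
    logit = s / tau + b  # scaled and shifted similarity
    # Equivalent to -[y*log(sigmoid(logit)) + (1-y)*log(1-sigmoid(logit))].
    return softplus(logit) - y * logit


# Illustrative values only (assumptions, not SigCLR's settings).
tau, b = 0.5, -2.0
positive = pair_loss(s=0.9, y=1, tau=tau, b=b)  # similar views, labeled positive
negative = pair_loss(s=0.9, y=0, tau=tau, b=b)  # same similarity, labeled negative
print(f"positive pair: {positive:.4f}, negative pair: {negative:.4f}")
```

Because each pair's loss depends only on its own similarity and label, the terms can be evaluated in any order or on any device before averaging.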
2. Temperature and Bias: Roles and Optimization
Temperature $\tau$ modulates the steepness of the sigmoid, akin to the scaling of logits in softmax-based InfoNCE. The learnable bias $b$, initialized to a large negative value, neutralizes the initial class imbalance: with the sigmoid output near zero for every pair, the negative pairs are already classified almost correctly and contribute near-zero gradients at initialization. This is crucial because negatives dominate (each of the $2N$ views has one positive but $2N - 2$ negatives). As training proceeds, $b$ is updated via backpropagation.
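The effect of the bias at initialization can be checked with a small calculation. For a binary cross-entropy on a logit $\ell$, the gradient with respect to the logit is $\sigma(\ell) - y$. The sketch below assumes that all similarities are roughly zero at initialization and compares a zero bias with a bias of $-10$; both the value $-10$ and the count of 256 views are illustrative assumptions rather than settings from the source.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gradient_mass_at_init(num_views: int, b: float, s: float = 0.0, tau: float = 1.0):
    """Per-anchor gradient magnitude w.r.t. the logits, split by pair type.

    Assumes every pair has the same similarity s at initialization.
    Returns (positive_mass, negative_mass).
    """
    p = sigmoid(s / tau + b)
    positive_mass = abs(p - 1.0)                    # one positive per anchor
    negative_mass = (num_views - 2) * abs(p - 0.0)  # 2N - 2 negatives per anchor
    return positive_mass, negative_mass


for bias in (0.0, -10.0):
    pos, neg = gradient_mass_at_init(num_views=256, b=bias)
    print(f"b={bias:6.1f}  positive={pos:.4f}  negatives={neg:.4f}")
```

With a zero bias the negatives' combined gradient swamps the single positive; with a large negative bias the negatives contribute almost nothing and the positive pair drives the first updates.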
Empirical ablations on CIFAR-10 demonstrate that both a fixed temperature and a learnable bias are essential for stable and high-performing training. Allowing the temperature to be freely learnable degrades accuracy, and omitting the bias leads to catastrophic collapse, especially at higher temperature values (Çağatan, 2024).
3. Comparison with Softmax/InfoNCE Objectives
Softmax-based (InfoNCE) losses compute, for each anchor sample $i$, a global normalization over all other samples via a partition sum $\sum_{k \neq i} \exp(s_{ik}/\tau)$. This coupling necessitates large batch sizes for the denominator to provide sufficiently diverse negatives, and complex global operations in implementation.
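To make the coupling concrete, here is a minimal sketch of the InfoNCE loss for a single anchor, written as the log of the partition sum minus the positive logit. The function name, the similarity values, and the temperature are hypothetical and chosen only for illustration.

```python
import math


def info_nce_anchor(sims, pos_index, tau):
    """InfoNCE loss for one anchor.

    sims: similarities between the anchor and every OTHER view in the batch.
    pos_index: position of the anchor's positive within sims.
    """
    logits = [s / tau for s in sims]
    # Log of the partition sum, computed with the max-shift trick for stability.
    m = max(logits)
    log_partition = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_partition - logits[pos_index]


base = info_nce_anchor([0.9, 0.1, 0.1], pos_index=0, tau=0.5)
harder = info_nce_anchor([0.9, 0.8, 0.1], pos_index=0, tau=0.5)
print(f"base={base:.4f}  with one harder negative={harder:.4f}")
```

Changing a single negative's similarity alters the anchor's whole loss through the shared denominator, which is why every anchor needs access to all other views in the batch.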
In contrast, the sigmoid cross-entropy loss decomposes into independent binary classification subproblems for each pair: no partition function, no global trade-off between positives and a joint negative pool. Advantages include:
- Simplicity: Code is more straightforward; no need to compute the global normalization.
- Small-Batch Robustness: Each negative pair provides a learning signal even in small batches.
- Gradient Stability: The learnable bias compensates for negative-to-positive ratio skew, providing strong and stable gradients from the first step.
- Empirical Performance: As shown in SigLIP for language–image and SigCLR for vision, sigmoid losses with learnable bias can match or outperform softmax for contrastive objectives (Çağatan, 2024).
4. Practical Implementation and Computational Considerations
Data Construction and Sampling
Each training iteration forms $2N$ views via independent augmentations of $N$ images. After encoding, embeddings are pair-indexed as follows: for each view $i$, its positive is the twin augmentation $j$ of the same image ($y_{ij} = 1$); all other off-diagonal pairs receive $y_{ij} = 0$.
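The pair indexing can be sketched as two small matrices. The layout below is an assumption made for illustration: views are ordered so that view `i` and view `i + N` (modulo `2N`) come from the same image. The source does not state which ordering SigCLR uses, and `build_masks` is a hypothetical helper name.

```python
def build_masks(n_images: int):
    """Return (sim_mask, loss_mask) for 2N views.

    Assumes view i and view (i + N) mod 2N are the two augmentations of one image.
    sim_mask[i][j]  = 1 for positive pairs, 0 otherwise.
    loss_mask[i][j] = 0 on the diagonal, 1 otherwise.
    """
    m = 2 * n_images
    sim_mask = [[1 if j == (i + n_images) % m else 0 for j in range(m)] for i in range(m)]
    loss_mask = [[0 if i == j else 1 for j in range(m)] for i in range(m)]
    return sim_mask, loss_mask


sim_mask, loss_mask = build_masks(n_images=2)
for row in sim_mask:
    print(row)
```

Each row of the label matrix contains exactly one positive, and the off-diagonal mask keeps $2N(2N-1)$ pairs.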
Efficient Computation
Vectorized implementation employs two masks:
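- sim_mask: Pair labels $y_{ij}$
- loss_mask: Off-diagonal indicator $m_{ij}$
For all pairs, compute:
- $\ell_{ij} = s_{ij}/\tau + b$ (scaled+shifted similarities)
- $\mathcal{L}_{ij} = -\left[ y_{ij} \log \sigma(\ell_{ij}) + (1 - y_{ij}) \log\left(1 - \sigma(\ell_{ij})\right) \right]$ (element-wise)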
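The two-mask computation can be sketched end to end as follows. This is plain Python with nested lists for readability; a practical implementation would use a tensor library, and this is not SigCLR's actual code. The view ordering (view `i` paired with view `i + N`), the logit form (similarity divided by `tau`, plus `b`), and the example values are assumptions.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def normalize(v):
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def sigmoid_contrastive_loss(views, tau: float, b: float) -> float:
    """Masked mean of pairwise logistic losses over 2N views.

    Assumes views[i] and views[(i + N) % 2N] come from the same image.
    """
    z = [normalize(v) for v in views]
    m, n = len(z), len(z) // 2
    sim = [[sum(a * c for a, c in zip(z[i], z[j])) for j in range(m)] for i in range(m)]
    sim_mask = [[1 if j == (i + n) % m else 0 for j in range(m)] for i in range(m)]
    loss_mask = [[0 if i == j else 1 for j in range(m)] for i in range(m)]
    logits = [[sim[i][j] / tau + b for j in range(m)] for i in range(m)]
    # Element-wise binary cross-entropy: softplus(logit) - label * logit.
    pair = [[softplus(logits[i][j]) - sim_mask[i][j] * logits[i][j] for j in range(m)]
            for i in range(m)]
    total = sum(loss_mask[i][j] * pair[i][j] for i in range(m) for j in range(m))
    return total / sum(map(sum, loss_mask))


e1, e2 = [1.0, 0.0], [0.0, 1.0]
aligned = sigmoid_contrastive_loss([e1, e2, e1, e2], tau=0.5, b=-1.0)
misaligned = sigmoid_contrastive_loss([e1, e2, e2, e1], tau=0.5, b=-1.0)
print(f"aligned={aligned:.4f}  misaligned={misaligned:.4f}")
```

In this toy case the loss is lower when twin views coincide and views of different images are orthogonal, which is the behavior the objective rewards.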
Multi-Device Training
SigCLR supports distributed training in which, with $D$ devices each holding a local batch, only local computations and memory are needed on each device. Cross-device alignment is achieved by rotating encoded embeddings among devices, ensuring all possible negative pairs are included over successive rounds without ever requiring a full-batch all-gather. The resulting gradients are mathematically equivalent to those of the classic full-batch loss, but memory and communication requirements are reduced (Çağatan, 2024).
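The rotation scheme can be simulated in a single process to see why it reproduces the full-batch loss. The sketch below is an illustration under stated assumptions, not SigCLR's distributed code: devices are modeled as list slices, the ring pass is a list rotation, each view carries its global index and its twin's index, and the names `make_items`, `block_loss`, and `chunked_loss` are hypothetical.

```python
import math


def softplus(x: float) -> float:
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def make_items(views):
    """Attach (global index, twin index) to each view; twin of i is (i + N) mod 2N."""
    m, n = len(views), len(views) // 2
    return [(i, (i + n) % m, v) for i, v in enumerate(views)]


def block_loss(rows, cols, tau, b):
    """Sum of pair losses between a local chunk (rows) and a visiting chunk (cols)."""
    total = 0.0
    for gi, twin, zi in rows:
        for gj, _, zj in cols:
            if gi == gj:
                continue  # diagonal pairs are masked out
            logit = sum(a * c for a, c in zip(zi, zj)) / tau + b
            total += softplus(logit) - (1 if gj == twin else 0) * logit
    return total


def chunked_loss(items, num_devices, tau, b):
    """Each simulated device keeps its chunk and sees the others one round at a time."""
    size = len(items) // num_devices  # assumes an even split
    local = [items[d * size:(d + 1) * size] for d in range(num_devices)]
    visiting = list(local)
    totals = [0.0] * num_devices
    for _ in range(num_devices):
        for d in range(num_devices):
            totals[d] += block_loss(local[d], visiting[d], tau, b)
        visiting = visiting[-1:] + visiting[:-1]  # pass chunks around the ring
    m = len(items)
    return sum(totals) / (m * (m - 1))


views = [[math.cos(0.7 * k), math.sin(0.7 * k)] for k in range(8)]
items = make_items(views)
print(chunked_loss(items, 1, 0.5, -2.0), chunked_loss(items, 4, 0.5, -2.0))
```

Because the loss is a plain sum over pairs, every (row chunk, column chunk) block is visited exactly once across the rounds, so the chunked total matches the single-device total up to floating-point summation order.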
Hyperparameters
- Temperature: Fixed $\tau$; the chosen value is empirically optimal for CIFAR.
- Bias: Learnable $b$, initialized to a large negative value.
- Architecture: ResNet-18 backbone; 3-layer MLP projector (1024→1024→128 dims).
- Training Regime: 1000 epochs, LARS optimizer with cosine warmup, learning rate $0.3$, batch sizes $64$–$1024$.
5. Empirical Evaluation and Comparative Results
Linear Probe Performance
| Dataset | SimCLR* (Repro) | SigCLR |
|---|---|---|
| CIFAR-10 | 91.69 | 91.77 |
| CIFAR-100 | 65.49 | 66.98 |
| Tiny-IN | 48.16 | 48.94 |
Batch Size Robustness (CIFAR-10/100)
| Batch Size | SimCLR* (C10) | SigCLR (C10) | SimCLR* (C100) | SigCLR (C100) |
|---|---|---|---|---|
| 64 | 90.56 | 91.26 | 62.85 | 66.52 |
| 128 | 91.69 | 91.77 | 65.49 | 66.98 |
| 256 | 92.23 | 92.11 | 66.67 | 67.86 |
| 512 | 92.42 | 92.59 | 67.26 | 68.57 |
| 1024 | 92.26 | 92.62 | 66.49 | 68.58 |
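This comparison shows that SigCLR matches or marginally surpasses SimCLR's softmax-based InfoNCE on established benchmarks; the one exception in the batch-size sweep is CIFAR-10 at batch size 256 (92.11 vs. 92.23). SigCLR is notably more robust to small batch sizes: at batch size 64 it leads by 0.70 points on CIFAR-10 (91.26 vs. 90.56) and by 3.67 points on CIFAR-100 (66.52 vs. 62.85).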
Ablation: Temperature and Bias
| Setup | | | | |
|---|---|---|---|---|
| Fixed $\tau$ + learnable $b$ | 90.76 | 91.25 | 91.53 | 89.80 |
| Learnable $\tau$ + $b$ | 84.16 | 84.11 | 84.07 | 84.67 |
| Fixed $\tau$, no bias | 87.17 | 84.22 | 27.15 | 17.37 |
This suggests that the fixed temperature and learnable bias are both essential for reliable optimization and avoiding collapse.
6. Key Properties and Applications
Sigmoid cross-entropy loss is particularly suited for contrastive representation learning settings with:
- Limited hardware resources: The lack of global normalization enables effective learning even with small batch sizes or limited-device scenarios.
- Scalability requirements: Efficient multi-device implementation reduces communication overhead and memory usage, facilitating large-scale training.
- Class imbalance sensitivity: The learnable bias instantly counteracts the vast excess of negative pairs, promoting more stable, effective representation learning.
Domains include visual representation learning (SigCLR) and, as demonstrated by SigLIP, language–image pretraining. A plausible implication is that broader adoption could occur in any context where per-pair scalability and imbalance mitigation outweigh the potential benefits of global negative normalization (Çağatan, 2024).
7. Summary and Outlook
The sigmoid cross-entropy loss, as instantiated by SigCLR, offers an alternative to global softmax-based contrastive objectives by formulating representation alignment as a dense set of independent binary classification tasks. Augmented by a learnable bias and fixed temperature, this objective matches or exceeds the performance of InfoNCE-based methods on small- to medium-scale vision benchmarks and offers improved robustness and scalability in hardware-constrained or distributed environments. Future work will likely extend the paradigm to larger-scale and multimodal domains, further exploring efficient distributed inference and training regimes (Çağatan, 2024).