
Dual Sigmoid Loss Function

Updated 8 January 2026
  • The paper demonstrates that dual sigmoid functions independently scale intra-class (pulling) and inter-class (pushing) gradients to optimize compactness and separation.
  • It shows improved noise robustness and convergence in tasks like face recognition and multi-modal representation alignment through sigmoid-based reweighting.
  • Empirical ablation results reveal that the full sigmoid approach consistently outperforms constant scaling and step functions in deep metric learning scenarios.

A dual sigmoid-based loss function denotes any loss design employing two distinct sigmoid functions to scale, modulate, or reweight intra-class and inter-class terms—primarily in metric learning, representation learning, and robust classification. Its central paradigm is to independently modulate the optimization drive on “pulling” (same-class) and “pushing” (different-class) sample pairs, explicitly balancing within-class compactness and between-class separability under noise, class imbalance, or multi-modal data sources. Notable instantiations include the SFace “sigmoid-constrained hypersphere” loss—designed for robust face recognition—and the bias-and-temperature-reparametrized sigmoid contrastive loss used in SigLIP/SigLIP2 for multi-modal representation alignment (Zhong et al., 2022, Bangachev et al., 23 Sep 2025).

1. Mathematical Formulation

Dual sigmoid-based losses combine two independently parameterized sigmoid functions. In the case of the SFace loss for face recognition, sample $i$ with embedding $x_i$ and class center $W_{y_i}$ (all $\ell_2$-normalized) produces intra- and inter-class angular terms:

$$\theta_{y_i} = \arccos(W_{y_i}^\top x_i), \qquad \theta_j = \arccos(W_j^\top x_i), \quad j \neq y_i$$

The per-sample loss is:

$$L_\mathrm{SFace}(x_i, y_i) = L_\mathrm{intra}(\theta_{y_i}) + L_\mathrm{inter}(\{\theta_j\}_{j\neq y_i})$$

where:

$$\begin{aligned} L_\mathrm{intra}(\theta_{y_i}) &= -[r_\mathrm{intra}(\theta_{y_i})]_b \cos\theta_{y_i} \\ L_\mathrm{inter}(\{\theta_j\}) &= \sum_{j\ne y_i} [r_\mathrm{inter}(\theta_j)]_b \cos\theta_j \end{aligned}$$

with $[\cdot]_b$ denoting the "block-gradient" (stop-gradient) operator. The scaling functions are sigmoids:

$$\begin{aligned} r_\mathrm{intra}(\theta) &= \frac{s}{1 + \exp(-k[\theta - a])} \\ r_\mathrm{inter}(\theta) &= \frac{s}{1 + \exp(+k[\theta - b])} \end{aligned}$$

where $s>0$ is a scale (commonly $64$), $k>0$ is the sharpness, and $a, b$ set the transition points for intra- and inter-class modulation, respectively (Zhong et al., 2022).
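The following minimal PyTorch sketch implements this formulation; the function name sface_loss, the tensor shapes, and the default hyperparameters (taken from the typical values listed in Section 3) are illustrative assumptions rather than the authors' reference implementation.

```python
import torch
import torch.nn.functional as F

def sface_loss(x, W, y, s=64.0, k=80.0, a=0.82, b=1.28):
    """Sketch of a dual sigmoid (SFace-style) loss.

    x: (B, d) embeddings, W: (C, d) class centers, y: (B,) integer labels.
    s, k: sigmoid scale and slope; a, b: intra-/inter-class transition angles (rad).
    """
    x = F.normalize(x, dim=1)                      # l2-normalize embeddings
    W = F.normalize(W, dim=1)                      # l2-normalize class centers
    cos = x @ W.t()                                # (B, C) cosine similarities
    theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))

    idx = torch.arange(x.size(0))
    cos_intra, theta_intra = cos[idx, y], theta[idx, y]

    # Sigmoid re-scales, detached to implement the block-gradient [.]_b semantics
    r_intra = (s / (1.0 + torch.exp(-k * (theta_intra - a)))).detach()
    r_inter = (s / (1.0 + torch.exp(k * (theta - b)))).detach()

    # Intra-class (pulling) term: -r_intra(theta_{y_i}) * cos(theta_{y_i})
    loss_intra = -r_intra * cos_intra

    # Inter-class (pushing) term: sum over j != y_i of r_inter(theta_j) * cos(theta_j)
    not_target = 1.0 - F.one_hot(y, num_classes=W.size(0)).float()
    loss_inter = (r_inter * cos * not_target).sum(dim=1)

    return (loss_intra + loss_inter).mean()
```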

In the sigmoid contrastive loss introduced in SigLIP/SigLIP2 (Bangachev et al., 23 Sep 2025), two parameters ($t$ as inverse temperature and $b$ as bias, or $b_\mathrm{rel}=b/t$ as relative bias) control the separation of positive (matched) and negative (unmatched) pairs:

$$\begin{aligned} L(\theta,\phi; t, b_\mathrm{rel}) = &\sum_{i=1}^N \log\bigl(1+\exp[-t(\langle U_i, V_i \rangle - b_\mathrm{rel})]\bigr) \\ &+ \sum_{i\ne j} \log\bigl(1+\exp[t(\langle U_i, V_j \rangle - b_\mathrm{rel})]\bigr) \end{aligned}$$
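A short sketch of this batch objective, using softplus(z) = log(1 + e^z), is given below; the tensor names and the choice to expose $t$ and $b_\mathrm{rel}$ as plain scalars are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def sigmoid_contrastive_loss(U, V, t, b_rel):
    """Sketch of the batch sigmoid contrastive loss.

    U, V: (N, d) l2-normalized embeddings from the two encoders/modalities;
    t: inverse temperature (scalar); b_rel: relative bias (scalar).
    """
    sim = U @ V.t()                                   # (N, N) pairwise similarities
    # z_ij = +1 for matched pairs (i == j), -1 for unmatched pairs
    z = 2.0 * torch.eye(sim.size(0), device=sim.device) - 1.0
    # log(1 + exp(-z_ij * t * (<U_i, V_j> - b_rel))), summed over all pairs
    return F.softplus(-z * t * (sim - b_rel)).sum()

# Illustrative usage with learnable scalars (an assumption, not the reference setup):
# t = torch.nn.Parameter(torch.tensor(10.0))
# b_rel = torch.nn.Parameter(torch.tensor(0.0))
```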

2. Motivation and Geometric Intuition

Traditional loss strategies, such as Center Loss or softmax variants, enforce fixed rates for intra-class collapse and inter-class separation, often leading to overfitting or sensitivity to outliers and label noise. Dual sigmoid-based losses address these issues by:

  • Applying moderate rather than absolute optimization pressure to examples already close to the class center or already well separated, mitigating overfitting especially on noisy or low-quality data (Zhong et al., 2022).
  • Allowing the training signal to diminish dynamically as compactness or separation is achieved, which stabilizes training and avoids the "freezing" of intra-class variance minimization observed with Center Loss (Grassa et al., 2020).

Geometric analysis reveals that, for hypersphere-based dual sigmoid loss, gradient magnitudes fall smoothly to zero in regions where further pulling/pushing provides little benefit, constraining model updates to the tangent spaces of the hypersphere and promoting both compactness and robustness.
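To make the falloff concrete, the small sketch below evaluates the intra-class scale $r_\mathrm{intra}(\theta)$ at a few angles using the typical hyperparameters listed in the next section; the specific sample angles are arbitrary.

```python
import math

def r_intra(theta, s=64.0, k=80.0, a=0.82):
    """Intra-class sigmoid scale: near 0 for theta << a, near s for theta >> a."""
    return s / (1.0 + math.exp(-k * (theta - a)))

# Samples already close to their class center receive almost no pull, while
# poorly aligned samples receive close to the full scale s.
for theta in (0.3, 0.6, 0.8, 0.9, 1.2):
    print(f"theta = {theta:.1f} rad -> r_intra = {r_intra(theta):.2f}")
```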

3. Hyperparameterization and Gradient Properties

Each sigmoid scaling function is parameterized as follows:

| Parameter | Role | Typical range |
| --- | --- | --- |
| $s$ | Maximum scale ("speed") | $\sim 64$ |
| $k$ | Sharpness (sigmoid slope) | $\sim 80$ |
| $a$ | Intra-class transition center | $[0.80, 0.84]$ rad |
| $b$ | Inter-class transition center | $\sim 1.28$ rad |

Tuning $a$ upward shifts the intra-class sigmoid so that strong pulling remains suppressed up to larger angular errors, which is beneficial under high label noise; $b$ is comparatively stable across settings.

Gradient computation proceeds via

$$\frac{\partial L_{\mathrm{SFace}}}{\partial x_i} = -\, r_{\mathrm{intra}}(\theta_{y_i})\,\frac{\partial \cos\theta_{y_i}}{\partial x_i} + \sum_{j\neq y_i} r_{\mathrm{inter}}(\theta_j)\,\frac{\partial \cos\theta_j}{\partial x_i}$$

ensuring that the re-scaling acts as a fixed per-sample multiplier on the cosine gradients and that no gradient flows back through the sigmoid re-scale functions themselves.
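The short autograd check below illustrates these stop-gradient semantics on a single sample/center pair; the shapes and hyperparameter values are arbitrary illustrative choices.

```python
import torch
import torch.nn.functional as F

# Check the block-gradient semantics on a single (sample, center) pair: with the
# sigmoid scale detached, d(loss)/dx must equal -r_intra * d(cos theta)/dx, with
# no contribution from the sigmoid's own derivative.
torch.manual_seed(0)
x = torch.randn(8, requires_grad=True)
w = F.normalize(torch.randn(8), dim=0)
s, k, a = 64.0, 80.0, 0.82

def cos_theta(x):
    return F.normalize(x, dim=0) @ w

cos = cos_theta(x)
theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
r = (s / (1.0 + torch.exp(-k * (theta - a)))).detach()

(grad_loss,) = torch.autograd.grad(-r * cos, x)      # gradient of the re-scaled term
(grad_cos,) = torch.autograd.grad(cos_theta(x), x)   # gradient of cos(theta) alone
print(torch.allclose(grad_loss, -r * grad_cos))      # expected: True
```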

4. Comparative Performance and Ablation Analysis

Empirical ablation indicates that sigmoid re-scaling of both gradient terms outperforms both constant and piecewise-hard-threshold alternatives in deep face recognition:

  • Constant scaling: 90.05% five-set mean accuracy
  • Piecewise (step function): 94.64%
  • Full sigmoid (SFace): 94.80%

The dual sigmoid approach maintains a larger gap in intra-class angular distribution between clean and noisy subsets (e.g., $\Delta \approx 7.6^\circ$ for SFace vs. $4.8^\circ$ for ArcFace), evidencing greater noise robustness (Zhong et al., 2022).

On synthetic "random-sphere" data, employing explicit $t$ and $b_\mathrm{rel}$ parameters leads to faster convergence and larger separation margins, thereby improving retrieval robustness in contrastive representation learning (Bangachev et al., 23 Sep 2025).

5. Extension to Multi-Modal and Contrastive Frameworks

In contrastive settings (e.g., SigLIP), the temperature ($t$) and bias ($b_\mathrm{rel}$) parameters implicitly act as global sigmoidal thresholds, shaping the distributions of intra- and inter-modal similarity scores. The concept of $(m, b_\mathrm{rel})$-Constellations provides a rigorous combinatorial-geometric underpinning: global minimizers correspond to configurations where all positive pairs are separated from negatives by a margin $m$ shifted by $b_\mathrm{rel}$.
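One concrete, simplified reading of this separation condition is sketched below; the symmetric use of the margin on both sides of $b_\mathrm{rel}$ is an illustrative assumption, not the paper's formal definition.

```python
import torch

def is_separated(U, V, m, b_rel):
    """Illustrative margin check in the spirit of an (m, b_rel)-constellation:
    every matched pair scores at least b_rel + m and every unmatched pair at
    most b_rel - m. (An assumed reading, not the formal definition.)"""
    sim = U @ V.t()
    off_diag = ~torch.eye(sim.size(0), dtype=torch.bool)
    return bool((sim.diagonal() >= b_rel + m).all() and (sim[off_diag] <= b_rel - m).all())
```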

In practice:

  • Large $t$ enforces sharp decision boundaries.
  • Trainable $b_\mathrm{rel}$ aligns positive and negative distributions, crucial for automatic separation across modalities and for closing the modality gap (Bangachev et al., 23 Sep 2025).
  • Freezing $b_\mathrm{rel}$ enables explicit control of the induced margin; adapter-like effects can be achieved without additional network parameters.

6. Robustness to Dataset Noise and Implementation Guidelines

Dual sigmoid-based losses excel in environments where class labels are imperfect or some samples are systematically noisy. Practical recommendations include tuning the intra-class sigmoid center $a$ rightward (toward higher angles) as noise increases, and maintaining a high $k$ for sharp control; the inter-class center $b$ is comparatively stable. This moderation prevents over-pulling of noisy or uncertain instances, and prevents over-pushing once inter-class separability is sufficient.

The block-gradient variant is essential: it avoids back-propagating through the scaling functions themselves, maintaining the intended per-sample re-weighting.

Implementation typically requires:

  • $\ell_2$ normalization of both features and class weights
  • Per-sample computation of angular distances and cosine similarities
  • Independent sigmoid re-scales with block-gradient semantics
  • Batchwise aggregation and update
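A minimal training step following this checklist might look like the sketch below, reusing the hypothetical sface_loss function from Section 1; the backbone, shapes, and optimizer settings are placeholders.

```python
import torch

# Illustrative shapes: 512-d embeddings, 1000 identities, batch of 128.
embed_dim, num_classes, batch = 512, 1000, 128

backbone = torch.nn.Linear(1024, embed_dim)            # stand-in for a real feature extractor
W = torch.nn.Parameter(torch.randn(num_classes, embed_dim))
opt = torch.optim.SGD(list(backbone.parameters()) + [W], lr=0.1, momentum=0.9)

inputs = torch.randn(batch, 1024)                      # stand-in for a data batch
labels = torch.randint(0, num_classes, (batch,))

opt.zero_grad()
loss = sface_loss(backbone(inputs), W, labels)         # normalization happens inside the loss
loss.backward()
opt.step()
```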

7. Theoretical Characterization and Geometric Limits

Under dual sigmoid-based regimes with free temperature and bias (contrastive), global minima are characterized by strict separation of positive and negative similarities, quantified by the $(m, b_\mathrm{rel})$-Constellation formalism. Combinatorial-geometric arguments (via spherical codes) relate the embedding dimension $d$, achievable margin $m$, and dataset cardinality $N$, prescribing fundamental limits on deployable capacity and separation. Such dual-sigmoid structures explain empirically observed modality gaps and inform architectural or regularization choices (Bangachev et al., 23 Sep 2025).
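A classical spherical-code style calculation (a general illustration of such limits, not the specific bound derived in the cited work) shows how pairwise separation alone already caps capacity: if unit vectors $u_1, \dots, u_N$ satisfy $\langle u_i, u_j \rangle \le -\varepsilon$ for all $i \neq j$ with $\varepsilon > 0$, then

$$0 \le \Bigl\lVert \sum_{i=1}^N u_i \Bigr\rVert^2 = N + \sum_{i \neq j} \langle u_i, u_j \rangle \le N - N(N-1)\,\varepsilon \quad\Longrightarrow\quad N \le 1 + \frac{1}{\varepsilon},$$

and a separate classical argument bounds $N \le d + 1$ in $\mathbb{R}^d$, illustrating how dimension and pairwise separation jointly limit the number of mutually repelling directions.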


The dual sigmoid-based loss function paradigm is now a core methodological tool in robust metric learning and representation alignment, facilitating nuanced control over intra-class compactness and inter-class separation, with demonstrated benefits for noise-robustness, convergence, and multi-modal retrieval accuracy (Zhong et al., 2022, Bangachev et al., 23 Sep 2025).
