
Dual Sigmoid Loss Function

Updated 8 January 2026
  • The paper demonstrates that dual sigmoid functions independently scale intra-class (pulling) and inter-class (pushing) gradients to optimize compactness and separation.
  • It shows improved noise robustness and convergence in tasks like face recognition and multi-modal representation alignment through sigmoid-based reweighting.
  • Empirical ablation results reveal that the full sigmoid approach consistently outperforms constant scaling and step functions in deep metric learning scenarios.

A dual sigmoid-based loss function denotes any loss design employing two distinct sigmoid functions to scale, modulate, or reweight intra-class and inter-class terms—primarily in metric learning, representation learning, and robust classification. Its central paradigm is to independently modulate the optimization drive on “pulling” (same-class) and “pushing” (different-class) sample pairs, explicitly balancing within-class compactness and between-class separability under noise, class imbalance, or multi-modal data sources. Notable instantiations include the SFace “sigmoid-constrained hypersphere” loss—designed for robust face recognition—and the bias-and-temperature-reparametrized sigmoid contrastive loss used in SigLIP/SigLIP2 for multi-modal representation alignment (Zhong et al., 2022, Bangachev et al., 23 Sep 2025).

1. Mathematical Formulation

Dual sigmoid-based losses combine two independently parameterized sigmoid functions. In the case of the SFace loss for face recognition, sample $i$ with embedding $x_i$ and class center $W_{y_i}$ (all $\ell_2$-normalized) produces intra- and inter-class angular terms:

$$\theta_{y_i} = \arccos(W_{y_i}^\top x_i), \qquad \theta_j = \arccos(W_j^\top x_i), \quad j \neq y_i$$

The per-sample loss is:

$$L_\mathrm{SFace}(x_i, y_i) = L_\mathrm{intra}(\theta_{y_i}) + L_\mathrm{inter}(\{\theta_j\}_{j\neq y_i})$$

where:

$$\begin{aligned} L_\mathrm{intra}(\theta_{y_i}) &= -[r_\mathrm{intra}(\theta_{y_i})]_b \cos\theta_{y_i} \\ L_\mathrm{inter}(\{\theta_j\}) &= \sum_{j\ne y_i} [r_\mathrm{inter}(\theta_j)]_b \cos\theta_j \end{aligned}$$

with $[\cdot]_b$ denoting the "block-gradient" (stop-gradient) operator. The scaling functions are sigmoids:

$$\begin{aligned} r_\mathrm{intra}(\theta) &= \frac{s}{1 + \exp(-k[\theta - a])} \\ r_\mathrm{inter}(\theta) &= \frac{s}{1 + \exp(+k[\theta - b])} \end{aligned}$$

where $s>0$ is a scale (commonly $64$), $k>0$ is the sharpness, and $a, b$ set the transition points for intra- and inter-class modulation, respectively (Zhong et al., 2022).
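The following minimal PyTorch sketch implements this formulation; the function name sface_loss, the tensor shapes, and the default hyperparameters (taken from the typical values listed in Section 3) are illustrative assumptions rather than the authors' reference implementation.

```python
import torch
import torch.nn.functional as F

def sface_loss(x, W, y, s=64.0, k=80.0, a=0.82, b=1.28):
    """Sketch of a dual sigmoid (SFace-style) loss.

    x: (B, d) embeddings, W: (C, d) class centers, y: (B,) integer labels.
    s, k: sigmoid scale and slope; a, b: intra-/inter-class transition angles (rad).
    """
    x = F.normalize(x, dim=1)                      # l2-normalize embeddings
    W = F.normalize(W, dim=1)                      # l2-normalize class centers
    cos = x @ W.t()                                # (B, C) cosine similarities
    theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))

    idx = torch.arange(x.size(0))
    cos_intra, theta_intra = cos[idx, y], theta[idx, y]

    # Sigmoid re-scales, detached to implement the block-gradient [.]_b semantics
    r_intra = (s / (1.0 + torch.exp(-k * (theta_intra - a)))).detach()
    r_inter = (s / (1.0 + torch.exp(k * (theta - b)))).detach()

    # Intra-class (pulling) term: -r_intra(theta_{y_i}) * cos(theta_{y_i})
    loss_intra = -r_intra * cos_intra

    # Inter-class (pushing) term: sum over j != y_i of r_inter(theta_j) * cos(theta_j)
    not_target = 1.0 - F.one_hot(y, num_classes=W.size(0)).float()
    loss_inter = (r_inter * cos * not_target).sum(dim=1)

    return (loss_intra + loss_inter).mean()
```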

In the sigmoid contrastive loss introduced in SigLIP/SigLIP2 (Bangachev et al., 23 Sep 2025), two parameters ($t$ as inverse temperature and $b$ as bias, or $b_\mathrm{rel}=b/t$ as relative bias) control the separation of positive (matched) and negative (unmatched) pairs:

$$\begin{aligned} L(\theta,\phi; t, b_\mathrm{rel}) = &\sum_{i=1}^N \log\bigl(1+\exp[-t(\langle U_i, V_i \rangle - b_\mathrm{rel})]\bigr) \\ &+ \sum_{i\ne j} \log\bigl(1+\exp[t(\langle U_i, V_j \rangle - b_\mathrm{rel})]\bigr) \end{aligned}$$
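A short sketch of this batch objective, using softplus(z) = log(1 + e^z), is given below; the tensor names and the choice to expose $t$ and $b_\mathrm{rel}$ as plain scalars are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def sigmoid_contrastive_loss(U, V, t, b_rel):
    """Sketch of the batch sigmoid contrastive loss.

    U, V: (N, d) l2-normalized embeddings from the two encoders/modalities;
    t: inverse temperature (scalar); b_rel: relative bias (scalar).
    """
    sim = U @ V.t()                                   # (N, N) pairwise similarities
    # z_ij = +1 for matched pairs (i == j), -1 for unmatched pairs
    z = 2.0 * torch.eye(sim.size(0), device=sim.device) - 1.0
    # log(1 + exp(-z_ij * t * (<U_i, V_j> - b_rel))), summed over all pairs
    return F.softplus(-z * t * (sim - b_rel)).sum()

# Illustrative usage with learnable scalars (an assumption, not the reference setup):
# t = torch.nn.Parameter(torch.tensor(10.0))
# b_rel = torch.nn.Parameter(torch.tensor(0.0))
```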

2. Motivation and Geometric Intuition

Traditional loss strategies, such as Center Loss or softmax variants, enforce fixed rates for intra-class collapse and inter-class separation, often leading to overfitting or sensitivity to outliers and label noise. Dual sigmoid-based losses address these issues by:

  • Applying moderate rather than absolute optimization pressure to examples already close to the class center or already well separated, mitigating overfitting especially on noisy or low-quality data (Zhong et al., 2022).
  • Allowing the training signal to diminish dynamically as compactness or separation is achieved, which stabilizes training and avoids the "freezing" of intra-class variance minimization observed with Center Loss (Grassa et al., 2020).

Geometric analysis reveals that, for hypersphere-based dual sigmoid loss, gradient magnitudes fall smoothly to zero in regions where further pulling/pushing provides little benefit, constraining model updates to the tangent spaces of the hypersphere and promoting both compactness and robustness.
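To make the falloff concrete, the small sketch below evaluates the intra-class scale $r_\mathrm{intra}(\theta)$ at a few angles using the typical hyperparameters listed in the next section; the specific sample angles are arbitrary.

```python
import math

def r_intra(theta, s=64.0, k=80.0, a=0.82):
    """Intra-class sigmoid scale: near 0 for theta << a, near s for theta >> a."""
    return s / (1.0 + math.exp(-k * (theta - a)))

# Samples already close to their class center receive almost no pull, while
# poorly aligned samples receive close to the full scale s.
for theta in (0.3, 0.6, 0.8, 0.9, 1.2):
    print(f"theta = {theta:.1f} rad -> r_intra = {r_intra(theta):.2f}")
```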

3. Hyperparameterization and Gradient Properties

Each sigmoid scaling function is parameterized as follows:

| Parameter | Role | Typical range |
| --- | --- | --- |
| $s$ | Maximum scale ("speed") | $\sim 64$ |
| $k$ | Sharpness (sigmoid slope) | $\sim 80$ |
| $a$ | Intra-class transition center | $[0.80, 0.84]$ rad |
| $b$ | Inter-class transition center | $\sim 1.28$ rad |

Tuning $a$ upward shifts the intra-class sigmoid so that strong pulling remains suppressed up to larger angular errors, which is beneficial under high label noise; $b$ is comparatively stable across settings.

Gradient computation proceeds via

$$\frac{\partial L_{\mathrm{SFace}}}{\partial x_i} = -\, r_{\mathrm{intra}}(\theta_{y_i})\,\frac{\partial \cos\theta_{y_i}}{\partial x_i} + \sum_{j\neq y_i} r_{\mathrm{inter}}(\theta_j)\,\frac{\partial \cos\theta_j}{\partial x_i}$$

ensuring that the re-scaling acts as a fixed per-sample multiplier on the cosine gradients and that no gradient flows back through the sigmoid re-scale functions themselves.
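The short autograd check below illustrates these stop-gradient semantics on a single sample/center pair; the shapes and hyperparameter values are arbitrary illustrative choices.

```python
import torch
import torch.nn.functional as F

# Check the block-gradient semantics on a single (sample, center) pair: with the
# sigmoid scale detached, d(loss)/dx must equal -r_intra * d(cos theta)/dx, with
# no contribution from the sigmoid's own derivative.
torch.manual_seed(0)
x = torch.randn(8, requires_grad=True)
w = F.normalize(torch.randn(8), dim=0)
s, k, a = 64.0, 80.0, 0.82

def cos_theta(x):
    return F.normalize(x, dim=0) @ w

cos = cos_theta(x)
theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
r = (s / (1.0 + torch.exp(-k * (theta - a)))).detach()

(grad_loss,) = torch.autograd.grad(-r * cos, x)      # gradient of the re-scaled term
(grad_cos,) = torch.autograd.grad(cos_theta(x), x)   # gradient of cos(theta) alone
print(torch.allclose(grad_loss, -r * grad_cos))      # expected: True
```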

4. Comparative Performance and Ablation Analysis

Empirical ablation indicates that sigmoid re-scaling of both gradient terms outperforms both constant and piecewise-hard-threshold alternatives in deep face recognition:

  • Constant scaling: 90.05% five-set mean accuracy
  • Piecewise (step function): 94.64%
  • Full sigmoid (SFace): 94.80%

The dual sigmoid approach maintains a larger gap in intra-class angular distribution between clean and noisy subsets (e.g., $\Delta \approx 7.6^\circ$ for SFace vs. $4.8^\circ$ for ArcFace), evidencing greater noise robustness (Zhong et al., 2022).

On synthetic "random-sphere" data, employing explicit $t$ and $b_\mathrm{rel}$ parameters leads to faster convergence and larger separation margins, thereby improving retrieval robustness in contrastive representation learning (Bangachev et al., 23 Sep 2025).

5. Extension to Multi-Modal and Contrastive Frameworks

In contrastive settings (e.g., SigLIP), the temperature ($t$) and bias ($b_\mathrm{rel}$) parameters implicitly act as global sigmoidal thresholds, shaping the distributions of intra- and inter-modal similarity scores. The concept of $(m, b_\mathrm{rel})$-Constellations provides a rigorous combinatorial-geometric underpinning: global minimizers correspond to configurations where all positive pairs are separated from negatives by a margin $m$ shifted by $b_\mathrm{rel}$.
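One concrete, simplified reading of this separation condition is sketched below; the symmetric use of the margin on both sides of $b_\mathrm{rel}$ is an illustrative assumption, not the paper's formal definition.

```python
import torch

def is_separated(U, V, m, b_rel):
    """Illustrative margin check in the spirit of an (m, b_rel)-constellation:
    every matched pair scores at least b_rel + m and every unmatched pair at
    most b_rel - m. (An assumed reading, not the formal definition.)"""
    sim = U @ V.t()
    off_diag = ~torch.eye(sim.size(0), dtype=torch.bool)
    return bool((sim.diagonal() >= b_rel + m).all() and (sim[off_diag] <= b_rel - m).all())
```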

In practice:

  • Large $t$ enforces sharp decision boundaries.
  • Trainable $b_\mathrm{rel}$ aligns positive and negative distributions, crucial for automatic separation across modalities and for closing the modality gap (Bangachev et al., 23 Sep 2025).
  • Freezing $b_\mathrm{rel}$ enables explicit control of the induced margin; adapter-like effects can be achieved without additional network parameters.

6. Robustness to Dataset Noise and Implementation Guidelines

Dual sigmoid-based losses excel in environments where class labels are imperfect or some samples are systematically noisy. Practical recommendations include tuning the intra-class sigmoid center $a$ rightward (toward higher angles) as noise increases, and maintaining a high $k$ for sharp control; the inter-class center $b$ is comparatively stable. This moderation prevents over-pulling of noisy or uncertain instances, and prevents over-pushing once inter-class separability is sufficient.

The block-gradient variant is essential: it avoids back-propagating through the scaling functions themselves, maintaining the intended per-sample re-weighting.

Implementation typically requires:

  • $\ell_2$ normalization of both features and class weights
  • Per-sample computation of angular distances and cosine similarities
  • Independent sigmoid re-scales with block-gradient semantics
  • Batchwise aggregation and update
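A minimal training step following this checklist might look like the sketch below, reusing the hypothetical sface_loss function from Section 1; the backbone, shapes, and optimizer settings are placeholders.

```python
import torch

# Illustrative shapes: 512-d embeddings, 1000 identities, batch of 128.
embed_dim, num_classes, batch = 512, 1000, 128

backbone = torch.nn.Linear(1024, embed_dim)            # stand-in for a real feature extractor
W = torch.nn.Parameter(torch.randn(num_classes, embed_dim))
opt = torch.optim.SGD(list(backbone.parameters()) + [W], lr=0.1, momentum=0.9)

inputs = torch.randn(batch, 1024)                      # stand-in for a data batch
labels = torch.randint(0, num_classes, (batch,))

opt.zero_grad()
loss = sface_loss(backbone(inputs), W, labels)         # normalization happens inside the loss
loss.backward()
opt.step()
```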

7. Theoretical Characterization and Geometric Limits

Under dual sigmoid-based regimes with free temperature and bias (contrastive), global minima are characterized by strict separation of positive and negative similarities, quantified by the $(m, b_\mathrm{rel})$-Constellation formalism. Combinatorial-geometric arguments (via spherical codes) relate the embedding dimension $d$, achievable margin $m$, and dataset cardinality $N$, prescribing fundamental limits on deployable capacity and separation. Such dual-sigmoid structures explain empirically observed modality gaps and inform architectural or regularization choices (Bangachev et al., 23 Sep 2025).
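A classical spherical-code style calculation (a general illustration of such limits, not the specific bound derived in the cited work) shows how pairwise separation alone already caps capacity: if unit vectors $u_1, \dots, u_N$ satisfy $\langle u_i, u_j \rangle \le -\varepsilon$ for all $i \neq j$ with $\varepsilon > 0$, then

$$0 \le \Bigl\lVert \sum_{i=1}^N u_i \Bigr\rVert^2 = N + \sum_{i \neq j} \langle u_i, u_j \rangle \le N - N(N-1)\,\varepsilon \quad\Longrightarrow\quad N \le 1 + \frac{1}{\varepsilon},$$

and a separate classical argument bounds $N \le d + 1$ in $\mathbb{R}^d$, illustrating how dimension and pairwise separation jointly limit the number of mutually repelling directions.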


The dual sigmoid-based loss function paradigm is now a core methodological tool in robust metric learning and representation alignment, facilitating nuanced control over intra-class compactness and inter-class separation, with demonstrated benefits for noise-robustness, convergence, and multi-modal retrieval accuracy (Zhong et al., 2022, Bangachev et al., 23 Sep 2025).
