Dual Sigmoid Loss for Face Recognition
- Dual Sigmoid Loss is a loss formulation that decouples the intra-class pull and inter-class push forces using parameterized sigmoid functions.
- It enables explicit control over embedding clusters by tuning sigmoid parameters, thereby mitigating overfitting in noisy data scenarios.
- Empirical results show that SFace with Dual Sigmoid Loss yields competitive accuracy and robustness on benchmarks like LFW, MegaFace, and IJB-C.
Dual Sigmoid Loss, as introduced in the SFace (“Sigmoid-Constrained Hypersphere Loss”) framework, is a loss formulation for deep face recognition that decouples the intra-class “pull” and inter-class “push” forces by modulating their gradient contributions with two independent, parameterized sigmoid functions. This construction allows for explicit control over where and how strongly each sample is encouraged to cluster with its class center or to separate from other classes on the unit hypersphere. By tuning these dual sigmoids, SFace mitigates overfitting, especially in the presence of noisy or low-quality training data, and achieves robust, discriminative face embeddings (Zhong et al., 2022).
1. Mathematical Formulation and Notation
The embedding $x_i$ of an input sample is produced by the network, and both $x_i$ and each class center $W_j$ (the columns of the last-layer weight matrix $W$) are $\ell_2$-normalized: $\|x_i\| = \|W_j\| = 1$. The angular similarity between $x_i$ and $W_j$ is given by $\cos\theta_j = W_j^{\top} x_i$, with $\theta_j = \arccos(W_j^{\top} x_i) \in [0, \pi]$.
The SFace loss for a sample $x_i$ with label $y_i$ is:

$$\mathcal{L}_{\text{SFace}} = \mathcal{L}_{\text{intra}}(x_i) + \mathcal{L}_{\text{inter}}(x_i)$$

where

$$\mathcal{L}_{\text{intra}}(x_i) = -\left[r_{\text{intra}}(\theta_{y_i})\right]_b \cos\theta_{y_i}$$

$$\mathcal{L}_{\text{inter}}(x_i) = \sum_{j \neq y_i} \left[r_{\text{inter}}(\theta_j)\right]_b \cos\theta_j$$
The $[\cdot]_b$ operator is the block-gradient operator, meaning the value is used in the forward pass but not differentiated through during backpropagation.
2. Dual Sigmoid Re-Scale Functions
SFace employs two independent sigmoid functions to modulate the intra-class and inter-class gradient scales as a function of the angle $\theta$:
- Intra-class (pull) re-scale:

$$r_{\text{intra}}(\theta_{y_i}) = \frac{s}{1 + e^{-k(\theta_{y_i} - a)}}$$

- $k$ controls sharpness; $a$ is the inflection ("onset") angle; $s$ is a global scale (commonly 64).
- For $\theta_{y_i} < a$, intra-class pull is suppressed; above $a$, it ramps up.
- Inter-class (push) re-scale:

$$r_{\text{inter}}(\theta_j) = \frac{s}{1 + e^{k(\theta_j - b)}}$$

- $k$ and $b$ parameterize the slope and the angular margin.
- For $\theta_j < b$, inter-class push is strong; for $\theta_j > b$, push is negligible.
These dual sigmoids allow fine-grained, decoupled control: one can enforce tight clustering for clean data (low $a$), but increase $a$ to halt the "pull" early in noisy scenarios and avoid overfitting.
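As a concrete sketch, the two re-scale functions can be written in a few lines of numpy. The default parameter values here ($s = 64$, $k = 80$, $a = 0.80$, $b = 1.20$) are illustrative choices for demonstration, not values mandated by the method:

```python
import numpy as np

def r_intra(theta, s=64.0, k=80.0, a=0.80):
    """Intra-class pull re-scale: near 0 for theta < a, saturating at s for theta > a."""
    return s / (1.0 + np.exp(-k * (theta - a)))

def r_inter(theta, s=64.0, k=80.0, b=1.20):
    """Inter-class push re-scale: near s for theta < b, decaying to 0 for theta > b."""
    return s / (1.0 + np.exp(k * (theta - b)))
```

A sample already close to its class center (small $\theta_{y_i}$) receives almost no pull, while a hard negative at a small angle still receives a near-full push; the two forces are tuned entirely independently.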
3. Gradient Modulation and Optimization Behavior
Only the cosine terms receive backpropagated gradients; the sigmoid re-scales act as fixed coefficients during each update. Specifically,

$$\frac{\partial \mathcal{L}_{\text{intra}}}{\partial x_i} = -\left[r_{\text{intra}}(\theta_{y_i})\right]_b \frac{\partial \cos\theta_{y_i}}{\partial x_i}$$

$$\frac{\partial \mathcal{L}_{\text{inter}}}{\partial x_i} = \sum_{j \neq y_i} \left[r_{\text{inter}}(\theta_j)\right]_b \frac{\partial \cos\theta_j}{\partial x_i}$$

Since $\partial \cos\theta / \partial \theta = -\sin\theta$, the gradient magnitudes with respect to $\theta_{y_i}$ and $\theta_j$ are

$$r_{\text{intra}}(\theta_{y_i}) \sin\theta_{y_i} \quad \text{and} \quad r_{\text{inter}}(\theta_j) \sin\theta_j$$

Thus, SFace directly determines the effective angular "pull" or "push" based on how far samples are from the decision margins defined by the chosen $a$ and $b$.
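The claim above can be checked numerically: with the re-scale factor frozen (mimicking the block-gradient), a finite-difference derivative of the intra-class term matches the analytic magnitude $r_{\text{intra}}(\theta)\sin\theta$. This is an illustrative numpy sketch with assumed parameter values:

```python
import numpy as np

# Intra-class term -r * cos(theta), with the re-scale r frozen to mimic
# the block-gradient [.]_b. Parameter values are illustrative.
s, k, a = 64.0, 80.0, 0.80
theta0 = 1.0
r0 = s / (1.0 + np.exp(-k * (theta0 - a)))   # treated as a constant

# Central finite-difference derivative of -r0 * cos(theta) at theta0:
eps = 1e-6
grad_fd = (-r0 * np.cos(theta0 + eps) + r0 * np.cos(theta0 - eps)) / (2 * eps)

# Analytic gradient magnitude stated in the text: r_intra(theta) * sin(theta)
grad_analytic = r0 * np.sin(theta0)
```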
4. Parameter Tuning and Loss Behavior Under Noise
Parameter selection critically shapes the embedding geometry:
- $a$: Sets the "target" intra-class angle. Higher $a$ increases noise tolerance by ceasing intra-class pull earlier, making embeddings looser; lower values enforce tighter clusters.
- $k$: Controls gradient sharpness at the intra-class boundary.
- $b$: Sets the angular margin for inter-class separation; typical values are around 1.2 radians (e.g., $b = 1.20$ in the table below).
- Slopes: Chosen so the transition region is narrow but not a step (e.g., $k \approx 80$).
In practice, higher $a$ values help SFace prevent overfitting when label noise increases, as noisy samples are not forcefully incorporated into incorrect clusters. On clean datasets, lower $a$ supports compact representations.
| Scenario | $a$ (example) | $b$ (example) | Impact |
|---|---|---|---|
| Clean (little noise) | 0.80 | 1.20 | Tight clusters, clear separation |
| Noisy (label noise grows) | 0.84 | 1.20 | Early pull-off, less overfitting to noise |
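To make the table concrete, the sketch below evaluates the intra-class re-scale at a sample angle $\theta = 0.82$ rad under the two example onset angles ($k = 80$ and $s = 64$ are assumed, illustrative values): with $a = 0.80$ the pull is already strong, while with $a = 0.84$ it is still mostly switched off.

```python
import numpy as np

def r_intra(theta, s=64.0, k=80.0, a=0.80):
    """Intra-class pull re-scale s * sigmoid(k * (theta - a))."""
    return s / (1.0 + np.exp(-k * (theta - a)))

theta = 0.82                          # sample just above the "clean" onset angle
pull_clean = r_intra(theta, a=0.80)   # clean setting: pull largely active
pull_noisy = r_intra(theta, a=0.84)   # noisy setting: pull still mostly off
```

At the same angle, the noisy-data setting applies only a fraction of the pull, so a mislabeled sample in this region is not dragged hard toward the wrong center.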
5. Comparison to Other Hypersphere Margin Losses
Traditional angular-margin losses (SphereFace, CosFace, ArcFace) impose margin constraints via additive or multiplicative angular shifts:
- SphereFace: multiplicative angular margin, $\cos(m\,\theta_{y_i})$;
- ArcFace: additive angular margin, $\cos(\theta_{y_i} + m)$;
- CosFace: additive cosine margin, shifting the target logit to $\cos\theta_{y_i} - m$.
Gradients in those schemes involve implicit, coupled re-scale factors (the softmax probabilities multiplying the target and non-target logits) that depend on all logits simultaneously and cannot be independently tuned. SFace's design offers explicit, independent re-scaling for the intra- and inter-class terms, and operates solely on single angles (not mixtures thereof), thus simplifying tuning and offering robust performance under non-ideal data conditions.
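For contrast, the target-logit transforms of the three margin losses can be sketched side by side (illustrative numpy sketch; the global scale factor $s$ is omitted for clarity):

```python
import numpy as np

def target_logit(cos_theta_y, variant, m):
    """Target-class logit under each angular-margin scheme (scale s omitted)."""
    theta = np.arccos(np.clip(cos_theta_y, -1.0, 1.0))
    if variant == "sphereface":   # multiplicative angular margin
        return np.cos(m * theta)
    if variant == "arcface":      # additive angular margin
        return np.cos(theta + m)
    if variant == "cosface":      # additive cosine margin
        return cos_theta_y - m
    raise ValueError(variant)
```

All three shrink the target logit to demand a margin; SFace instead leaves $\cos\theta_{y_i}$ untouched and reweights its gradient.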
6. Implementation Details
A standard SFace pipeline consists of:
- Network (e.g., ResNet50) generating a $d$-dimensional feature vector per sample, followed by $\ell_2$ normalization.
- Last-layer weights $W \in \mathbb{R}^{d \times C}$ for $C$ classes, with columns $\ell_2$-normalized per forward pass.
- For each sample, compute $\theta_{y_i}$ and all $\theta_j$ via $\theta_j = \arccos(W_j^{\top} x_i)$.
- Compute $r_{\text{intra}}(\theta_{y_i})$ and $r_{\text{inter}}(\theta_j)$ via the parameterized sigmoid functions, with gradients blocked.
- Loss assembly: $\mathcal{L}_{\text{SFace}} = -\left[r_{\text{intra}}(\theta_{y_i})\right]_b \cos\theta_{y_i} + \sum_{j \neq y_i}\left[r_{\text{inter}}(\theta_j)\right]_b \cos\theta_j$
- SGD updates are applied to both $W$ and the network parameters.
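The pipeline above can be sketched end to end in numpy. This is an illustrative sketch of the forward pass only; a real implementation would use an autodiff framework and detach the re-scale factors to realize the block-gradient:

```python
import numpy as np

def l2_normalize(v, axis=-1):
    """Scale vectors to unit Euclidean norm along the given axis."""
    return v / np.linalg.norm(v, axis=axis, keepdims=True)

def sface_forward(x, W, y, s=64.0, k=80.0, a=0.80, b=1.20):
    """Forward value of the SFace loss for one sample (numpy sketch).

    x : (d,) raw backbone embedding
    W : (d, C) last-layer weights; columns are class centers
    y : ground-truth class index
    """
    x = l2_normalize(x)                 # unit-norm embedding
    Wn = l2_normalize(W, axis=0)        # unit-norm class centers
    cos_theta = Wn.T @ x                # (C,) angular similarities
    theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))
    # Re-scale factors; gradients through these would be blocked ([.]_b).
    r_intra = s / (1.0 + np.exp(-k * (theta[y] - a)))
    r_inter = s / (1.0 + np.exp(k * (theta - b)))
    mask = np.arange(W.shape[1]) != y
    return -r_intra * cos_theta[y] + np.sum(r_inter[mask] * cos_theta[mask])
```

For a sample sitting exactly on its class center with all other centers orthogonal, both re-scales are essentially zero and the loss vanishes, as the dual-sigmoid design intends.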
7. Empirical Results and Significance
SFace demonstrates competitive or superior accuracy and robustness across diverse benchmarks:
- On MS1MV2 with ResNet100:
- LFW: 99.82% (ArcFace 99.83%)
- YTF: 98.06% (ArcFace 98.02%)
- MegaFace: Top-1 = 98.50% (ArcFace 98.35%), TAR@FAR = 98.61% (ArcFace 98.48%)
- IJB-C: 1:1 TAR@FAR=1e-5–1e-1 improved over ArcFace by 0.5–1%
- Under increasing label-noise (WebFace, 0–20%), SFace's accuracy degrades more gracefully than ArcFace or CosFace, as $a$ is increased to limit the effect of noisy samples.
- On IJB-A and IJB-C, SFace consistently outperforms ArcFace by 0.2–0.5% at low FAR and in Rank-1/TPIR metrics.
This suggests that the decoupled, sigmoidal gradient modulation offers a robust means to balance discriminative training against overfitting, particularly with imperfect data (Zhong et al., 2022).