
Dual Sigmoid Loss for Face Recognition

Updated 12 February 2026
  • Dual Sigmoid Loss is a loss formulation that decouples the intra-class pull and inter-class push forces using parameterized sigmoid functions.
  • It enables explicit control over embedding clusters by tuning sigmoid parameters, thereby mitigating overfitting in noisy data scenarios.
  • Empirical results show that SFace with Dual Sigmoid Loss yields competitive accuracy and robustness on benchmarks like LFW, MegaFace, and IJB-C.

Dual Sigmoid Loss, as introduced in the SFace (“Sigmoid-Constrained Hypersphere Loss”) framework, is a loss formulation for deep face recognition that decouples the intra-class “pull” and inter-class “push” forces by modulating their gradient contributions with two independent, parameterized sigmoid functions. This construction allows for explicit control over where and how strongly each sample is encouraged to cluster with its class center or to separate from other classes on the unit hypersphere. By tuning these dual sigmoids, SFace mitigates overfitting, especially in the presence of noisy or low-quality training data, and achieves robust, discriminative face embeddings (Zhong et al., 2022).

1. Mathematical Formulation and Notation

The embedding $x_i \in \mathbb{R}^d$ of input $X_i$ is produced by the network, and both $x_i$ and each class center $W_j$ (the columns of the last-layer weight matrix $W \in \mathbb{R}^{d \times C}$) are $\ell_2$-normalized: $\|x_i\| = \|W_j\| = 1$. The angular similarity between $x_i$ and $W_j$ is given by $\cos\theta_j = W_j^\top x_i$, where $\theta_j \in [0, \pi]$ is the angle between $x_i$ and $W_j$.

The SFace loss for a sample $X_i$ with label $y_i$ is:

$$L_i = L_{\text{intra}}(\theta_{y_i}) + L_{\text{inter}}(\theta_j)$$

where

$$L_{\text{intra}}(\theta_{y_i}) = -[r_{\text{intra}}(\theta_{y_i})]_b \cos\theta_{y_i}$$

$$L_{\text{inter}}(\theta_j) = \sum_{j \neq y_i} [r_{\text{inter}}(\theta_j)]_b \cos\theta_j$$

The $[\cdot]_b$ operator is the block-gradient: the value is used in the forward pass but not differentiated through during backpropagation.
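As a minimal numpy sketch, the per-sample loss above can be computed directly from the cosines to all $C$ class centers. The re-scale coefficients are plain constants here, which mirrors the $[\cdot]_b$ block-gradient (in an autograd framework they would be detached); the hyperparameter defaults `s=64, k=80, a=0.80, b=1.20` are illustrative assumptions, not values fixed by this text:

```python
import numpy as np

def sface_loss(cos_theta, y, s=64.0, k=80.0, a=0.80, b=1.20):
    """SFace loss for one sample, given cosines to all C class centers.

    cos_theta: shape (C,), cos(theta_j) = W_j^T x_i for the normalized embedding.
    y: index of the ground-truth class.
    r_intra / r_inter are treated as constants (the [.]_b block-gradient).
    """
    theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))
    r_intra = s / (1.0 + np.exp(-k * (theta[y] - a)))        # pull strength
    mask = np.arange(len(cos_theta)) != y
    r_inter = s / (1.0 + np.exp(k * (theta[mask] - b)))      # push strengths
    return -r_intra * cos_theta[y] + np.sum(r_inter * cos_theta[mask])
```

For a sample already close to its center and far from all others, both re-scales are near zero and the loss vanishes; for a sample far from its center or close to a wrong one, the corresponding term dominates.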

2. Dual Sigmoid Re-Scale Functions

SFace employs two independent sigmoid functions to modulate the intra-class and inter-class gradient scales as a function of the angle $\theta$:

  • Intra-class (pull) re-scale:

$$r_{\text{intra}}(\theta_{y_i}) = \frac{s}{1 + e^{-k(\theta_{y_i} - a)}}$$

  • $k$ controls sharpness; $a$ is the inflection (“onset”) angle; $s$ is a global scale (commonly 64).
  • For $\theta_{y_i} < a$, intra-class pull is suppressed; above $a$, it ramps up.
  • Inter-class (push) re-scale:

$$r_{\text{inter}}(\theta_j) = \frac{s}{1 + e^{k(\theta_j - b)}}$$

  • $k$ and $b$ parameterize the slope and the margin angle.
  • For $\theta_j < b$, inter-class push is strong; for $\theta_j > b$, push is negligible.
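The switching behavior of the two re-scales can be checked numerically; a small sketch, using the commonly cited defaults $s=64$, $k=80$ and the example thresholds $a=0.80$, $b=1.20$ (assumed values, not prescribed by this text):

```python
import numpy as np

s, k = 64.0, 80.0   # global scale and slope (assumed defaults)
a, b = 0.80, 1.20   # intra onset angle, inter margin angle (example values)

def r_intra(theta):
    # Pull ramps up once the sample drifts past 'a' radians from its center.
    return s / (1.0 + np.exp(-k * (theta - a)))

def r_inter(theta):
    # Push is strong while a sample sits within 'b' radians of a wrong center.
    return s / (1.0 + np.exp(k * (theta - b)))

print(r_intra(0.60), r_intra(1.00))   # ~0 (suppressed) vs ~s (full pull)
print(r_inter(1.40), r_inter(1.00))   # ~0 (negligible) vs ~s (full push)
```

With $k = 80$, the transition from 0 to $s$ happens over only a few hundredths of a radian around each threshold, which is what makes the thresholds act like soft decision margins.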

These dual sigmoids allow fine-grained, decoupled control: one can enforce tight clustering for clean data (low $a$), but increase $a$ to halt “pull” early in noisy scenarios and avoid overfitting.

3. Gradient Modulation and Optimization Behavior

Only the cosine terms receive backpropagated gradients; the sigmoid re-scales act as fixed coefficients during each update. Specifically,

$$\frac{\partial L_i}{\partial x_i} = -[r_{\text{intra}}(\theta_{y_i})]_b\, \frac{\partial \cos\theta_{y_i}}{\partial x_i} + \sum_{j \neq y_i} [r_{\text{inter}}(\theta_j)]_b\, \frac{\partial \cos\theta_j}{\partial x_i}$$

$$\frac{\partial L_i}{\partial W_{y_i}} = -[r_{\text{intra}}(\theta_{y_i})]_b\, \frac{\partial \cos\theta_{y_i}}{\partial W_{y_i}}, \qquad \frac{\partial L_i}{\partial W_j} = [r_{\text{inter}}(\theta_j)]_b\, \frac{\partial \cos\theta_j}{\partial W_j} \quad (j \neq y_i)$$

Since $\|x_i\| = \|W_j\| = 1$, the gradient magnitudes with respect to $x_i$ and $W_j$ scale as

$$\left\|\frac{\partial L_i}{\partial W_{y_i}}\right\| \propto r_{\text{intra}}(\theta_{y_i}), \qquad \left\|\frac{\partial L_i}{\partial W_j}\right\| \propto r_{\text{inter}}(\theta_j) \quad (j \neq y_i).$$

Thus, SFace directly determines the effective angular “pull” or “push” based on how far samples are from the decision margins defined by the chosen $a$ and $b$.
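The proportionality can be illustrated numerically: with the re-scale held fixed (the block-gradient) and unit-norm vectors, the update magnitude on the target center equals the re-scale itself, since $\partial \cos\theta_{y_i} / \partial W_{y_i} = x_i$ for exactly unit vectors. A sketch with assumed defaults $s=64$, $k=80$, $a=0.80$ and arbitrary random unit vectors:

```python
import numpy as np

s, k, a = 64.0, 80.0, 0.80          # assumed SFace-style defaults
rng = np.random.default_rng(0)

# Random unit embedding and target-class center (d = 8, illustrative only).
x = rng.normal(size=8); x /= np.linalg.norm(x)
w = rng.normal(size=8); w /= np.linalg.norm(w)

theta = np.arccos(np.clip(w @ x, -1.0, 1.0))
r = s / (1.0 + np.exp(-k * (theta - a)))     # blocked: treated as a constant

# With r fixed, dL_intra/dW_y = -r * d(cos theta)/dW_y = -r * x,
# so the gradient magnitude on the center is exactly r (since ||x|| = 1).
grad_w = -r * x
print(np.linalg.norm(grad_w))
```

This is what Section 3 means by the sigmoids acting as fixed per-sample learning-rate multipliers on the angular updates.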

4. Parameter Tuning and Loss Behavior Under Noise

Parameter selection critically shapes the embedding geometry:

  • $a$: Sets the “target” intra-class angle. Higher $a$ increases noise tolerance by ceasing intra-class pull earlier, making embeddings looser; lower values enforce tighter clusters.
  • $k$ (intra): Controls gradient sharpness at the intra-class boundary.
  • $b$: Sets the angular margin for inter-class separation; typical values are around 1.2–1.3 radians ($\approx 70°$–$75°$).
  • $k$ (inter): Slopes are chosen so the transition region is narrow but not a step (e.g., $k = 80$).

In practice, higher $a$ values help SFace prevent overfitting when label noise increases, as noisy samples are not forcefully incorporated into incorrect clusters. On clean datasets, lower $a$ supports compact representations.

| Scenario | $a$ (example) | $b$ (example) | Impact |
|---|---|---|---|
| Clean (little noise) | 0.80 | 1.20 | Tight clusters, clear separation |
| Noisy (label noise grows) | 0.84 | 1.20 | Early pull-off, less overfitting to noise |
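The effect of the two $a$ settings in the table can be seen at a single borderline angle. A sketch assuming $s=64$, $k=80$ and a hypothetical sample at $\theta = 0.82$ (e.g., a mislabeled or low-quality face):

```python
import numpy as np

s, k = 64.0, 80.0   # assumed defaults

def r_intra(theta, a):
    return s / (1.0 + np.exp(-k * (theta - a)))

theta = 0.82                          # a borderline sample angle (illustrative)
pull_clean = r_intra(theta, a=0.80)   # clean setting: pull already ramping up
pull_noisy = r_intra(theta, a=0.84)   # noisy setting: pull still mostly suppressed
print(pull_clean, pull_noisy)
```

Raising $a$ by only 0.04 radians sharply reduces how hard this borderline sample is dragged toward its (possibly wrong) class center, which is the noise-robustness mechanism described above.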

5. Comparison to Other Hypersphere Margin Losses

Traditional angular-margin losses (SphereFace, CosFace, ArcFace) impose margin constraints via additive or multiplicative angular shifts:

  • SphereFace: multiplicative angular margin, $\cos(m\,\theta_{y_i})$,
  • ArcFace: additive angular margin, $\cos(\theta_{y_i} + m)$,
  • CosFace: additive cosine margin, shifting the target logit to $\cos\theta_{y_i} - m$.

Gradients in those schemes involve implicit, coupled re-scale factors (the softmax probabilities, $1 - p_{y_i}$ for the target class and $p_j$ for the others) that depend on all logits simultaneously and cannot be independently tuned. SFace’s design offers explicit, independent re-scaling for intra- and inter-class terms, and operates solely on single angles (not mixtures thereof), thus simplifying tuning and offering robust performance under non-ideal data conditions.
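For contrast, a brief sketch of how the ArcFace and CosFace margins modify the target-class logit before softmax (margin values $m$ are common choices from the literature, assumed here, not taken from this text):

```python
import numpy as np

m_arc, m_cos, s = 0.5, 0.35, 64.0    # commonly used margins/scale (assumed)
theta = np.arccos(0.8)               # angle to the ground-truth center

logit_plain = s * np.cos(theta)            # plain softmax baseline
logit_arc   = s * np.cos(theta + m_arc)    # ArcFace: additive angular margin
logit_cos   = s * (np.cos(theta) - m_cos)  # CosFace: additive cosine margin

# Both margins lower the target logit, forcing a tighter decision boundary,
# but the gradient re-scaling stays coupled through the softmax over all logits.
print(logit_plain, logit_arc, logit_cos)
```

SFace skips the margin-shifted softmax entirely and instead scales each cosine term's gradient directly, which is what decouples the pull and push forces.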

6. Implementation Details

A standard SFace pipeline consists of:

  • Network (e.g., ResNet50) generating a $d$-dimensional feature vector (typically $d = 512$) per sample, followed by $\ell_2$ normalization.
  • Last-layer weights $W \in \mathbb{R}^{d \times C}$, with columns $\ell_2$-normalized per forward pass.
  • For each sample, compute $\cos\theta_j = W_j^\top x_i$ and $\theta_j$ via $\arccos$.
  • Compute $r_{\text{intra}}(\theta_{y_i})$ and $r_{\text{inter}}(\theta_j)$ via the parameterized sigmoid functions, with gradients blocked.
  • Loss assembly: $L_i = -[r_{\text{intra}}(\theta_{y_i})]_b \cos\theta_{y_i} + \sum_{j \neq y_i} [r_{\text{inter}}(\theta_j)]_b \cos\theta_j$
  • SGD updates are applied to both $W$ and the network parameters.
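The steps above can be sketched as a batched forward pass in numpy (an autograd framework would additionally detach the two re-scale tensors; hyperparameter defaults are illustrative assumptions):

```python
import numpy as np

def sface_batch_loss(features, W, labels, s=64.0, k=80.0, a=0.80, b=1.20):
    """Batched SFace forward pass (numpy sketch, assumed default hyperparameters).

    features: (N, d) raw embeddings;  W: (d, C) class centers;  labels: (N,)
    """
    x = features / np.linalg.norm(features, axis=1, keepdims=True)   # l2-normalize rows
    Wn = W / np.linalg.norm(W, axis=0, keepdims=True)                # l2-normalize columns
    cos = x @ Wn                                                     # (N, C) cosines
    theta = np.arccos(np.clip(cos, -1.0, 1.0))

    n = np.arange(len(labels))
    r_intra = s / (1.0 + np.exp(-k * (theta[n, labels] - a)))        # blocked in autograd
    r_inter = s / (1.0 + np.exp(k * (theta - b)))
    r_inter[n, labels] = 0.0                                         # exclude target class

    per_sample = -r_intra * cos[n, labels] + np.sum(r_inter * cos, axis=1)
    return per_sample.mean()
```

When every embedding is aligned with its own class center the loss is essentially zero (both re-scales switch off), while misaligned embeddings produce a large positive loss from the push term.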

7. Empirical Results and Significance

SFace demonstrates competitive or superior accuracy and robustness across diverse benchmarks:

  • On MS1MV2 with ResNet100, using the paper’s reported hyperparameter settings:
    • LFW: 99.82% (ArcFace 99.83%)
    • YTF: 98.06% (ArcFace 98.02%)
    • MegaFace: Top-1 = 98.50% (ArcFace 98.35%), TAR@FAR = 98.61% (ArcFace 98.48%)
    • IJB-C: 1:1 TAR@FAR=1e-5–1e-1 improved over ArcFace by 0.5–1%
  • Under increasing label noise (WebFace, 0–20%), SFace’s accuracy degrades more gracefully than ArcFace’s or CosFace’s, as $a$ is increased to limit the effect of noisy samples.
  • On IJB-A and IJB-C, SFace consistently outperforms ArcFace by 0.2–0.5% at low FAR and in Rank-1/TPIR metrics.

This suggests that the decoupled, sigmoidal gradient modulation offers a robust means to balance discriminative training against overfitting, particularly with imperfect data (Zhong et al., 2022).

