
SphereFace, CosFace, and ArcFace: Angular Margin Loss

Updated 15 April 2026
  • SphereFace, CosFace, and ArcFace are angular-margin softmax losses that improve face recognition by enforcing stricter angular margins between classes through multiplicative or additive adjustments.
  • They operate on L2-normalized features mapped onto a hypersphere, unifying geometry to optimize intra-class compactness and inter-class separability across diverse benchmarks.
  • Empirical results on datasets like VGGFace2 and MS-Celeb-1M confirm their effectiveness, with SphereFace excelling in low false-accept-rate regimes and CosFace and ArcFace offering enhanced training stability.

SphereFace, CosFace, and ArcFace are foundational angular-margin softmax losses designed to enhance intra-class compactness and inter-class separability for deep face recognition. Under a unified hyperspherical framework, each variant differentiates itself through the specific manipulation of angular margins in the normalized softmax cross-entropy loss, directly impacting the learning geometry and stability of face representation models. Their effect is controlled via margin functions applied in the angular space between $L_2$-normalized features and class-wise weight vectors, inducing target disambiguation in an open-set identification regime (Zheng et al., 24 Sep 2025, Liu et al., 2021).

1. Unified Hyperspherical Loss Framework

These methods operate within the normalized softmax loss, where both feature vectors and class proxies are $L_2$-normalized onto a hypersphere. The loss can be written as:

$$\mathcal{L} = -\frac{1}{N}\!\sum_{i=1}^{N} \log \frac{\exp(s\,\psi(\theta_{y_i}))}{\exp(s\,\psi(\theta_{y_i}))+\sum_{j \neq y_i}\exp(s\,\eta(\theta_j))}$$

where:

  • $\theta_j$ is the angle between feature $\mathbf{x}_i$ and weight vector $\mathbf{W}_j$, with $\|\mathbf{x}_i\| = \|\mathbf{W}_j\| = 1$
  • $s$ is a learnable scaling factor
  • $\psi$ and $\eta$ are angular activation functions for the target and non-target classes, respectively

A margin function $\Delta(\theta) = \eta(\theta) - \psi(\theta)$ encapsulates the additional angular penalty. The sufficient margin condition $\Delta(\theta) > 0$ over the angular range of interest enforces compactness and class separability (Liu et al., 2021).
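The per-sample term of this loss can be sketched in plain Python to show where the two activation functions enter. This is a minimal illustration, not the authors' implementation: the function name `unified_margin_loss`, the scale `s=30.0`, the example angles, and the margin `0.35` are all assumptions chosen for demonstration.

```python
import math

def unified_margin_loss(thetas, target, psi, eta=math.cos, s=30.0):
    """Per-sample loss: negative log-softmax over scaled angular activations.

    thetas: angles (radians) between one normalized feature and each class proxy.
    target: index of the ground-truth class.
    psi / eta: angular activations for the target / non-target classes.
    s: scale factor (30.0 is an illustrative value, not taken from the text).
    """
    logits = [s * (psi(t) if j == target else eta(t)) for j, t in enumerate(thetas)]
    # Log-sum-exp with max subtraction for numerical stability.
    top = max(logits)
    log_norm = top + math.log(sum(math.exp(z - top) for z in logits))
    return log_norm - logits[target]

thetas = [0.5, 1.2, 1.6]  # hypothetical angles; the target class is index 0
plain = unified_margin_loss(thetas, 0, psi=math.cos)
with_margin = unified_margin_loss(thetas, 0, psi=lambda t: math.cos(t) - 0.35)
print(plain, with_margin)
```

With `psi` equal to `eta` the expression reduces to the normalized softmax loss; any `psi` that lowers the target logit raises the loss for the same angles, which is how the margin keeps pressure on samples that are already classified correctly.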

2. SphereFace: Multiplicative Angular Margin

Formulation: SphereFace imposes a multiplicative angular margin, replacing $\theta$ with $m\theta$ in the cosine similarity: $\psi(\theta) = \cos(m\theta)$ for $\theta \in [0, \pi/m]$, ensuring strict monotonicity.

Decision Boundary: The classification rule $\cos(m\theta_{y_i}) > \cos(\theta_j)$ for $j \neq y_i$ geometrically increases the effective separation between classes on the hypersphere.

Characteristics:

  • The similarity curve $\psi(\theta) = \cos(m\theta)$ oscillates and requires piecewise correction for monotonicity.
  • The gradient $\partial\psi/\partial\theta = -m\sin(m\theta)$ exhibits multiple zero-crossings, leading to training instability, especially for large $m$.
  • Empirically, training with large $m$ can cause oscillation and collapse, requiring stabilization techniques (Zheng et al., 24 Sep 2025, Liu et al., 2021).
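The piecewise correction mentioned above can be made concrete. The sketch below uses the form from the original SphereFace formulation, $\psi(\theta) = (-1)^k \cos(m\theta) - 2k$ on $\theta \in [k\pi/m, (k+1)\pi/m]$, with an integer multiplicative margin $m$; the function name and the choice $m = 4$ are illustrative assumptions.

```python
import math

def sphereface_psi(theta, m=4):
    """Monotonic extension of cos(m * theta) to theta in [0, pi]."""
    # k indexes the interval [k*pi/m, (k+1)*pi/m] that contains theta.
    k = min(int(theta * m / math.pi), m - 1)
    return (-1) ** k * math.cos(m * theta) - 2 * k

# Sample the curve on a grid over [0, pi] and compare with the raw cosine.
grid = [i * math.pi / 200 for i in range(201)]
values = [sphereface_psi(t) for t in grid]
raw = [math.cos(4 * t) for t in grid]
print(values[0], values[-1])
```

The corrected curve falls from 1 at $\theta = 0$ to $-(2m - 1)$ at $\theta = \pi$ without turning back, whereas the raw $\cos(m\theta)$ rises and falls several times over the same range.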

3. CosFace: Additive Cosine Margin

Formulation: CosFace introduces an additive margin subtracted outside the cosine function: $\psi(\theta) = \cos\theta - m$

Decision Boundary: The rule $\cos\theta_{y_i} - m > \cos\theta_j$ forces a fixed gap in cosine space between classes, with the gap magnitude set by $m$.

Characteristics:

  • The similarity curve is a vertical shift, $\psi(\theta) = \cos\theta - m$, which remains monotonic for $\theta \in [0, \pi]$.
  • The gradient is $\partial\psi/\partial\theta = -\sin\theta$, unaffected by $m$.
  • Offers high training stability, but the penalty is spatially uniform and insensitive to intra-class sample positions (Zheng et al., 24 Sep 2025, Liu et al., 2021).
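A short numerical check illustrates both properties: the margin rule rejects pairs that plain cosine comparison would accept, and the slope of the target activation does not depend on the margin. The helper names and the margin values `0.1`, `0.35`, and `0.5` are assumptions for illustration only.

```python
import math

def cosface_psi(theta, m=0.35):
    """Target-class activation: cosine shifted down by m."""
    return math.cos(theta) - m

def satisfies_margin(theta_target, theta_other, m=0.35):
    """True when the target beats a competing class by more than m in cosine."""
    return cosface_psi(theta_target, m) > math.cos(theta_other)

def slope(f, theta, h=1e-6):
    """Central finite-difference estimate of df/dtheta."""
    return (f(theta + h) - f(theta - h)) / (2 * h)

theta = 0.9
slope_small = slope(lambda t: cosface_psi(t, 0.1), theta)
slope_large = slope(lambda t: cosface_psi(t, 0.5), theta)

# Angles 0.9 vs 1.1 are separated under plain cosine, but not by the margin.
print(math.cos(0.9) > math.cos(1.1), satisfies_margin(0.9, 1.1))
print(slope_small, slope_large, -math.sin(theta))
```

Because the shift is a constant, every sample receives the same penalty regardless of how close it already sits to its class proxy, which is the uniformity noted in the last bullet.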

4. ArcFace: Additive Angular Margin

Formulation: ArcFace enforces an angular margin by augmenting the argument inside the cosine: $\psi(\theta) = \cos(\theta + m)$

Decision Boundary: The rule $\cos(\theta_{y_i} + m) > \cos\theta_j$ translates to $\theta_{y_i} + m < \theta_j$ (principal range), reflecting an angular offset for the target class.

Characteristics:

  • The similarity curve is a left-shift, $\psi(\theta) = \cos(\theta + m)$.
  • Monotonicity is preserved only for $\theta \in [0, \pi - m]$; if $m$ is too large, non-monotonic intervals and negative gradients appear.
  • The gradient $\partial\psi/\partial\theta = -\sin(\theta + m)$ may introduce conflicting gradient directions for certain parameterizations (Zheng et al., 24 Sep 2025, Liu et al., 2021).
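The monotonicity limit can be seen by evaluating the additive angular margin activation on either side of $\theta = \pi - m$. The snippet is a sketch with an assumed margin of `0.5` radians and hand-picked sample angles.

```python
import math

def arcface_psi(theta, m=0.5):
    """Target-class activation: cosine of the angle shifted by m."""
    return math.cos(theta + m)

m = 0.5
# Angles up to pi - m: the activation keeps decreasing.
inside = [arcface_psi(t, m) for t in (0.0, 1.0, 2.0, math.pi - m)]
# Angles past pi - m: the activation turns around and increases.
beyond = [arcface_psi(t, m) for t in (math.pi - m, 3.0, math.pi)]
print(inside)
print(beyond)
```

Past $\pi - m$ a larger angle yields a larger target logit, so the loss would reward moving a sample further from its class proxy. This is the non-monotonic interval the bullet above warns about.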

5. Comparative Analysis

The table below summarizes the core properties of SphereFace, CosFace, and ArcFace based on their penalty mechanisms, similarity curves, and stability:

| Method | $\psi(\theta)$ | Penalty Pattern ($\theta \in [0, \pi]$) | Stability Features |
|---|---|---|---|
| SphereFace | $\cos(m\theta)$ (piecewise corrected) | Small penalty near $\theta = 0$, large near $\theta = \pi$ | Oscillating similarity & gradient; unstable for large $m$ |
| CosFace | $\cos\theta - m$ | Uniform penalty | Stable, but penalty focus is fixed (same $m$ at every $\theta$) |
| ArcFace | $\cos(\theta + m)$ | Roughly uniform; endpoint emphasis | Monotonic only for $\theta \le \pi - m$; negative gradients possible if $m$ too large |

Implications: Multiplicative margins are more “geometric” and adaptive across the hypersphere, while additive margins (CosFace, ArcFace) favor operational stability but with limited flexibility in margin spatial distribution (Liu et al., 2021).
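To compare penalty patterns numerically, one can tabulate the gap between the plain cosine and each target activation, $\cos\theta - \psi(\theta)$, at a few angles. The margin values (`m=4`, `0.35`, `0.5`) and the sampled angles are illustrative assumptions, and the SphereFace curve uses the piecewise form $(-1)^k\cos(m\theta) - 2k$ from the original SphereFace formulation.

```python
import math

def sphereface_psi(theta, m=4):
    """Piecewise-corrected cos(m * theta), monotonic on [0, pi]."""
    k = min(int(theta * m / math.pi), m - 1)
    return (-1) ** k * math.cos(m * theta) - 2 * k

def penalty(psi, theta):
    """Gap between the plain cosine and the margin-adjusted activation."""
    return math.cos(theta) - psi(theta)

methods = {
    "SphereFace (m=4)": lambda t: sphereface_psi(t, 4),
    "CosFace (m=0.35)": lambda t: math.cos(t) - 0.35,
    "ArcFace (m=0.5)": lambda t: math.cos(t + 0.5),
}
angles = [0.0, math.pi / 4, math.pi / 2, 3 * math.pi / 4]
table = {
    name: [round(penalty(psi, t), 3) for t in angles]
    for name, psi in methods.items()
}
for name, row in table.items():
    print(f"{name:18s}", row)
```

In this sketch the multiplicative penalty starts at zero and grows with the angle, the additive cosine penalty is the same at every angle, and the additive angular penalty varies mildly across the sampled range.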

6. Optimization and Stability: Characteristic Gradient Detachment and Feature Normalization

Training instability in margin-based angular losses primarily arises from complex or oscillatory margin functions with nontrivial derivatives. The “characteristic gradient detachment” (CGD) method ensures stable training by detaching the margin function $\Delta(\theta)$ from backpropagation in SphereFace-R, making the angular gradient resemble that of the basic normalized softmax loss.
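One way to read this description is that the target logit is written as the plain cosine minus a margin term treated as a constant during backpropagation. The sketch below mimics that with an explicit (value, gradient) pair instead of an autograd framework. It is an assumption about the general shape of the idea, not the exact SphereFace-R formulation, and the function names are hypothetical.

```python
import math

def sphereface_psi(theta, m=4):
    """Piecewise-corrected cos(m * theta), monotonic on [0, pi]."""
    k = min(int(theta * m / math.pi), m - 1)
    return (-1) ** k * math.cos(m * theta) - 2 * k

def detached_target_logit(theta, psi):
    """Return (forward value, d/dtheta used in backprop) under a detach-style rule.

    The logit is cos(theta) - stop_gradient(delta), with
    delta = cos(theta) - psi(theta). The forward value therefore equals
    psi(theta), while only the cos(theta) term contributes a gradient.
    """
    delta = math.cos(theta) - psi(theta)  # treated as a constant in backprop
    value = math.cos(theta) - delta
    grad = -math.sin(theta)
    return value, grad

value, grad = detached_target_logit(1.0, sphereface_psi)
print(value, grad)
```

In an autograd framework the same effect would come from applying its stop-gradient operation to the margin term, so the forward pass keeps the margin while the backward pass sees only the cosine.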

Feature normalization schemes impact the learned representation:

  • No feature normalization (NFN): $\|\mathbf{x}\|$ unconstrained.
  • Hard feature normalization (HFN): $\|\mathbf{x}\| = 1$ enforced.
  • Soft feature normalization (SFN): Penalty regularizes $\|\mathbf{x}\|$ toward a fixed target norm, retaining magnitude information.
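The difference between the hard and soft schemes can be sketched as follows. The squared-deviation penalty, the `target` norm, and the `weight` coefficient are assumptions made for illustration; the text does not specify the exact form of the soft penalty.

```python
import math

def norm(x):
    """Euclidean norm of a feature vector given as a list of floats."""
    return math.sqrt(sum(v * v for v in x))

def hard_normalize(x):
    """HFN-style: project the feature onto the unit hypersphere."""
    n = norm(x)
    return [v / n for v in x]

def soft_norm_penalty(x, target=1.0, weight=0.01):
    """SFN-style (assumed form): penalize deviation of the norm from a target."""
    return weight * (norm(x) - target) ** 2

feature = [3.0, 4.0]  # hypothetical unnormalized feature with norm 5
print(norm(hard_normalize(feature)))
print(soft_norm_penalty(feature, target=1.0, weight=0.01))
```

Hard normalization discards the magnitude entirely, while the soft penalty leaves the feature unchanged in the forward pass and only adds a loss term that nudges its norm toward the target.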

Empirical results confirm that applying CGD and SFN with SphereFace-R eliminates oscillating loss trajectories and enables convergence matching the best additive-margin schemes (Liu et al., 2021).

7. Empirical Performance and Practical Impact

Experiments across VGGFace2, MS-Celeb-1M, MegaFace, and IJB benchmarks demonstrate the following trends:

  • SphereFace with proper normalization and CGD is competitive or superior in low false-accept-rate (FAR) regimes, especially with soft normalization.
  • ArcFace and CosFace maintain robust performance with high training stability (Liu et al., 2021).

Overall, SphereFace, CosFace, and ArcFace formalize the margin-based softmax landscape for hyperspherical face recognition, each balancing geometric margin strength, focus of penalty, and optimization stability within a unified framework (Zheng et al., 24 Sep 2025, Liu et al., 2021).

References (2)
