SphereFace, CosFace, and ArcFace: Angular Margin Loss

Updated 15 April 2026

SphereFace, CosFace, and ArcFace are angular-margin softmax losses that improve face recognition by enforcing stricter angular margins between classes through multiplicative or additive adjustments.
They operate on L2-normalized features mapped onto a hypersphere, which gives a shared geometry for optimizing intra-class compactness and inter-class separability across diverse benchmarks.
Empirical results on datasets like VGGFace2 and MS-Celeb-1M confirm their effectiveness, with SphereFace excelling in low false-accept-rate regimes and CosFace and ArcFace offering enhanced training stability.

SphereFace, CosFace, and ArcFace are foundational angular-margin softmax losses designed to enhance intra-class compactness and inter-class separability for deep face recognition. Under a unified hyperspherical framework, each variant differentiates itself through the specific manipulation of angular margins in the normalized softmax cross-entropy loss, directly impacting the learning geometry and stability of face representation models. Their effect is controlled via margin functions applied in the angular space between $L_2$-normalized features and class-wise weight vectors, inducing target disambiguation in an open-set identification regime (Zheng et al., 24 Sep 2025, Liu et al., 2021).

1. Unified Hyperspherical Loss Framework

These methods operate within the normalized softmax loss, where both feature vectors and class proxies are $L_2$-normalized onto a hypersphere. The loss can be written as: $\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp(s\,\psi(\theta_{y_i}))}{\exp(s\,\psi(\theta_{y_i}))+\sum_{j \neq y_i}\exp(s\,\eta(\theta_j))}$ where:

$\theta_j$ is the angle between feature $\mathbf{x}_i$ and weight vector $\mathbf{W}_j$ , $\|\mathbf{x}_i\| = \|\mathbf{W}_j\| = 1$
$s$ is a learnable scaling factor
$\psi$ and $\eta$ are angular activation functions for the target and non-target classes, respectively

A margin function $\Delta(\theta) = \eta(\theta) - \psi(\theta)$ encapsulates the additional angular penalty. The sufficient margin condition $\Delta(\theta) > 0$ for $\theta \in [0, \pi]$ enforces compactness and class separability (Liu et al., 2021).
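The unified loss can be sketched in a few lines of NumPy. This is a minimal illustration rather than any paper's reference implementation: the function name `margin_softmax_loss`, the default `s=30.0`, and the random demo data are assumptions made for the example. Passing `np.cos` for both $\psi$ and $\eta$ recovers the plain normalized softmax loss.

```python
import numpy as np

def margin_softmax_loss(features, weights, labels, psi, eta, s=30.0):
    """Mean cross-entropy with s*psi(theta) on the target class and s*eta(theta) elsewhere."""
    # L2-normalize features and class weights (one row per sample / per class).
    x = features / np.linalg.norm(features, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    theta = np.arccos(np.clip(x @ w.T, -1.0, 1.0))  # angles, shape (N, C)

    rows = np.arange(len(labels))
    logits = s * eta(theta)                          # non-target activation
    logits[rows, labels] = s * psi(theta[rows, labels])  # target activation

    # Numerically stable log-softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[rows, labels].mean()

# Hypothetical demo data: 8 samples, 16-dim features, 5 classes.
rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 16))
W = rng.normal(size=(5, 16))
y = rng.integers(0, 5, size=8)

baseline = margin_softmax_loss(feats, W, y, psi=np.cos, eta=np.cos)
with_margin = margin_softmax_loss(feats, W, y, psi=lambda t: np.cos(t) - 0.35, eta=np.cos)
```

Because a margin only lowers the target logit, `with_margin` is larger than `baseline` on the same data: the network has to pull features closer to their class proxy to reach the same loss.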

2. SphereFace: Multiplicative Angular Margin

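Formulation: SphereFace imposes a multiplicative angular margin, replacing $\theta_{y_i}$ with $m\theta_{y_i}$ in the cosine similarity: $\psi(\theta) = (-1)^k \cos(m\theta) - 2k$ for $\theta \in \left[\frac{k\pi}{m}, \frac{(k+1)\pi}{m}\right]$, $k \in \{0, \dots, m-1\}$, ensuring strict monotonicity.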

Decision Boundary: The classification rule $\cos(m\theta_{y_i}) > \cos(\theta_j)$ geometrically increases the effective separation between classes on the hypersphere.

Characteristics:

The similarity curve $\cos(m\theta)$ oscillates and requires piecewise correction for monotonicity.
The gradient $\frac{\partial \cos(m\theta)}{\partial \theta} = -m\sin(m\theta)$ exhibits multiple zero-crossings, leading to training instability, especially for large $m$.
Empirically, training with large $m$ can cause oscillation and collapse, requiring stabilization techniques (Zheng et al., 24 Sep 2025, Liu et al., 2021).
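The difference between the raw curve $\cos(m\theta)$ and its piecewise correction can be checked numerically. The sketch below assumes the standard piecewise form $(-1)^k \cos(m\theta) - 2k$ on $[k\pi/m, (k+1)\pi/m]$; the helper name `sphereface_psi` and the choice $m = 4$ are illustrative.

```python
import numpy as np

def sphereface_psi(theta, m=4):
    """Piecewise-corrected multiplicative margin: (-1)^k * cos(m*theta) - 2k."""
    theta = np.asarray(theta, dtype=float)
    # k is the index of the interval [k*pi/m, (k+1)*pi/m] that contains theta.
    k = np.minimum(np.floor(theta * m / np.pi), m - 1)
    return (-1.0) ** k * np.cos(m * theta) - 2.0 * k

theta = np.linspace(0.0, np.pi, 1001)
raw = np.cos(4 * theta)            # uncorrected curve: rises and falls on [0, pi]
corrected = sphereface_psi(theta)  # corrected curve: falls from 1 to 1 - 2m
```

The corrected curve never increases over $[0, \pi]$, while the raw curve has rising segments, which is where the zero-crossings of $-m\sin(m\theta)$ come from.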

3. CosFace: Additive Cosine Margin

Formulation: CosFace introduces an additive margin subtracted outside the cosine function: $\psi(\theta) = \cos(\theta) - m$

Decision Boundary: The rule $\cos(\theta_{y_i}) - m > \cos(\theta_j)$ forces a fixed gap in cosine similarity between classes, with the gap magnitude set by $m$.

Characteristics:

The similarity curve is a vertical shift, $\psi(\theta) = \cos(\theta) - m$, which remains monotonic for $\theta \in [0, \pi]$.
The gradient is $\frac{\partial \psi}{\partial \theta} = -\sin(\theta)$, unaffected by $m$.
Offers high training stability, but the penalty is spatially uniform and insensitive to intra-class sample positions (Zheng et al., 24 Sep 2025, Liu et al., 2021).
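In practice the additive cosine margin amounts to one subtraction on the target-class entry of the cosine matrix before scaling. A minimal sketch follows; the function name `cosface_logits` and the defaults `s=64.0`, `m=0.35` are assumptions chosen for illustration, not values taken from this article.

```python
import numpy as np

def cosface_logits(cos_theta, labels, s=64.0, m=0.35):
    """Subtract m from each sample's target-class cosine, then scale all logits by s."""
    logits = np.array(cos_theta, dtype=float)  # copy, so the input is left untouched
    rows = np.arange(logits.shape[0])
    logits[rows, labels] -= m                  # margin applied outside the cosine
    return s * logits

# Two samples, three classes; entries are cosines between features and class weights.
cos_theta = np.array([[0.8, 0.1, -0.2],
                      [0.3, 0.9,  0.0]])
labels = np.array([0, 1])
logits = cosface_logits(cos_theta, labels)
```

The same offset is applied whether the target cosine is 0.8 or 0.9, which is the spatially uniform penalty described above.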

4. ArcFace: Additive Angular Margin

Formulation: ArcFace enforces an angular margin by augmenting the argument inside the cosine: $\psi(\theta) = \cos(\theta + m)$

Decision Boundary: The rule $\cos(\theta_{y_i} + m) > \cos(\theta_j)$ translates to $\theta_{y_i} + m < \theta_j$ (principal range), reflecting an angular offset for the target class.

Characteristics:

The similarity curve is a left-shift, $\psi(\theta) = \cos(\theta + m)$.
Monotonicity is preserved only for $\theta \le \pi - m$; if $m$ is too large, non-monotonic intervals appear and the gradient sign flips.
The gradient $\frac{\partial \psi}{\partial \theta} = -\sin(\theta + m)$ may introduce conflicting gradient directions for certain parameterizations (Zheng et al., 24 Sep 2025, Liu et al., 2021).
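The monotonicity limit can be seen by evaluating the shifted cosine on a grid. The sketch below assumes the additive form $\cos(\theta + m)$ with an illustrative $m = 0.5$; `arcface_psi` is a hypothetical helper name.

```python
import numpy as np

def arcface_psi(theta, m=0.5):
    """Additive angular margin: cos(theta + m)."""
    return np.cos(np.asarray(theta, dtype=float) + m)

m = 0.5
theta = np.linspace(0.0, np.pi, 1001)
psi = arcface_psi(theta, m)

# The curve stops decreasing where theta + m passes pi.
turning_point = theta[np.argmin(psi)]
decreasing_part = psi[theta <= np.pi - m]
increasing_part = psi[theta > np.pi - m]
```

Past the turning point the target activation grows as the angle grows, so a sample far from its class proxy is penalized less, not more. A larger $m$ moves the turning point toward smaller angles and widens the affected interval.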

5. Comparative Analysis

The table below summarizes the core properties of SphereFace, CosFace, and ArcFace based on their penalty mechanisms, similarity curves, and stability:

Method	$\psi(\theta)$	Penalty Pattern (θ∈[0,π])	Stability Features
SphereFace	$\cos(m\theta)$ (piecewise corrected)	Small penalty near $\theta = 0$, large near $\theta = \pi$	Oscillating similarity & gradient; unstable for large $m$
CosFace	$\cos(\theta) - m$	Uniform penalty	Stable, but penalty fixed at $\Delta(\theta) = m$ regardless of sample position
ArcFace	$\cos(\theta + m)$	Roughly uniform; endpoint emphasis	Monotonic only for $\theta \le \pi - m$; gradient sign flips if $m$ too large

Implications: Multiplicative margins are more “geometric” and adaptive across the hypersphere, while additive margins (CosFace, ArcFace) favor operational stability but with limited flexibility in margin spatial distribution (Liu et al., 2021).
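The penalty patterns can be compared by tabulating the gap between the non-target and target activations at a few angles. This sketch assumes the margin function is $\eta(\theta) - \psi(\theta)$ with $\eta = \cos$, and uses illustrative margins ($m = 4$ for SphereFace, $0.35$ for CosFace, $0.5$ for ArcFace); all helper names are hypothetical.

```python
import numpy as np

def sphereface_psi(theta, m=4):
    """Piecewise-corrected multiplicative margin."""
    theta = np.asarray(theta, dtype=float)
    k = np.minimum(np.floor(theta * m / np.pi), m - 1)
    return (-1.0) ** k * np.cos(m * theta) - 2.0 * k

def cosface_psi(theta, m=0.35):
    """Additive cosine margin."""
    return np.cos(theta) - m

def arcface_psi(theta, m=0.5):
    """Additive angular margin."""
    return np.cos(theta + m)

def delta(psi, theta):
    """Gap between the non-target activation cos(theta) and the target activation psi(theta)."""
    return np.cos(theta) - psi(theta)

theta = np.linspace(0.0, np.pi, 5)  # 0, pi/4, pi/2, 3pi/4, pi
d_sphere = delta(sphereface_psi, theta)
d_cos = delta(cosface_psi, theta)
d_arc = delta(arcface_psi, theta)
```

Under these assumptions the SphereFace gap grows from 0 at $\theta = 0$ to $2m - 2$ at $\theta = \pi$, the CosFace gap is the constant $m$, and the ArcFace gap turns negative at $\theta = \pi$, which reflects the loss of monotonicity beyond $\pi - m$.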

6. Optimization and Stability: Characteristic Gradient Detachment and Feature Normalization

Training instability in margin-based angular losses primarily arises from complex or oscillatory margin functions with nontrivial derivatives. The “characteristic gradient detachment” (CGD) method ensures stable training by detaching the margin function $\Delta(\theta)$ from backpropagation in SphereFace-R, making the angular gradient resemble that of the basic normalized softmax loss.

Feature normalization schemes impact the learned representation:

No feature normalization (NFN): feature magnitude $\|\mathbf{x}_i\|$ unconstrained.
Hard feature normalization (HFN): feature magnitude fixed to the scale $s$.
Soft feature normalization (SFN): Penalty regularizes $\|\mathbf{x}_i\|$ toward $s$, retaining magnitude information.

Empirical results confirm that applying CGD and SFN with SphereFace-R eliminates oscillating loss trajectories and enables convergence matching the best additive-margin schemes (Liu et al., 2021).
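The effect of gradient detachment can be illustrated without an autograd library by writing the forward value and the backward gradient by hand. This is a sketch of the idea only: it assumes the detached quantity is the gap $\cos(\theta) - \psi(\theta)$, so that the forward pass keeps the SphereFace margin while the backward pass sees only $\cos(\theta)$. The exact formulation in SphereFace-R may differ, and in an autograd framework the same thing would be expressed with a stop-gradient operation.

```python
import numpy as np

def sphereface_psi(theta, m=4):
    """Piecewise-corrected multiplicative margin."""
    k = np.minimum(np.floor(theta * m / np.pi), m - 1)
    return (-1.0) ** k * np.cos(m * theta) - 2.0 * k

def sphereface_grad(theta, m=4):
    """Analytic d(psi)/d(theta) of the piecewise curve, without detachment."""
    k = np.minimum(np.floor(theta * m / np.pi), m - 1)
    return -((-1.0) ** k) * m * np.sin(m * theta)

def cgd_forward_backward(theta, m=4):
    """Forward value and gradient when the margin gap is treated as a constant."""
    gap = np.cos(theta) - sphereface_psi(theta, m)  # detached: contributes no gradient
    forward = np.cos(theta) - gap                   # numerically equal to sphereface_psi
    backward = -np.sin(theta)                       # gradient of cos(theta) only
    return forward, backward

theta = np.linspace(0.0, np.pi, 201)
forward, backward = cgd_forward_backward(theta)
naive = sphereface_grad(theta)
```

The detached gradient is bounded by 1 in magnitude, whereas the undetached gradient reaches magnitude $m$ and returns to zero at every interval boundary, which is the oscillation the detachment is meant to remove.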

7. Empirical Performance and Practical Impact

Experiments across VGGFace2, MS-Celeb-1M, MegaFace, and IJB benchmarks demonstrate the following trends:

SphereFace with proper normalization and CGD is competitive or superior in low false-accept-rate (FAR) regimes, especially with soft normalization.
ArcFace and CosFace maintain robust performance with high training stability (Liu et al., 2021).

Overall, SphereFace, CosFace, and ArcFace formalize the margin-based softmax landscape for hyperspherical face recognition, each balancing geometric margin strength, focus of penalty, and optimization stability within a unified framework (Zheng et al., 24 Sep 2025, Liu et al., 2021).


References

Zheng et al. (2025). ExpFace: Exponential Angular Margin Loss for Deep Face Recognition.

Liu et al. (2021). SphereFace Revived: Unifying Hyperspherical Face Recognition.
