Sub-center ArcFace: Enhanced Angular Margin

Updated 5 April 2026

Sub-center ArcFace is a method that assigns multiple learnable sub-centers per class to capture intra-class variability and mitigate the impact of mislabeled data.
It uses a training mechanism where only the dominant sub-center (with maximum cosine similarity) receives gradient updates, enabling effective noise isolation and data cleaning.
The approach has led to significant improvements in face and speaker verification benchmarks by refining class boundaries and enhancing model robustness.

Sub-center ArcFace is a robust extension of the ArcFace additive angular margin loss, designed to address class heterogeneity and label noise in large-scale face recognition and speaker verification tasks. Rather than associating each class with a single prototype on the hypersphere, sub-center ArcFace assigns multiple learnable sub-centers per class, enabling the model to explain intra-class variability, absorb mislabeled or noisy samples, and automatically isolate outlier distributions for downstream data cleaning and relabeling (Deng et al., 2018).

1. Mathematical Formulation

Let $x_i \in \mathbb{R}^d$ (or $\mathbf{e}_i$) be the $L_2$-normalized feature embedding of the $i$-th sample ($\|x_i\|_2=1$). For each class $j \in \{1, \ldots, N\}$, define $K$ normalized sub-centers $W_{j,1}, \ldots, W_{j,K} \in \mathbb{R}^d$ ($\|W_{j,k}\|_2=1$). Let $s > 0$ be a fixed scale and $m > 0$ the additive angular margin.

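The sub-center ArcFace loss for sample $i$ with ground-truth label $y_i$ is given by:

$$\mathcal{L}_i = -\log \frac{e^{s\cos(\theta_{i,y_i}+m)}}{e^{s\cos(\theta_{i,y_i}+m)} + \sum_{j \neq y_i} e^{s\cos\theta_{i,j}}}$$

where $\theta_{i,j} = \arccos\left(\max_{k} W_{j,k}^\top x_i\right)$, and $k \in \{1, \ldots, K\}$ (Deng et al., 2018, Qin et al., 2022, Baali et al., 25 Mar 2026).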

The angular margin $m$ is applied only to the logit corresponding to the ground-truth class's dominant sub-center. For each sample and class, the maximum cosine similarity over all sub-centers serves as the effective logit.
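The forward computation described above can be sketched in NumPy. This is a minimal illustration, not a reference implementation: the function name `subcenter_arcface_loss`, the dense `(N, K, d)` weight layout, and the defaults `s=64`, `m=0.5` are assumptions made for the example, and the special handling that practical ArcFace code adds for angles where $\theta + m > \pi$ is omitted.

```python
import numpy as np

def subcenter_arcface_loss(x, W, y, s=64.0, m=0.5):
    """Per-sample sub-center ArcFace loss (illustrative sketch).

    x: (B, d) embeddings, W: (N, K, d) sub-centers, y: (B,) integer labels.
    Returns an array of shape (B,).
    """
    # L2-normalize embeddings and sub-centers so inner products are cosines.
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    W = W / np.linalg.norm(W, axis=2, keepdims=True)

    # Cosine of every sample to every sub-center: shape (B, N, K).
    cos_all = np.einsum("bd,nkd->bnk", x, W)
    # Max-pool over the K sub-centers to get one cosine per class: (B, N).
    cos = cos_all.max(axis=2)

    # Add the margin m to the angle of the ground-truth class only.
    rows = np.arange(x.shape[0])
    theta_y = np.arccos(np.clip(cos[rows, y], -1.0, 1.0))
    logits = s * cos
    logits[rows, y] = s * np.cos(theta_y + m)

    # Numerically stable softmax cross-entropy.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[rows, y]
```

Setting `m=0` and `K=1` in this sketch reduces it to a plain normalized-softmax loss, which is a convenient sanity check when experimenting.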

2. Training Mechanism and Sub-center Assignment

For each sample $x_i$ in a mini-batch and each class $j$, compute the set of inner products $W_{j,k}^\top x_i$ for $k = 1, \ldots, K$. For each class $j$, select the sub-center $k_j^*$ yielding the highest score: $k_j^* = \arg\max_k W_{j,k}^\top x_i$.

Forward pass: Retain $\cos\theta_{i,j} = W_{j,k_j^*}^\top x_i$ for all classes $j$.
Backward pass: Only the “winning” sub-center $W_{j,k_j^*}$ receives the gradient update for sample $i$; all others remain unchanged for that sample (Deng et al., 2018, Baali et al., 25 Mar 2026).
After convergence: For data cleaning, retain only the “dominant” sub-center (majority assigned) per class and discard samples whose angle to the dominant center exceeds a threshold (e.g., $75^\circ$).

This mechanism operates identically for all classes, regardless of whether they are the correct label or impostors, ensuring true sample-cluster associations drive the update.
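The routing of gradients to winning sub-centers can be made explicit with a small NumPy sketch. It assumes unit-norm inputs and treats the sub-center entries as free parameters (the effect of re-normalization on the gradient is ignored); the helper name `route_to_winners` is hypothetical.

```python
import numpy as np

def route_to_winners(x_i, W):
    """Winning sub-centers and the resulting gradient pattern for one sample.

    x_i: (d,) unit-norm embedding; W: (N, K, d) unit-norm sub-centers.
    Returns (k_star, cos, grad) with shapes (N,), (N,), (N, K, d).
    """
    n_classes = W.shape[0]
    scores = W @ x_i                       # (N, K) cosines to every sub-center
    k_star = scores.argmax(axis=1)         # winning sub-center per class
    cos = scores[np.arange(n_classes), k_star]  # effective class cosines

    # The derivative of max_k W[j, k] . x_i with respect to W[j, k] is x_i
    # for k = k_star[j] and zero otherwise, so for this sample only the
    # winners can be updated, for the true class and impostor classes alike.
    mask = np.zeros_like(scores)
    mask[np.arange(n_classes), k_star] = 1.0
    grad = mask[:, :, None] * x_i[None, None, :]
    return k_star, cos, grad
```

In an autodiff framework the same behaviour falls out of a max reduction over the sub-center axis, so no explicit mask is usually needed.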

3. Role of Dominant and Non-dominant Sub-centers in Noise Isolation

The sub-center scheme divides each class into $K$ clusters on the unit hypersphere. In noisy datasets, the majority of clean data for class $j$ forms a tight cluster around one dominant sub-center, while hard, atypical, or mislabeled samples are drawn toward non-dominant sub-centers.

After model convergence:

Dominant sub-center: Represents the clean, well-aligned core of each class.
Non-dominant sub-centers: Absorb ambiguous or mislabeled outliers, effectively separating label noise from useful data (Deng et al., 2018, Qin et al., 2022).

This separation allows automatic data purification by pruning samples distant from the dominant sub-center, and retraining on the resulting cleaned dataset yields substantial generalization improvements.
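The pruning step can be sketched as follows, assuming embeddings and sub-centers are already unit-normalized and that "dominant" means the sub-center to which most of a class's samples are assigned. The function name `clean_by_dominant_subcenter` and its argument layout are illustrative, not taken from the cited work.

```python
import numpy as np

def clean_by_dominant_subcenter(x, y, W, max_angle_deg=75.0):
    """Boolean keep-mask over samples after sub-center-based cleaning.

    x: (B, d) unit-norm embeddings, y: (B,) integer labels,
    W: (N, K, d) unit-norm sub-centers.
    """
    n_classes, n_sub, _ = W.shape
    # Cosine of each sample to the K sub-centers of its labeled class: (B, K).
    cos_own = np.einsum("bd,bkd->bk", x, W[y])
    assigned = cos_own.argmax(axis=1)

    # Dominant sub-center: the one most samples of the class are assigned to.
    counts = np.zeros((n_classes, n_sub), dtype=int)
    np.add.at(counts, (y, assigned), 1)
    dominant = counts.argmax(axis=1)

    # Keep samples whose angle to the dominant sub-center is within the threshold.
    cos_dom = cos_own[np.arange(len(y)), dominant[y]]
    angle = np.degrees(np.arccos(np.clip(cos_dom, -1.0, 1.0)))
    return angle <= max_angle_deg
```

The returned mask can then be used to filter the training set before retraining on the cleaned data.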

4. Geometric Interpretation on the Hypersphere

All features and sub-center weights are constrained to the unit hypersphere in $\mathbb{R}^d$. Each class is no longer a single point, but a constellation of $K$ points. The intra-class angular distribution, potentially multi-modal due to pose, lighting, or noise, is modeled as a mixture of clusters.

The margin $m$ is still enforced at the angular (geodesic) level between the sample and its closest sub-center, which enhances inter-class discrimination while permitting within-class diversity (Deng et al., 2018).
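One way to see why max-pooling over cosines matches this geometric picture: $\arccos$ is strictly decreasing on $[-1, 1]$, so taking the largest cosine over a class's sub-centers is the same as taking the smallest angle. Writing $\theta_{i,j}$ for the effective angle between sample $x_i$ and class $j$, with $W_{j,k}$ the sub-centers as in Section 1:

$$\theta_{i,j} = \arccos\Big(\max_{1 \le k \le K} W_{j,k}^\top x_i\Big) = \min_{1 \le k \le K} \arccos\big(W_{j,k}^\top x_i\big)$$

With ground-truth label $y_i$ and additive angular margin $m$, the target logit is the largest exactly when $\cos(\theta_{i,y_i} + m) > \cos\theta_{i,j}$ for every $j \neq y_i$. Provided $\theta_{i,y_i} + m \le \pi$, this is equivalent to

$$\theta_{i,y_i} + m < \theta_{i,j} \quad \text{for all } j \neq y_i,$$

that is, the sample must be closer to its nearest own-class sub-center than to any other class's nearest sub-center by at least $m$ radians.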

5. Applications: Face Recognition, Speaker Verification, and Noisy Data Regimes

Sub-center ArcFace was initially developed for deep face recognition under massive label noise (e.g., web-scraped MS1M-V0 at 50% label noise) (Deng et al., 2018). Its utility in noisy or poorly-labeled settings has led to adoption in speaker verification, especially under semi-supervised domain adaptation schemes with clustering-derived pseudo-labels.

Face Recognition: Training ResNet-50 with Sub-center ArcFace ($K=3$) raises TPR at a fixed FPR on IJB-C relative to the ArcFace baseline. Automatic cleaning with sub-center-based hard pruning and re-training pushes performance further, nearly matching models trained with fully human-labeled data.
Speaker Verification: In domain adaptation on pseudo-labeled CN-Celeb, switching from ArcFace to Sub-center ArcFace reduced EER, and further improvements were achieved by combining it with AS-Norm and QMF back-ends (Qin et al., 2022, Baali et al., 25 Mar 2026).
Curriculum Learning: Recent systems leverage the dominant sub-center cosine as a per-sample confidence score to rank and schedule training examples (easy/medium/hard) for adaptive curriculum loss weighting (Baali et al., 25 Mar 2026).

6. Implementation Details, Hyper-parameters, and Empirical Findings

<table>
<thead>
<tr> <th>Parameter</th> <th>Typical Value</th> <th>Significance</th> </tr>
</thead>
<tbody>
<tr> <td>Sub-centers per class ($K$)</td> <td>3</td> <td>Isolates dominant and outlier modes; larger $K$ usually hurts</td> </tr>
<tr> <td>Scale ($s$)</td> <td>32 (speaker), 64 (face)</td> <td>Inherited from ArcFace for margin sharpness</td> </tr>
<tr> <td>Angular margin ($m$)</td> <td>0.2–0.5 rad</td> <td>Larger $m$ strengthens decision boundaries</td> </tr>
<tr> <td>Angle threshold</td> <td>75° (for data cleaning)</td> <td>Robustly prunes high-confidence noise and outliers</td> </tr>
</tbody>
</table>
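For experimentation, the typical values in the table could be collected in a small configuration object. The class name `SubCenterArcFaceConfig`, its field names, and the choice of 0.5 as the default margin (one end of the table's 0.2 to 0.5 range) are assumptions made for this sketch.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class SubCenterArcFaceConfig:
    """Hypothetical container for the hyper-parameters listed in the table."""
    num_subcenters: int = 3         # K
    scale: float = 64.0             # s: 64 for face, 32 for speaker
    margin: float = 0.5             # m in radians; the table gives 0.2-0.5
    clean_angle_deg: float = 75.0   # pruning threshold used for data cleaning

    def __post_init__(self):
        # Basic sanity checks; the bounds are generic, not from the source.
        if self.num_subcenters < 1:
            raise ValueError("num_subcenters must be at least 1")
        if self.scale <= 0:
            raise ValueError("scale must be positive")
        if not 0.0 <= self.margin < math.pi:
            raise ValueError("margin must lie in [0, pi)")
        if not 0.0 < self.clean_angle_deg <= 180.0:
            raise ValueError("clean_angle_deg must lie in (0, 180]")

# Example presets following the scale values in the table.
FACE = SubCenterArcFaceConfig()
SPEAKER = SubCenterArcFaceConfig(scale=32.0)
```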

Other implementation notes:

Max-pooling over sub-centers, rather than softmax-weighted pooling, yielded the best results (Deng et al., 2018).
Second-round clustering and fine-tuning in semi-supervised settings may degrade final accuracy (Qin et al., 2022).
In curriculum approaches, per-sample confidence (the dominant sub-center cosine) is tracked via moving average and standard deviation; tiered weights are adaptively scheduled (Baali et al., 25 Mar 2026).
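As a rough illustration of the curriculum note above, the sketch below keeps an exponential moving average of the mean and variance of batch confidences and splits samples into easy, medium, and hard tiers. Everything specific here is an assumption: the class name `ConfidenceTiers`, the momentum of 0.9, the mean ± one standard deviation cut-offs, and the example tier weights. The cited system may track and schedule these quantities differently.

```python
import numpy as np

class ConfidenceTiers:
    """Running statistics of confidence scores, used to assign tiers (sketch)."""

    def __init__(self, momentum=0.9):
        self.momentum = momentum
        self.mean = None
        self.var = None

    def update(self, conf):
        """conf: (B,) dominant sub-center cosines for the current batch."""
        batch_mean, batch_var = conf.mean(), conf.var()
        if self.mean is None:
            # First batch initializes the running statistics.
            self.mean, self.var = batch_mean, batch_var
        else:
            a = self.momentum
            self.mean = a * self.mean + (1 - a) * batch_mean
            self.var = a * self.var + (1 - a) * batch_var

    def tiers(self, conf):
        """Return 0 (easy), 1 (medium), or 2 (hard) for each sample."""
        std = np.sqrt(self.var)
        easy = conf >= self.mean + std
        hard = conf < self.mean - std
        return np.where(easy, 0, np.where(hard, 2, 1))

def tier_weights(tiers, weights=(1.0, 1.0, 0.5)):
    """Map tier indices to per-sample loss weights (example values only)."""
    return np.asarray(weights)[tiers]
```

The resulting weights would multiply the per-sample loss; a schedule could then change the `weights` tuple as training progresses.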

7. Limitations and Practical Recommendations

Sub-center ArcFace requires tuning of $K$; over-fragmenting classes diminishes the model's ability to aggregate sufficient samples per prototype. Empirically, $K=3$ generally suffices for most heterogeneity found in unconstrained visual and signal datasets. Combining sub-center ArcFace with strong domain adaptation or quality control back-ends (e.g., AS-Norm, QMF) is recommended in cross-domain or semi-supervised workflows. Over-iterating clustering and fine-tuning cycles can harm the learned representations; a single round is generally sufficient (Qin et al., 2022).

Sub-center ArcFace presents a modular, generalizable technique for robustifying angular margin losses under label noise, with demonstrated gains across face and speaker recognition benchmarks (Deng et al., 2018, Baali et al., 25 Mar 2026, Qin et al., 2022).