X²-Softmax: Adaptive Quadratic Angular Loss

Updated 30 June 2026

X²-Softmax is an angular-margin loss function that employs a quadratic, angle-dependent margin to boost discriminative feature learning in face recognition.
It adaptively strengthens margins based on the angular separation between class centers, addressing challenges in imbalanced and closely spaced classes.
Empirical results indicate that X²-Softmax achieves competitive or superior performance compared to fixed-margin losses like ArcFace and CosFace on standard benchmarks.

X²-Softmax is designed to improve face recognition by making the margin grow with the angular distance between class centers. Unlike fixed-margin losses such as CosFace and ArcFace, it uses a quadratic, angle-dependent logit, which gives more flexibility in class separation under inter-class imbalance and varied angular distributions. Classes with large angular separation receive stronger margins, which increases discriminative power without impeding convergence for tightly spaced classes (Xu et al., 2023).

1. Motivation and Conceptual Background

Conventional face recognition models utilize softmax-based classification losses, which cluster features from the same identity and separate those from different identities in the embedding space. However, the basic softmax loss does not explicitly enforce margins in angular space, which is critical to obtaining high inter-class separability in face verification.

Subsequent losses, CosFace (AM-Softmax) and ArcFace, address this by imposing a fixed additive margin on the ground-truth logit, in cosine space and in angular space respectively, via

$f_C(\theta) = \cos\theta - m$

and

$f_A(\theta) = \cos(\theta + m),$

where $m$ is a fixed margin parameter.
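The two fixed-margin logits can be compared numerically with a short sketch. The function names are hypothetical, and the margin values (0.35 for CosFace, 0.5 for ArcFace) are commonly used defaults chosen here for illustration, not values stated in this article.

```python
import math

def cosface_logit(theta: float, m: float = 0.35) -> float:
    """CosFace target logit: subtract the margin from the cosine."""
    return math.cos(theta) - m

def arcface_logit(theta: float, m: float = 0.5) -> float:
    """ArcFace target logit: add the margin to the angle."""
    return math.cos(theta + m)

# Gap between the plain cosine logit and each margin logit at a few angles.
for theta in (0.2, 0.8, 1.4):
    gap_cos = math.cos(theta) - cosface_logit(theta)
    gap_arc = math.cos(theta) - arcface_logit(theta)
    print(f"theta={theta:.1f}  CosFace gap={gap_cos:.3f}  ArcFace gap={gap_arc:.3f}")
```

The CosFace gap equals $m$ at every angle, and ArcFace shifts every angle by the same $m$; neither margin depends on how far apart the classes are.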

Fixed margins are a poor fit for large-scale, imbalanced face datasets, where inter-class angles vary widely. For tightly clustered classes, a large fixed margin may hinder convergence, while for class pairs with substantial angular separation the same margin may be weaker than the available space allows. Adaptive margins, which increase with the inter-class angle, promise greater flexibility and better discrimination (Xu et al., 2023).

2. Mathematical Formalism

Let $x_i \in \mathbb{R}^d$ be the normalized feature vector of the $i$-th example and $W_j \in \mathbb{R}^d$ the normalized class center for class $j$. The angle $\theta_{ij}$ between $x_i$ and class center $W_j$ is defined as

$\theta_{ij} = \arccos\left(W_j^\top x_i\right).$

A scale factor $s$ controls logit sharpness.

For a batch of $N$ examples with labels $y_i$ and $C$ classes, the standard softmax loss on these normalized logits is

$L_{\text{softmax}} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{s \cos\theta_{i y_i}}}{\sum_{j=1}^{C} e^{s \cos\theta_{ij}}}.$

In X²-Softmax, the logit for the ground-truth class is replaced by a quadratic function of the angle,

$f_X(\theta) = a(\theta - h)^2 + k,$

where $a$, $h$, and $k$ are hyperparameters controlling the margin curvature, horizontal shift, and vertical shift, respectively. With $y_i$ the label of example $i$ and $N$ the batch size, the resulting loss is

$L_{X^2} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{s f_X(\theta_{i y_i})}}{e^{s f_X(\theta_{i y_i})} + \sum_{j \neq y_i} e^{s \cos\theta_{ij}}}.$

Critically, the quadratic logit makes the margin grow with angular distance: class pairs that are already well separated receive a stronger margin, while pairs that lie close together are not overwhelmed. In the two-class case, the margin is shown to increase in proportion to the overall angular separation.
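To make the angle dependence concrete, the sketch below assumes the quadratic logit has the vertex form $a(\theta - h)^2 + k$ suggested by the curvature, horizontal-shift, and vertical-shift roles of the three hyperparameters. The values used ($a = -0.5$, $h = 0$, $k = 1$, $s = 16$) are illustrative assumptions, not the paper's setting; they correspond to the second-order Taylor expansion of $\cos\theta$. The function names are hypothetical.

```python
import math

def x2_logit(theta: float, a: float, h: float, k: float) -> float:
    """Quadratic target logit a * (theta - h)^2 + k."""
    return a * (theta - h) ** 2 + k

def x2_softmax_loss(thetas, label, s, a, h, k):
    """Per-example loss: quadratic logit for the label, cosine for the rest."""
    logits = [
        s * (x2_logit(t, a, h, k) if j == label else math.cos(t))
        for j, t in enumerate(thetas)
    ]
    peak = max(logits)  # subtract the max for numerical stability
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[label]

# Illustrative values only (NOT the paper's setting): the second-order
# Taylor expansion of cos(theta), i.e. 1 - theta^2 / 2.
A, H, K = -0.5, 0.0, 1.0

# Gap between the plain cosine logit and the quadratic logit.
gaps = [math.cos(t) - x2_logit(t, A, H, K) for t in (0.5, 1.0, 1.5)]

# Angles from one example to three class centers; class 0 is the label.
loss = x2_softmax_loss([0.6, 1.4, 1.7], label=0, s=16.0, a=A, h=H, k=K)
```

With these values the gap between the cosine logit and the quadratic logit widens as the angle grows (roughly 0.003, 0.040, and 0.196 at 0.5, 1.0, and 1.5 rad), which is the qualitative behavior an angle-dependent margin is meant to have.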

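The gradient of the per-example loss $L_i$ with respect to the ground-truth angle is

$\frac{\partial L_i}{\partial \theta_{i y_i}} = -s\,(1 - p_i)\,f_X'(\theta_{i y_i}) = -2as\,(1 - p_i)\,(\theta_{i y_i} - h),$

where

$p_i = \frac{e^{s f_X(\theta_{i y_i})}}{e^{s f_X(\theta_{i y_i})} + \sum_{j \neq y_i} e^{s \cos\theta_{ij}}}$

is the softmax probability assigned to the ground-truth class and $f_X(\theta) = a(\theta - h)^2 + k$ is the quadratic logit. The linear factor $(\theta_{i y_i} - h)$ ensures that the farther a sample is from its class center, the stronger the corrective force exerted by the margin.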
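The gradient expression can be sanity-checked against a finite difference. The sketch below assumes the vertex-form logit $a(\theta - h)^2 + k$ for the ground-truth class and cosine logits for the others; the closed form $-2as(1 - p)(\theta - h)$ follows from the chain rule on that assumption. All numeric values and function names are illustrative, not taken from the paper.

```python
import math

def x2_loss(theta_y, other_thetas, s, a, h, k):
    """Per-example loss as a function of the ground-truth angle theta_y."""
    target = s * (a * (theta_y - h) ** 2 + k)
    others = [s * math.cos(t) for t in other_thetas]
    peak = max([target] + others)  # subtract the max for numerical stability
    log_sum = peak + math.log(
        math.exp(target - peak) + sum(math.exp(z - peak) for z in others)
    )
    return log_sum - target

def x2_grad(theta_y, other_thetas, s, a, h, k):
    """Analytic gradient: -2 * a * s * (1 - p) * (theta_y - h)."""
    # The ground-truth probability is p = exp(-loss).
    p = math.exp(-x2_loss(theta_y, other_thetas, s, a, h, k))
    return -2.0 * a * s * (1.0 - p) * (theta_y - h)

# Illustrative values only, not taken from the paper.
S, A, H, K = 8.0, -0.5, 0.0, 1.0
theta_y, others = 0.9, [1.2, 1.5]

# Central finite difference of the loss with respect to theta_y.
eps = 1e-6
numeric = (x2_loss(theta_y + eps, others, S, A, H, K)
           - x2_loss(theta_y - eps, others, S, A, H, K)) / (2 * eps)
analytic = x2_grad(theta_y, others, S, A, H, K)
```

With a negative curvature and $\theta > h$, the gradient is positive, so gradient descent reduces the angle between the sample and its class center.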

3. Implementation and Optimization

The reference architecture is based on ResNet-50 with a 512-dimensional embedding layer. Training is conducted on the MS1Mv3 dataset (5.1 million images, 93,000 identities), following standard preprocessing steps: face alignment to a fixed crop size and pixel-value normalization.

Key hyperparameters:

Scale factor: $s$
Quadratic margin parameters: $a$, $h$, $k$ (selected via ablation)
Optimizer: SGD with initial learning rate 0.02, momentum 0.9, and weight decay
Batch size: 128
Total iterations: 1,052,000, with scheduled learning rate decays

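In condensed form, each training step normalizes the features and class centers, computes the angles $\theta_{ij}$, applies the quadratic logit to the ground-truth class and the cosine logit to all other classes, scales the logits, and updates the network and class centers by SGD on the resulting cross-entropy loss.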
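The forward pass of one training step can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' code: it assumes the vertex-form quadratic logit $a(\theta - h)^2 + k$ for the ground-truth class, uses toy 2-D data and illustrative hyperparameter values, and omits the backward pass and SGD update that a deep-learning framework would provide. All function names are hypothetical.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def x2_batch_loss(features, centers, labels, s, a, h, k):
    """Mean X2-style loss over a batch of raw features and class centers."""
    unit_centers = [l2_normalize(w) for w in centers]
    total = 0.0
    for x, y in zip(features, labels):
        x = l2_normalize(x)
        logits = []
        for j, w in enumerate(unit_centers):
            cos_t = sum(xc * wc for xc, wc in zip(x, w))
            cos_t = max(-1.0, min(1.0, cos_t))  # guard acos against rounding
            if j == y:
                theta = math.acos(cos_t)
                logits.append(s * (a * (theta - h) ** 2 + k))
            else:
                logits.append(s * cos_t)
        peak = max(logits)  # subtract the max for numerical stability
        log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
        total += log_sum - logits[y]
    return total / len(features)

# Toy 2-D data; all values are illustrative, not the paper's configuration.
centers = [[1.0, 0.0], [0.0, 1.0]]
features = [[2.0, 0.2], [0.1, 3.0]]
good = x2_batch_loss(features, centers, [0, 1], s=16.0, a=-0.5, h=0.0, k=1.0)
bad = x2_batch_loss(features, centers, [1, 0], s=16.0, a=-0.5, h=0.0, k=1.0)
```

The loss is near zero when each feature points toward its own class center (`good`) and large when the labels are swapped (`bad`).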

4. Empirical Results and Comparative Analysis

X²-Softmax is competitive with ArcFace and CosFace across several standard face verification benchmarks: it leads on LFW and trails slightly on the other listed sets. On the "LFW-style" evaluations using ResNet-50 features trained on MS1Mv3:

Benchmark	X²-Softmax (%)	ArcFace (%)	CosFace (%)
LFW	99.82	99.77	99.75
CALFW	95.92	96.03	-
CPLFW	91.67	92.05	-
AgeDB-30	97.83	98.18	-
CFP-FP	97.20	-	98.01
VGG2-FP	94.52	95.30	-

For the IJB-B and IJB-C datasets (TAR at a given FAR):

Dataset	FAR	X²-Softmax (TAR)	ArcFace (TAR)
IJB-B	1e-4	0.9495	0.9485
IJB-B	1e-5	0.9133	0.9092
IJB-C	1e-4	0.9624	0.9629
IJB-C	1e-5	0.9459	0.9449
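For readers unfamiliar with the metric, TAR at a fixed FAR can be computed from similarity scores roughly as follows. This is a simplified sketch with synthetic scores and a hypothetical function name; the official IJB protocols use template-based matching and far larger pair sets, and ties between scores are ignored here.

```python
def tar_at_far(genuine, impostor, far):
    """True accept rate at the threshold that admits at most `far` of impostors."""
    ranked = sorted(impostor, reverse=True)
    allowed = int(far * len(ranked))  # impostor pairs allowed above threshold
    # Threshold sits at the first impostor score that must be rejected.
    threshold = ranked[allowed] if allowed < len(ranked) else float("-inf")
    accepted = sum(1 for score in genuine if score > threshold)
    return accepted / len(genuine)

# Synthetic cosine similarities for illustration only.
genuine = [0.91, 0.84, 0.77, 0.65, 0.42]
impostor = [0.55, 0.31, 0.22, 0.18, 0.10, 0.05, 0.02, -0.03, -0.08, -0.12]

tar_strict = tar_at_far(genuine, impostor, far=0.0)  # no impostor accepted
tar_loose = tar_at_far(genuine, impostor, far=0.1)   # one impostor accepted
```

On this toy data the strict setting accepts 4 of 5 genuine pairs (TAR 0.8), and allowing one impostor in ten raises TAR to 1.0; tightening FAR always trades away some TAR.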

Convergence is on par with ArcFace, but with slightly reduced variance in the loss. Cosine similarity histograms for IJB-C indicate that X²-Softmax delivers lower overlap between positive and negative pairs, evidencing improved feature discriminability.

Ablation studies on the quadratic margin hyperparameters confirm an optimal trade-off at the selected setting: a larger curvature makes the margin grow faster with angle but risks instability if set too high.

5. Usage Guidelines and Optimization Strategies

The recommended defaults for X²-Softmax are the scale $s$ and the quadratic parameters $(a, h, k)$ selected in the ablation study, together with SGD-based training with batch size 128 on large-scale face identification datasets.

Suggested tuning procedures:

For datasets with very tight inter-class angular separation, adjust the curvature and shift parameters to enhance margin strength.
If instability or poor convergence is observed, adjust them in the opposite direction to moderate the margin's penalization.
Regularly monitor a validation set to calibrate positive/negative pair overlap in the cosine distribution.
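One simple way to track positive/negative overlap on a validation set is to count how often a negative pair scores at least as high as a positive pair. The statistic and function name below are illustrative choices, not a procedure prescribed by the paper, and the scores are synthetic.

```python
def overlap_fraction(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs that are ranked the wrong way."""
    wrong = sum(1 for p in pos_scores for n in neg_scores if p <= n)
    return wrong / (len(pos_scores) * len(neg_scores))

# Synthetic cosine similarities for illustration only.
positives = [0.8, 0.6, 0.3]
negatives = [0.4, 0.1, -0.2]

overlap = overlap_fraction(positives, negatives)          # one of nine pairs
separated = overlap_fraction([0.9, 0.7], [0.2, 0.1])      # no overlap at all
```

A value of 0 means the two cosine distributions are fully separated; a rising value across checkpoints suggests the margin setting should be revisited.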

Noted limitations include the fixed quadratic form of the margin. More sophisticated or learnable margin functions, such as higher-degree polynomials or per-class adaptations, may capture inter-class relations more precisely, though at increased computational cost. Extensions to multi-scale or angular-adaptive losses (e.g., conditioning on feature norm) provide promising research directions.

6. Significance and Prospects

Through its quadratic, angular-adaptive margin, X²-Softmax provides discriminative feature learning that better adapts to class separation in face recognition tasks. This methodology yields improved or comparable accuracy on standard benchmarks and enhances feature separability through stronger, context-sensitive margins for well-separated classes. Future investigation into learnable or further adaptive margin strategies may further refine performance in complex, large-scale recognition settings (Xu et al., 2023).

References

Xu et al. (2023). X2-Softmax: Margin Adaptive Loss Function for Face Recognition.
