
X²-Softmax: Adaptive Quadratic Angular Loss

Updated 30 June 2026
  • X²-Softmax is an angular-margin loss function that employs a quadratic, angle-dependent margin to boost discriminative feature learning in face recognition.
  • It adaptively strengthens margins based on the angular separation between class centers, addressing challenges in imbalanced and closely spaced classes.
  • Empirical results indicate that X²-Softmax achieves competitive or superior performance compared to fixed-margin losses like ArcFace and CosFace on standard benchmarks.

X²-Softmax is an angular-margin loss function designed to improve face recognition performance by adaptively increasing the separation between class centers as their angular distance grows. Unlike fixed-margin losses such as CosFace and ArcFace, X²-Softmax introduces a quadratic, angle-dependent margin that offers greater flexibility in class separation, particularly under inter-class imbalance and varied angular distributions. The approach applies stronger margins to classes with large angular separations, increasing discriminative power without impeding convergence for tightly spaced classes (Xu et al., 2023).

1. Motivation and Conceptual Background

Conventional face recognition models utilize softmax-based classification losses, which cluster features from the same identity and separate those from different identities in the embedding space. However, the basic softmax loss does not explicitly enforce margins in angular space, which is critical to obtaining high inter-class separability in face verification.

Subsequent advancements, CosFace (AM-Softmax) and ArcFace, address this by imposing fixed additive margins, either in cosine or angular space, via

$$f_C(\theta) = \cos\theta - m$$

and

$$f_A(\theta) = \cos(\theta + m),$$

where $m$ is a fixed margin parameter.
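The two fixed-margin logits can be compared numerically. The sketch below is illustrative only: the margin values `M_COS` and `M_ARC` are assumptions picked for the example, not values taken from this article.

```python
import math


def cosface_logit(theta: float, m: float) -> float:
    """CosFace: subtract a fixed margin m from the cosine of the angle."""
    return math.cos(theta) - m


def arcface_logit(theta: float, m: float) -> float:
    """ArcFace: add a fixed margin m to the angle before taking the cosine."""
    return math.cos(theta + m)


# Assumed margins, for illustration only.
M_COS, M_ARC = 0.35, 0.5

for theta in (0.2, 0.6, 1.0):
    print(
        f"theta={theta:.1f}  cos={math.cos(theta):.3f}  "
        f"CosFace={cosface_logit(theta, M_COS):.3f}  "
        f"ArcFace={arcface_logit(theta, M_ARC):.3f}"
    )
```

In both cases the margin `m` is the same for every angle, which is the property the adaptive margin is meant to relax.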

Fixed margins can prove inadequate on the large-scale, imbalanced datasets used in face recognition. For tightly spaced class pairs, a large fixed margin may hinder convergence, while for pairs with substantial angular separation, a small fixed margin may be insufficient to exploit the available separation. Adaptive margins, which increase with the inter-class angle, promise greater flexibility and improved discriminative power (Xu et al., 2023).

2. Mathematical Formalism

Let $x_i \in \mathbb{R}^d$ be the normalized feature vector of the $i$-th example and $W_j \in \mathbb{R}^d$ the normalized class center for class $j$. The angle $\theta_{ij}$ between $x_i$ and class center $W_j$ is defined as

$$\theta_{ij} = \arccos\left(W_j^\top x_i\right).$$

A scale factor $s$ controls logit sharpness (typically $s = 64$).
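As a small illustration of the angle definition and of what "logit sharpness" means, the sketch below computes angles between L2-normalized vectors and compares softmax outputs at two scales. The toy vectors are made up, and the scale of 64 is assumed here as the value commonly used with ArcFace-style losses.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def angle(x, w):
    """Angle in radians between two vectors after L2-normalization."""
    cos = sum(a * b for a, b in zip(l2_normalize(x), l2_normalize(w)))
    return math.acos(max(-1.0, min(1.0, cos)))  # clamp against rounding error


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


x = [0.9, 0.1]                      # toy 2-D feature (made up)
centers = [[1.0, 0.0], [0.0, 1.0]]  # toy class centers (made up)
cosines = [math.cos(angle(x, w)) for w in centers]

for s in (1.0, 64.0):
    print(f"s={s:>4}: {softmax([s * c for c in cosines])}")
```

With `s = 1` the cosine logits differ by less than 1, so the softmax stays fairly flat; a large scale stretches the same cosines into a near one-hot distribution.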

The standard softmax loss is

$$L_{\text{softmax}} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{s \cos\theta_{i, y_i}}}{\sum_{j=1}^{C} e^{s \cos\theta_{ij}}},$$

where $N$ is the batch size, $C$ the number of classes, $y_i$ the ground-truth class of example $i$, and $s$ the scale factor. In X²-Softmax, the logit for the ground-truth class is replaced by a quadratic,

$$f(\theta) = a(\theta - h)^2 + k,$$

where $a$, $h$, and $k$ are hyperparameters controlling the margin curvature, horizontal shift, and vertical shift, respectively. The resulting loss is

$$L_{X^2} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{s f(\theta_{i, y_i})}}{e^{s f(\theta_{i, y_i})} + \sum_{j \neq y_i} e^{s \cos\theta_{ij}}}.$$

Critically, the quadratic margin imposed for the ground-truth logit increases with the angular distance, thereby enforcing stronger penalties for pairs that are already well separated, while not overwhelming pairs that are near each other. In a two-class case, the margin is shown to increase in proportion to the overall angular separation.
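One way to see an angle-dependent margin is to ask which ArcFace-style angle shift would produce the same logit as the quadratic at each angle. The sketch below assumes the quadratic form $f(\theta) = a(\theta - h)^2 + k$; the values of `A`, `H`, and `K` are hypothetical, chosen only so that the curve lies below $\cos\theta$, and are not the paper's settings. The helper `equivalent_angular_margin` is likewise an illustrative construct, not a quantity defined in the source.

```python
import math

# Hypothetical parameters for illustration; NOT the values selected in the paper.
A, H, K = -0.5, -0.3, 1.0


def x2_logit(theta, a=A, h=H, k=K):
    """Quadratic ground-truth logit f(theta) = a * (theta - h)**2 + k."""
    return a * (theta - h) ** 2 + k


def equivalent_angular_margin(theta):
    """Angle shift m such that cos(theta + m) equals x2_logit(theta)."""
    value = max(-1.0, min(1.0, x2_logit(theta)))  # keep acos in its domain
    return math.acos(value) - theta


thetas = [0.0, 0.25, 0.5, 0.75, 1.0, 1.25, 1.5]
margins = [equivalent_angular_margin(t) for t in thetas]

for t, m in zip(thetas, margins):
    print(f"theta={t:.2f}  cos={math.cos(t):+.3f}  f={x2_logit(t):+.3f}  margin={m:.3f}")
```

Under these assumed parameters the equivalent margin grows steadily with the angle, whereas a fixed-margin loss would print the same margin on every row.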

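The gradient of the per-sample loss $L_i$ with respect to the ground-truth angle,

$$\frac{\partial L_i}{\partial \theta_{i, y_i}} = -s \left(1 - p_{i, y_i}\right) f'(\theta_{i, y_i}),$$

where

$$f'(\theta) = 2a(\theta - h)$$

is the derivative of the quadratic logit and $p_{i, y_i}$ is the softmax probability assigned to the ground-truth class, ensures that the farther a sample is from its class center, the stronger the corrective force exerted by the margin: for $a < 0$ and $\theta > h$, the magnitude of $f'(\theta)$ grows linearly with $\theta$.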
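The chain-rule gradient can be sanity-checked numerically. The sketch below assumes the quadratic logit $f(\theta) = a(\theta - h)^2 + k$ and the gradient $-s(1 - p) \cdot 2a(\theta - h)$ for a single sample, and compares it with a central finite difference. All parameter values (`A`, `H`, `K`, `S`) and the example angles are hypothetical.

```python
import math

# Hypothetical parameters for illustration; NOT the paper's values.
A, H, K, S = -0.5, -0.3, 1.0, 16.0


def x2_logit(theta):
    return A * (theta - H) ** 2 + K


def target_prob(theta_y, other_thetas):
    """Softmax probability of the ground-truth class for one sample."""
    logits = [S * x2_logit(theta_y)] + [S * math.cos(t) for t in other_thetas]
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    return exps[0] / sum(exps)


def loss(theta_y, other_thetas):
    return -math.log(target_prob(theta_y, other_thetas))


def analytic_grad(theta_y, other_thetas):
    """dL/dtheta_y = -s * (1 - p) * f'(theta_y), with f'(theta) = 2a(theta - h)."""
    p = target_prob(theta_y, other_thetas)
    return -S * (1.0 - p) * 2.0 * A * (theta_y - H)


def numeric_grad(theta_y, other_thetas, eps=1e-6):
    """Central finite-difference approximation of the same derivative."""
    up = loss(theta_y + eps, other_thetas)
    down = loss(theta_y - eps, other_thetas)
    return (up - down) / (2.0 * eps)


others = [1.2, 1.4]  # angles to two non-target class centers (made up)
for theta_y in (0.4, 0.8, 1.2):
    print(theta_y, analytic_grad(theta_y, others), numeric_grad(theta_y, others))
```

A positive gradient means the loss rises with the angle, so gradient descent pulls the sample toward its class center.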

3. Implementation and Optimization

The reference architecture is based on ResNet-50 with a 512-dimensional embedding layer. Training is conducted on the MS1Mv3 dataset (5.1 million images, 93,000 identities), following standard preprocessing steps: face alignment to a fixed crop size and pixel-value normalization.

Key hyperparameters:

  • Scale factor: $s$
  • Quadratic margin parameters: $a$, $h$, $k$ (selected via ablation)
  • Optimizer: SGD with initial learning rate 0.02, momentum 0.9, and weight decay
  • Batch size: 128
  • Total iterations: 1,052,000, with scheduled learning rate decays
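The listed budget can be turned into a rough epoch count, and a step schedule is one common way to realize "scheduled learning rate decays". In the sketch below, the iteration count, batch size, dataset size, and base learning rate come from the text above; the milestones and the decay factor `gamma` are hypothetical, since the actual schedule is not given here.

```python
ITERATIONS = 1_052_000
BATCH_SIZE = 128
DATASET_IMAGES = 5_100_000  # "5.1 million images", as stated above (rounded)

samples_seen = ITERATIONS * BATCH_SIZE
epochs = samples_seen / DATASET_IMAGES
print(f"{samples_seen:,} samples, about {epochs:.1f} epochs")
# 134,656,000 samples, about 26.4 epochs

# Hypothetical decay points; the source does not list the actual milestones.
HYPOTHETICAL_MILESTONES = (500_000, 800_000, 1_000_000)


def step_decay_lr(iteration, base_lr=0.02,
                  milestones=HYPOTHETICAL_MILESTONES, gamma=0.1):
    """Multiply the learning rate by gamma at every milestone already passed."""
    passed = sum(1 for m in milestones if iteration >= m)
    return base_lr * gamma ** passed


for it in (0, 500_000, 800_000, 1_000_000):
    print(it, step_decay_lr(it))
```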

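In condensed form, each training step extracts and L2-normalizes the embeddings, computes their angles to the L2-normalized class centers, replaces the ground-truth logit with the quadratic function of the angle, multiplies all logits by the scale factor, and applies cross-entropy before the SGD update.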
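The forward pass of one training step can be sketched in plain Python. This is a minimal illustration rather than the reference implementation: it assumes the quadratic logit $f(\theta) = a(\theta - h)^2 + k$ for the ground-truth class and $s \cos\theta$ for all others, uses hypothetical values for `A`, `H`, and `K`, and omits the backbone and the optimizer update, which a deep-learning framework would handle through automatic differentiation.

```python
import math

# Hypothetical a, h, k for illustration; S = 64 is assumed as a common scale.
A, H, K, S = -0.5, -0.3, 1.0, 64.0


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def x2_softmax_loss(features, centers, labels):
    """Mean cross-entropy with a quadratic logit on the ground-truth class."""
    unit_centers = [l2_normalize(w) for w in centers]
    total = 0.0
    for x, y in zip(features, labels):
        x = l2_normalize(x)
        logits = []
        for j, w in enumerate(unit_centers):
            cos = max(-1.0, min(1.0, sum(a * b for a, b in zip(x, w))))
            if j == y:
                theta = math.acos(cos)
                logits.append(S * (A * (theta - H) ** 2 + K))  # quadratic logit
            else:
                logits.append(S * cos)                          # plain cosine logit
        # Stable log-sum-exp, then the negative log-likelihood of the true class.
        peak = max(logits)
        log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
        total += log_sum - logits[y]
    return total / len(features)


centers = [[1.0, 0.0], [0.0, 1.0]]    # toy class centers (made up)
features = [[1.0, 0.1], [0.1, 1.0]]   # toy embeddings (made up)
print(x2_softmax_loss(features, centers, labels=[0, 1]))  # embeddings near own center
print(x2_softmax_loss(features, centers, labels=[1, 0]))  # embeddings near wrong center
```

Embeddings that sit near their own class center give a loss close to zero, and swapping the labels gives a much larger one.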

4. Empirical Results and Comparative Analysis

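X²-Softmax is competitive with fixed-margin losses across several standard face verification benchmarks. On the "LFW-style" evaluations using ResNet-50 features trained on MS1Mv3, it leads on LFW and trails the listed ArcFace or CosFace result by roughly 0.1 to 0.8 percentage points on the remaining benchmarks: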

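| Benchmark | X²-Softmax (%) | ArcFace (%) | CosFace (%) |
|-----------|----------------|-------------|-------------|
| LFW       | 99.82          | 99.77       | 99.75       |
| CALFW     | 95.92          | 96.03       | -           |
| CPLFW     | 91.67          | 92.05       | -           |
| AgeDB-30  | 97.83          | 98.18       | -           |
| CFP-FP    | 97.20          | -           | 98.01       |
| VGG2-FP   | 94.52          | 95.30       | -           |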

For IJB-B and IJB-C datasets (TAR at FAR):

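| Dataset | Metric           | X²-Softmax | ArcFace |
|---------|------------------|------------|---------|
| IJB-B   | TAR @ FAR = 1e-4 | 0.9495     | 0.9485  |
| IJB-B   | TAR @ FAR = 1e-5 | 0.9133     | 0.9092  |
| IJB-C   | TAR @ FAR = 1e-4 | 0.9624     | 0.9629  |
| IJB-C   | TAR @ FAR = 1e-5 | 0.9459     | 0.9449  |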
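For readers unfamiliar with the metric, TAR at FAR is the true accept rate at the similarity threshold where the false accept rate does not exceed a target. The sketch below shows one simple way to compute it from lists of similarity scores. The function name and the toy scores are illustrative assumptions, and official IJB evaluation protocols may differ in details such as tie handling or interpolation.

```python
def tar_at_far(genuine_scores, impostor_scores, far):
    """True accept rate at the strictest threshold whose false accept rate <= far."""
    ranked = sorted(impostor_scores, reverse=True)
    allowed = int(far * len(ranked))  # number of impostor pairs we may accept
    if allowed >= len(ranked):
        threshold = float("-inf")     # every pair is accepted
    else:
        threshold = ranked[allowed]   # accept only scores strictly above this
    accepted = sum(1 for s in genuine_scores if s > threshold)
    return accepted / len(genuine_scores)


# Toy cosine similarities (made up).
genuine = [0.9, 0.8, 0.4]
impostor = [0.5, 0.3, 0.2, 0.1]

print(tar_at_far(genuine, impostor, far=0.25))  # one impostor pair may pass
print(tar_at_far(genuine, impostor, far=0.0))   # no impostor pair may pass
```

At FAR values such as 1e-5, at least 100,000 impostor pairs are needed before even one false accept is allowed, which is why these operating points are reported on large benchmarks.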

Convergence is on par with ArcFace, but with slightly reduced variance in the loss. Cosine similarity histograms for IJB-C indicate that X²-Softmax delivers lower overlap between positive and negative pairs, evidencing improved feature discriminability.

Ablation studies on the quadratic parameters confirm an optimal trade-off at the selected setting, with larger curvature causing faster angle-dependent growth in margin but risking instability if too high.

5. Usage Guidelines and Optimization Strategies

The recommended defaults for X²-Softmax are the scale $s$ and quadratic parameters ($a$, $h$, $k$) from Section 3, together with SGD-based training with batch size 128 on large-scale face identification datasets.

Suggested tuning procedures:

  • For datasets with very tight inter-class angular separation, adjust the quadratic parameters ($a$, $h$, $k$) to enhance margin strength.
  • If instability or poor convergence is observed, adjust them in the opposite direction to moderate the margin's penalization.
  • Regularly monitor a validation set to calibrate positive/negative pair overlap in the cosine distribution.
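The validation check in the last item can be made concrete with a histogram overlap score between positive-pair and negative-pair cosine similarities. The sketch below is one possible monitor, not a procedure from the source: the bin count, the function names, and the toy scores are all assumptions.

```python
def histogram(scores, bins=20, lo=-1.0, hi=1.0):
    """Normalized histogram of cosine similarities over [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for s in scores:
        idx = min(bins - 1, max(0, int((s - lo) / width)))
        counts[idx] += 1
    return [c / len(scores) for c in counts]


def overlap(positive_scores, negative_scores, bins=20):
    """Shared histogram mass: 0.0 means fully separated, 1.0 means identical."""
    hp = histogram(positive_scores, bins)
    hn = histogram(negative_scores, bins)
    return sum(min(p, n) for p, n in zip(hp, hn))


# Toy validation scores (made up).
positives = [0.85, 0.80, 0.72, 0.55, 0.31]
negatives = [0.05, -0.10, 0.12, 0.33, 0.20]
print(f"overlap = {overlap(positives, negatives):.2f}")
```

Tracking this number across checkpoints or hyperparameter settings gives a single scalar to compare; lower values indicate better-separated positive and negative pairs.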

Noted limitations include the fixed quadratic form of the margin. More sophisticated or learnable margin functions, such as higher-degree polynomials or per-class adaptations, may capture inter-class relations more precisely, though at increased computational cost. Extensions to multi-scale or angular-adaptive losses (e.g., conditioning on feature norm) provide promising research directions.

6. Significance and Prospects

Through its quadratic, angular-adaptive margin, X²-Softmax provides discriminative feature learning that better adapts to class separation in face recognition tasks. This methodology yields improved or comparable accuracy on standard benchmarks and enhances feature separability through stronger, context-sensitive margins for well-separated classes. Future investigation into learnable or more adaptive margin strategies may further refine performance in complex, large-scale recognition settings (Xu et al., 2023).
