Multistage Focal Loss Function
- Multistage focal loss is a dynamic training strategy that adapts its focusing parameter across discrete phases to better handle hard examples and class imbalance.
- It transitions from a stable convex phase to increasingly nonconvex phases, ensuring robust optimization and progressive hard-case mining.
- Empirical studies show that this staged loss approach improves key metrics like recall and AUC in applications such as fraud detection and medical image analysis.
A multistage focal loss function refers to a training strategy or loss design in which the focus on hard-to-classify examples and the treatment of class imbalance are adaptively adjusted across discrete training phases or architectural components. This approach extends the well-known focal loss framework—which was originally proposed to address foreground-background class imbalance in dense detection—by leveraging dynamic or staged hyperparameter selection, architectural modularity, or progressive enhancement across multiple loss regimes. Multistage focal loss schemes have recently been formulated and empirically validated for improved robustness and accuracy in highly imbalanced tasks, notably in domains such as fraud detection, dense object detection, and medical image analysis.
1. Background: Focal Loss and Class Imbalance
The focal loss, initially introduced for dense object detection, is defined as:

$$\mathrm{FL}(p_t) = -\,\alpha_t\,(1 - p_t)^{\gamma}\,\log(p_t),$$

where $p_t$ is the model-predicted probability for the target class, $\alpha_t$ is a class-balancing weight, and $\gamma$ is the focusing parameter. The focal loss down-weights the loss assigned to well-classified (high $p_t$) examples, thus mitigating the effect of the vast majority of easy negatives in extremely imbalanced settings. The introduction of $\gamma$ (typically set to values such as $2$) makes the loss nonconvex, which can have implications for optimization stability and susceptibility to local minima (Lin et al., 2017).
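For concreteness, the following is a minimal PyTorch-style sketch of the binary focal loss defined above; the function name and default hyperparameter values are illustrative rather than taken from the cited papers.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss: -alpha_t * (1 - p_t)^gamma * log(p_t).

    `targets` are float labels in {0, 1}; `alpha` and `gamma` are the standard
    balancing and focusing hyperparameters (default values are illustrative).
    """
    # Per-example cross-entropy, -log(p_t), computed stably from logits.
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    # p_t is the probability assigned to the true class of each example.
    p_t = p * targets + (1.0 - p) * (1.0 - targets)
    # alpha_t balances positive vs. negative examples.
    alpha_t = alpha * targets + (1.0 - alpha) * (1.0 - targets)
    # The (1 - p_t)^gamma factor down-weights well-classified examples.
    return (alpha_t * (1.0 - p_t) ** gamma * ce).mean()
```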
One-stage detection frameworks such as RetinaNet have demonstrated the efficacy of focal loss in both accuracy and inference speed relative to two-stage detectors, confirming its centrality for handling sample imbalance (Wang et al., 2018).
2. Formalization of Multistage Focal Loss
A multistage focal loss function employs distinct convex and nonconvex regimes, dividing the training process into several discrete phases wherein the modulating parameter $\gamma$ (or its equivalent in a convex approximation) is increased between phases. The formal structuring is as follows (Boabang et al., 4 Aug 2025):
Phase | Loss Function Formulation | Regime | Range of γ |
---|---|---|---|
1. Convex phase (early epochs) | $-\alpha_t \log(p_t)$ (focal loss with $\gamma \approx 0$, i.e., weighted cross-entropy) | Convex | $\gamma \approx 0$ |
2. Intermediate phase | $-\alpha_t (1 - p_t)^{\gamma} \log(p_t)$ with moderate $\gamma$ | Mildly nonconvex | e.g., $\gamma = 2$ |
3. Nonconvex phase (late) | $-\alpha_t (1 - p_t)^{\gamma} \log(p_t)$ with large $\gamma$ | Nonconvex | Typically $\gamma \geq 2$ (e.g., $\gamma = 4$) |
- During the convex phase, the surrogate loss ensures an optimization landscape without poor local minima: with $\gamma \approx 0$ the loss reduces to a weighted cross-entropy, whose positive second derivative with respect to the logit renders it strictly convex.
- Nonconvexity is then introduced gradually, with $\gamma$ increased between phases to further down-weight easy examples and sharpen the focus on hard cases in later training.
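One compact way to write the staged loss, assuming the γ values and phase boundaries $E_1$, $E_2$ recommended in Section 7 (this particular piecewise schedule is illustrative, not the only possibility), is:

$$
\mathcal{L}_{\text{MS}}(p_t; e) = -\,\alpha_t\,(1 - p_t)^{\gamma(e)}\,\log(p_t),
\qquad
\gamma(e) =
\begin{cases}
0, & e \le E_1 \ \text{(convex phase)},\\
2, & E_1 < e \le E_2 \ \text{(intermediate phase)},\\
4, & e > E_2 \ \text{(nonconvex phase)},
\end{cases}
$$

where $e$ denotes the current training epoch.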
3. Motivation and Theoretical Properties
Optimization Stability
Initialization with a convex variant ensures that the model first learns generalizable global patterns and avoids the risk of early trapping in local minima—a limitation known for nonconvex losses (Boabang et al., 4 Aug 2025).
Progressive Hard Case Mining
Progressively increasing $\gamma$ or otherwise altering the degree of the loss's polynomial modulating term enforces a “progressive focus” on misclassified or rare examples as the model converges, facilitating improved discrimination between classes in highly imbalanced distributions (Wu et al., 2021). This multistage attention shift directly addresses the fact that a single static focusing parameter is suboptimal across all training epochs.
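As a small worked illustration of this progressive focusing, the modulating factor $(1 - p_t)^{\gamma}$ can be computed directly for an easy and a harder example:

```python
# Illustrative only: how the modulating factor (1 - p_t)^gamma shrinks the
# contribution of easy examples as gamma increases across phases.
for p_t in (0.9, 0.6):           # an easy and a harder example
    for gamma in (0, 2, 4):      # convex, intermediate, nonconvex phases
        print(f"p_t={p_t}, gamma={gamma}: factor={(1 - p_t) ** gamma:.4f}")
# For p_t = 0.9 the factor falls from 1.0 (gamma=0) to 0.01 (gamma=2) to 0.0001
# (gamma=4), while for p_t = 0.6 it falls only to 0.16 and 0.0256, so hard
# examples increasingly dominate the loss in later phases.
```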
4. Empirical Validation and Performance
Extensive experimental results in auto insurance fraud detection demonstrate that models utilizing the multistage focal loss achieve superior performance compared to single-phase convex or fixed-γ (nonconvex) regimes (Boabang et al., 4 Aug 2025). On a real insurance dataset:
Schedule | Loss | Accuracy | Precision | Recall | F1-score | AUC |
---|---|---|---|---|---|---|
Convex only | 0.6592 | 0.6011 | 0.6013 | 0.6028 | 0.6017 | 0.6538 |
Nonconvex (γ=2) | 0.1634 | 0.6011 | 0.6013 | 0.6028 | 0.6017 | 0.6538 |
Nonconvex (γ=4) | 0.0409 | 0.6074 | 0.6094 | 0.6124 | 0.6107 | 0.6766 |
Multistage | 0.0428 | 0.6277 | 0.6270 | 0.6602 | 0.6346 | 0.6828 |
This demonstrates that the staged transition from convex to nonconvex regimes yields improvements in key classification metrics including recall and AUC, both critical for fraud detection applications. These results are attributed to the model’s ability to learn broad discriminative features early before dedicating capacity to rare/minority-class examples in later training.
5. Interpretation and Model Explainability
The integration of SHAP (SHapley Additive exPlanations) analysis provides insights into feature attributions across training stages (Boabang et al., 4 Aug 2025). It is observed that:
- During the convex stage, SHAP values reflect broad, general patterns, with feature influences distributed more uniformly.
- In nonconvex phases, feature attributions highlight rare or complex decision boundaries, especially those contributing to minority class detection (fraudulent cases).
For the multistage approach, the dispersion of SHAP values across features suggests that the staged strategy avoids excessive reliance on any single feature, improving model robustness by balancing global pattern learning with specialized focus on hard examples.
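A minimal sketch of stage-wise attribution analysis with the `shap` package is shown below; `model` and `X_val` are placeholders for the trained classifier and a validation feature matrix captured at the end of each training phase, and this usage is an assumption rather than the exact analysis pipeline of the cited work.

```python
import shap

# Fit a model-agnostic explainer on the current-phase model and validation data.
explainer = shap.Explainer(model, X_val)
shap_values = explainer(X_val)   # per-example, per-feature attributions

# Repeating this after the convex, intermediate, and nonconvex phases and
# comparing the beeswarm plots shows how attributions shift from broad
# patterns toward minority-class (e.g., fraud) cues.
shap.plots.beeswarm(shap_values)
```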
6. Relation to Broader Adaptive and Hierarchical Focal Loss Variants
The multistage approach is distinct from adaptive schedule methods (where γ is continuously updated based on sample-hardness or performance (Weber et al., 2019)). In the multistage paradigm, discrete transitions in the loss regime correspond to stages in the training process, providing curriculum-like stability and robust convergence—a principle also employed in hierarchical progressive focus for multi-scale detectors (Wu et al., 2021).
Moreover, the multistage focal loss can be conceptualized as a generalization of the static focal loss: it encompasses methods in which the loss landscape is engineered (through staged convexity or polynomial-term adjustments) to facilitate stable optimization followed by aggressive minority mining, with broad applicability to tabular, detection, and segmentation tasks suffering from severe class imbalance.
7. Practical Implementation and Recommendations
For practical deployment, the following staged schedule is recommended (a minimal code sketch follows the list):
- Begin training with a convex (γ ≈ 0) focal loss for E₁ epochs.
- Transition to an intermediate or full nonconvex focal loss (γ = 2, then γ = 4 or task-specific optimal) for subsequent training phases.
- The transition points (E₁, E₂) and the rate of γ increase should be set based on observed validation loss convergence or through hyperparameter search.
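The sketch below illustrates this schedule, assuming the `focal_loss` helper from the sketch in Section 1 and placeholder `model`, `optimizer`, and `train_loader` objects; the values of E₁, E₂, and the per-phase γ are purely illustrative and would be tuned in practice.

```python
# Hypothetical phase boundaries and per-phase focusing parameters.
E1, E2, TOTAL_EPOCHS = 10, 25, 40
PHASE_GAMMAS = [(E1, 0.0), (E2, 2.0), (TOTAL_EPOCHS, 4.0)]

def gamma_for_epoch(epoch):
    """Map the current epoch to the focusing parameter of its phase."""
    for boundary, gamma in PHASE_GAMMAS:
        if epoch < boundary:
            return gamma
    return PHASE_GAMMAS[-1][1]

for epoch in range(TOTAL_EPOCHS):
    gamma = gamma_for_epoch(epoch)
    for inputs, targets in train_loader:      # placeholder DataLoader
        logits = model(inputs)                # placeholder model
        loss = focal_loss(logits, targets, gamma=gamma)  # sketch from Section 1
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

Monitoring validation loss at the end of each phase is a natural way to confirm that the transition points were not set too early.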
The described scheme is robust in settings with high class imbalance and is appropriate for tasks where both initial training stability and a sharp focus on hard examples are essential. Integration with explainable AI tools such as SHAP supports transparency and validation in regulated settings.
In summary, the multistage focal loss function enhances the foundational focal loss by introducing a staged, curriculum-inspired sequence from convex to nonconvex training. Empirical studies confirm its advantage in robustness, accuracy, and recall in highly imbalanced classification, with explainability tools supporting model transparency and trust in real-world deployments (Boabang et al., 4 Aug 2025).