Distribution-Balanced Focal Loss

Updated 2 December 2025
  • Distribution-Balanced Focal Loss is a method that integrates dynamic focusing with class-dependent weighting to address class imbalance in challenging classification tasks.
  • It incorporates techniques like static weight adjustments, linear scheduling, and gradient-based adaptations to improve model focus on difficult or rare examples.
  • Empirical results in anti-spoofing and object detection demonstrate measurable improvements in error rates and average precision, validating its effectiveness in imbalanced scenarios.

Distribution-Balanced Focal Loss (BFL) refers to a family of loss functions that combine the principles of focal loss—dynamically emphasizing difficult, misclassified, or rare examples—with explicit mechanisms for rectifying class imbalance. These techniques are particularly effective in domains with highly skewed class distributions or many hard-to-classify examples, such as anti-spoofing in speaker verification and long-tailed or dense object detection.

1. Core Motivation and Definition

The primary motivation for distribution-balanced focal losses is twofold:

  • Many real-world classification tasks exhibit significant class imbalance, where rare classes are underrepresented and tend to be harder for models to classify, and
  • Standard loss functions such as cross-entropy do not adequately focus on hard (ambiguous or misclassified) samples.

Focal loss modulates the standard log-loss for each sample by a factor $(1-p_t)^\gamma$, where $p_t$ is the predicted probability assigned to the ground-truth class. This down-weights the contribution of well-classified (easy) samples and places greater emphasis on hard examples. However, focal loss alone does not account for global class imbalance.

Distribution-balanced focal losses, such as Balanced Focal Loss (BFL), Balance-Oriented Focal Loss (BOFL), and Equalized Focal Loss (EFL), augment the original focal loss with additional mechanisms—class-dependent weighting, dynamic scheduling, or adaptive focusing factors—to rectify class imbalance at both the loss and gradient level (Dou et al., 2020, Gil et al., 2020, Li et al., 2022).

2. Mathematical Formulations

The general form of a distribution-balanced focal loss can be illustrated by the Balanced Focal Loss (BFL) (Dou et al., 2020):

$$\mathcal{L}_{\mathrm{BFL}} = -\frac{1}{N}\sum_{i=1}^{N} \alpha_{y_i}\,(1 - p_{t,i})^{\gamma}\,\log(p_{t,i})$$

Where:

  • $y_i \in \{0,1\}$ is the ground-truth class,
  • $p_{t,i} = p_i$ if $y_i = 1$ and $p_{t,i} = 1 - p_i$ if $y_i = 0$,
  • $\alpha_{y_i}$ is a static class weight, often inversely proportional to the number of training samples of class $y_i$ and normalized to sum to 2 (see the worked example after this list),
  • $\gamma \geq 0$ is the focusing parameter controlling loss attenuation for easy examples.
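
As a concrete illustration of the weight normalization (the class counts here are hypothetical, not taken from the cited papers): with $N_0 = 9000$ majority-class and $N_1 = 1000$ minority-class training samples, the inverse-frequency weights normalized to sum to 2 are

$$\alpha_0 = \frac{2/N_0}{1/N_0 + 1/N_1} = 0.2, \qquad \alpha_1 = \frac{2/N_1}{1/N_0 + 1/N_1} = 1.8,$$

so each minority-class sample carries nine times the static weight of a majority-class sample before the focal factor is applied.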

More advanced variants further refine the weighting and focus. For instance, BOFL (Gil et al., 2020) introduces batch-wise class weights and epoch-dependent (linearly scheduled) scaling:

$$w_{i,k} = \hat{\alpha}_i(t)\cdot \eta^{\,n_{i,k}}$$

where $\hat{\alpha}_i(t)$ is ramped from 1 to the full inverse-frequency value across epochs, and $n_{i,k}$ is the number of occurrences of class $i$ in batch $k$.
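
A minimal sketch of this batch- and epoch-wise weighting, assuming inverse-frequency base weights and a linear per-epoch ramp (the function name, normalization choice, and default $\eta$ are illustrative, not taken from the BOFL reference implementation):

import numpy as np

def bofl_weights(class_counts, batch_labels, epoch, ramp_epochs, eta=0.9):
    """Per-sample weights w_{i,k} = alpha_hat_i(t) * eta**n_{i,k} (illustrative sketch)."""
    class_counts = np.asarray(class_counts, dtype=float)
    # Inverse-frequency base weights; the exact normalization is a design choice
    # (the article describes normalizing so the weights sum to 2; here the mean weight is 1).
    alpha_full = 1.0 / class_counts
    alpha_full *= len(class_counts) / alpha_full.sum()
    # Linear ramp from uniform weights (all ones) to the full inverse-frequency weights.
    t = min(epoch / float(ramp_epochs), 1.0)
    alpha_hat = (1.0 - t) * np.ones_like(alpha_full) + t * alpha_full
    # Batch-wise modulation: a class appearing n_{i,k} times in the batch is scaled by eta**n_{i,k}.
    batch_labels = np.asarray(batch_labels)
    n_in_batch = np.bincount(batch_labels, minlength=len(class_counts))
    w_per_class = alpha_hat * (eta ** n_in_batch)
    return w_per_class[batch_labels]  # one weight per sample in the batch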

EFL (Li et al., 2022) applies a category-specific focusing factor $\gamma^j = \gamma_b + \gamma_v^j$, where $\gamma_v^j$ is dynamically elevated for under-trained classes based on per-class running averages of gradient statistics.
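
To make the effect concrete, using the parameterization $\gamma_v^j = s(1 - g^j)$ given in Section 5 and assuming a base factor $\gamma_b = 2$ (the vanilla focal-loss default): a well-balanced head class with $g^j \approx 1$ keeps $\gamma^j \approx 2$, whereas a hypothetical under-trained tail class with $g^j = 0.2$ and $s = 8$ receives

$$\gamma^j = \gamma_b + s\,(1 - g^j) = 2 + 8 \times 0.8 = 8.4,$$

so its easy examples are attenuated far more aggressively.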

3. Derivation and Rationale for Dynamic Terms

Distribution-balanced focal losses emerge by sequentially addressing two limitations of standard loss designs:

a) Standard cross-entropy loss is dominated by the majority class, since frequent classes contribute disproportionately to the total loss.

b) Even loss functions with class weighting (e.g., a per-class factor $\alpha_t$) are still dominated by easy examples, especially early in training, because $-\log p_t$ remains non-negligible for well-predicted samples and such samples vastly outnumber the hard ones.

The focal loss component $(1-p_t)^\gamma$ resolves (b) by driving the loss toward zero for correctly classified cases and increasing the relative contribution of difficult or rare samples—this is the dynamic focusing factor. Explicit class weights, adaptively scheduled or batch-scaled, address (a) by restoring balance between rare and common classes.
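
As a numerical illustration of this attenuation, take $\gamma = 2$ and two hypothetical predictions, an easy sample with $p_t = 0.95$ and a hard sample with $p_t = 0.3$:

$$\frac{(1-0.3)^2}{(1-0.95)^2} = \frac{0.49}{0.0025} = 196,$$

i.e., the modulating factor weights the hard sample's loss term roughly 200 times more strongly than the easy sample's.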

Advanced methods such as EFL replace static focusing parameters with category- and iteration-dependent values, guided by the instantaneous balance of positive and negative gradient magnitudes. This allows tail classes to receive significantly higher focus, while the overall gradient magnitude is rescaled to prevent total loss starvation for those classes.

4. Implementation and Integration

The following archetypical recipe illustrates integration (presented for BFL) (Dou et al., 2020):

import torch
import torch.nn.functional as F

# model, optimizer, train_loader, num_epochs, gamma, and alpha_dict (static class weights)
# are assumed to be defined elsewhere; eps guards against log(0).
alpha = torch.tensor([alpha_dict[0], alpha_dict[1]], dtype=torch.float32)
eps = 1e-8

for epoch in range(num_epochs):
    for x_batch, y_batch in train_loader:
        logits = model(x_batch)                      # shape: (batch_size, 2)
        probs = F.softmax(logits, dim=1)
        # Probability assigned to the ground-truth class of each sample.
        p_t = torch.where(y_batch == 1, probs[:, 1], probs[:, 0])
        alpha_t = alpha[y_batch]                     # static class weight per sample
        focal_factor = (1.0 - p_t).pow(gamma)        # attenuates easy examples
        loss = -(alpha_t * focal_factor * torch.log(p_t + eps)).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

Notes:

  • $\alpha_y$ is normalized globally; any constant rescaling can be absorbed into the learning rate.
  • For BOFL, the per-class weights are further modulated by epoch-wise and batch-wise scheduling; see (Gil et al., 2020) for the full pseudocode.
  • For EFL, an exponential moving average of per-class accumulated gradient ratios is maintained, and the per-class $\gamma^j$ is updated each iteration from the ratio $g^j$; loss backpropagation and gradient accumulation require no structural modification to standard one-stage detector pipelines (Li et al., 2022). A minimal sketch of this update follows this list.
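
A minimal sketch of the per-class focusing-factor update, assuming the $\gamma^j = \gamma_b + s(1 - g^j)$ parameterization described in Section 5 and a base factor default of $\gamma_b = 2$ (the class name, initialization, and the way per-class gradient sums are supplied are illustrative, not the authors' reference implementation):

import torch

class EFLFocusTracker:
    """Tracks the per-class gradient-balance measure g^j and derives gamma^j = gamma_b + s * (1 - g^j)."""

    def __init__(self, num_classes, gamma_b=2.0, s=8.0, momentum=0.95, eps=1e-6):
        self.gamma_b, self.s, self.momentum, self.eps = gamma_b, s, momentum, eps
        # Start from g^j = 1 (treated as balanced), so training begins with gamma^j = gamma_b.
        self.g = torch.ones(num_classes)

    def update(self, pos_grad_batch, neg_grad_batch):
        # pos_grad_batch / neg_grad_batch: per-class accumulated gradient magnitudes for this iteration.
        ratio = (pos_grad_batch / (neg_grad_batch + self.eps)).clamp(max=1.0)
        m = self.momentum
        self.g = m * self.g + (1.0 - m) * ratio   # exponential moving average of the ratio

    def gammas(self):
        # Under-trained (tail) classes have small g^j and therefore a larger focusing factor.
        return self.gamma_b + self.s * (1.0 - self.g)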

5. Hyperparameter Selection and Scheduling

Key hyperparameters and heuristics across distribution-balanced focal losses are:

  • Class weights $\alpha_y$: derived as $\propto 1/N_y$ per class and normalized such that $\sum_i \alpha_i = 2$ (BFL, BOFL). For BOFL, linear scheduling ramps the weights from unity to the full inverse-frequency values over the initial epochs, using a normalized ramp parameter $\lambda$ and deferred re-weighting (e.g., activated after $E_0$ epochs).
  • Focusing parameter $\gamma$: for BFL and BOFL, a grid search over $\gamma \in \{0, 1, 2, 5, 10\}$ identified $\gamma = 2$ as optimal.
  • Batch-wise scaling (BOFL): the per-class, per-batch weights $w_{i,k}$ use an additional hyperparameter $\eta \in [0,1]$ to further down-weight head classes within each batch.
  • Dynamic focusing/weighting (EFL): $\gamma^j = \gamma_b + s(1 - g^j)$, with $s$ (recommended $s = 8$) scaling the class-specific focus and an EMA momentum $\lambda \in [0.9, 0.99]$ governing update smoothness for the balance measure $g^j$. These reported settings are collected in the configuration sketch after this list.
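
A hedged summary of the settings above as a single configuration sketch (the grouping is illustrative; each paper defines its hyperparameters independently, and the placeholder values are assumptions, not reported numbers):

# Reported / recommended settings per loss variant (illustrative grouping).
HYPERPARAMS = {
    "BFL":  {"gamma": 2.0},                                    # alpha_y: inverse frequency, normalized to sum to 2
    "BOFL": {"gamma": 2.0, "eta": 0.9},                        # eta in [0, 1]; 0.9 is a placeholder, tune per task
    "EFL":  {"gamma_b": 2.0, "s": 8.0, "ema_momentum": 0.95},  # gamma_b assumed = vanilla focal default; momentum in [0.9, 0.99]
}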

6. Empirical Performance and Comparative Analysis

Empirical evaluations across domains consistently show that distribution-balanced focal losses outperform both standard cross-entropy and vanilla focal loss, whether or not those baselines are augmented with additional balancing mechanisms.

Anti-Spoofing / Replay Attack Detection (ASVspoof2019, BFL) (Dou et al., 2020):

  • ResNet+MGD-gram:
    • BCE: min-tDCF=0.0288, EER=1.07%
    • BFL ($\gamma = 2$): min-tDCF=0.0257 (↓11%), EER=1.04% (↓3%)
  • 3-model fusion (STFT, MGD, CQT):
    • BCE: min-tDCF=0.0151, EER=0.61%
    • BFL: min-tDCF=0.0124 (↓18%), EER=0.55% (↓10%)
  • BFL particularly reduces errors on the hardest (AA-type) replay attacks.

Object Detection:

  • BOFL (MS-COCO, CenterNet) (Gil et al., 2020):
    • Focal Loss baseline: 26.4 AP
    • BOFL: 27.6 AP (+1.2), improvement retained across other backbones (MobileNetV3, ResNet-18, DLA-34)
    • Outperforms class-balanced focal loss and Equalization Loss, whose direct application degrades AP relative to BOFL's linearly scheduled balancing.
  • EFL (LVIS v1, dense long-tailed detection) (Li et al., 2022):
    • Baseline (standard FL, ResNet-50): AP=25.7
    • EFL: AP=27.5 (+1.8), AP_r=20.2 (+5.9 in "rare" categories)
    • EFL (ResNet-101): AP=29.2 (+2.2), AP_r=23.5 (+9.1)
    • Outperforms two-stage balancing losses (EQL, EQLv2, Seesaw) by 1–2 AP.

Empirical analyses indicate that linearly scheduled, batch-wise, or dynamically focused losses prevent destabilization early in training and mitigate imbalance more effectively, especially for conventional (rather than extreme few-shot) long-tailed distributions and for high-impact rare samples or classes.

7. Methodological Differentiation and Extensions

Distribution-balanced focal losses are distinct from other balancing losses as follows:

| Loss Family | Static Class Weight | Dynamic Focusing Factor | Batch-wise/Gradient Modulation | Scheduling |
|---|---|---|---|---|
| Standard Focal Loss (Lin et al.) | $\alpha_t$ | $(1-p_t)^\gamma$ | No | No |
| BFL (Dou et al., 2020) | $\alpha_y$ | $(1-p_t)^\gamma$ | No | No |
| BOFL (Gil et al., 2020) | $\hat{\alpha}_i(t)$ | $(1-p_{t,i})^\gamma$ | Batch- and epoch-wise | Linear epoch ramp |
| EFL (Li et al., 2022) | $\alpha_t^j$ | $(1-p_t^j)^{\gamma^j}$ | Per-class gradient statistics | EMA-updated per iteration |

EFL generalizes focal loss by making the focusing exponent category- and iteration-dependent; BOFL extends inverse-frequency reweighting smoothly in time and across batches; BFL is a key representative of the original class-weighted focal loss paradigm, particularly impactful in anti-spoofing.

A notable implication is that optimal reweighting and focusing cannot be achieved through static class priors alone—both dynamic scheduling (BOFL) and gradient statistic feedback (EFL) are essential for realizing full gains in difficult, imbalanced tasks.

8. Limitations and Future Directions

Distribution-balanced focal losses require careful batch and class-statistics computation and potentially large batch sizes for stable gradient estimates (EFL). Extremely stochastic datasets or tiny batch regimes may impede the dynamic adaptation process. Additionally, the normalization and scheduling of class weights must be coordinated with learning rate policies to ensure stable convergence.

A plausible implication is that further generalization of these losses could involve meta-learned focusing and weighting parameters, adaptive scheduling, or integration into auxiliary sample selection or augmentation policies.


Key References:

  • Dynamically Mitigating Data Discrepancy with Balanced Focal Loss for Replay Attack Detection (Dou et al., 2020)
  • Balance-Oriented Focal Loss with Linear Scheduling for Anchor Free Object Detection (Gil et al., 2020)
  • Equalized Focal Loss for Dense Long-Tailed Object Detection (Li et al., 2022)