
Feature Balancing Loss (FBL) Overview

Updated 16 December 2025
  • Feature Balancing Loss (FBL) is a method for mitigating representation bias in deep neural networks by applying a class-specific, curriculum-controlled negative logit bias.
  • It scales the bias based on class frequency and feature norm, encouraging robust feature learning and improving tail-class discriminability.
  • Empirical benchmarks on CIFAR-10-LT, CIFAR-100-LT, ImageNet-LT, Places-LT, and iNaturalist 2018 demonstrate that FBL significantly narrows accuracy gaps between head and tail classes.

Feature Balancing Loss (FBL) addresses representation bias in deep neural networks exposed to long-tailed class distributions, where a few classes (head) dominate and many others (tail) are underrepresented. The core principle of FBL is to counteract the smaller feature norms and weaker representation learning typical for tail classes. It achieves this by imposing a calibrated negative bias on the logits of each class during training, scaling the stimulus based on class frequency and gradually increasing its influence through a curriculum schedule. This mechanism encourages neural networks to develop more robust and discriminative features for tail classes, thereby reducing accuracy gaps without sacrificing head-class performance (Li et al., 2023).

1. Mathematical Formulation and Objective

Define a dataset $T$ of $N$ samples across $C$ classes. Each sample $x$ is mapped by the neural network to a feature vector $f \in \mathbb{R}^D$ from the penultimate layer. The classifier weights are $W = [w_1, \dots, w_C] \in \mathbb{R}^{D \times C}$, and the standard logit for class $j$ is $z_j = w_j^T f$. Denote $n_j$ as the number of training samples in class $j$, and $n_{\max} = \max_j n_j$.

FBL introduces a class-wise negative stimulus $\lambda_j = \log n_{\max} - \log n_j$, ensuring $\lambda_j \ge 0$, with the largest value for the rarest (tail) classes and zero for the most frequent class. The scheduled strength $\alpha(t)$ modulates the effect over training epochs.

The feature-balanced logit for class $j$ is

$$z^b_j = z_j - \alpha(t)\,\frac{\lambda_j}{\|f\|_2}.$$

The FBL loss for a mini-batch of $B$ samples is

$$L_{\text{FBL}} = -\frac{1}{B} \sum_{i=1}^{B} \log \frac{\exp(z^b_{y_i})}{\sum_{j=1}^{C} \exp(z^b_j)},$$

where $y_i$ is the ground-truth class of sample $i$. The formulation is plug-and-play, replacing the standard logits in the cross-entropy with $z^b_j$.
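A minimal PyTorch-style sketch of this loss is shown below. The class name `FeatureBalancingLoss`, its constructor arguments, and the per-call `alpha` argument are illustrative assumptions rather than the authors' released implementation; the arithmetic follows the definitions above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureBalancingLoss(nn.Module):
    """Sketch of FBL: softmax cross-entropy over feature-balanced logits."""

    def __init__(self, class_counts, eps=1e-12):
        super().__init__()
        counts = torch.as_tensor(class_counts, dtype=torch.float)
        # lambda_j = log n_max - log n_j  (zero for the most frequent class)
        self.register_buffer("lam", torch.log(counts.max()) - torch.log(counts))
        self.eps = eps

    def forward(self, logits, features, targets, alpha):
        # logits: (B, C) = W^T f, features: (B, D), targets: (B,), alpha: scalar
        feat_norm = features.norm(p=2, dim=1, keepdim=True).clamp_min(self.eps)
        # z^b_j = z_j - alpha(t) * lambda_j / ||f||_2, broadcast to (B, C)
        balanced_logits = logits - alpha * self.lam / feat_norm
        return F.cross_entropy(balanced_logits, targets)
```

In practice `alpha` would be computed once per epoch from one of the curriculum schedules described in Section 3.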

2. Feature Norm Dynamics and Rationale

Under standard softmax cross-entropy, classifier weight norms $\|w_j\|$ increase with class frequency $n_j$. Tail-class feature norms $\|f\|$ also tend to collapse, yielding smaller logits and degraded tail performance. For a fixed classifier, increasing $\|f\|$ monotonically decreases the softmax loss for correctly classified samples, effectively improving classification margins.
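To make this claim concrete, write $f = \|f\|\,\hat{f}$ with unit direction $\hat{f}$, so that $z_j = \|f\|\, w_j^\top \hat{f}$; the short derivation below is a standard argument under the fixed-classifier assumption, not an excerpt from the source. For a correctly classified sample with label $y$, i.e. $w_y^\top \hat{f} > w_j^\top \hat{f}$ for all $j \neq y$, the cross-entropy loss is

$$L_{\mathrm{CE}} = \log\Bigl(1 + \sum_{j \neq y} \exp\bigl(\|f\|\,(w_j^\top \hat{f} - w_y^\top \hat{f})\bigr)\Bigr),$$

in which every exponent is negative and scales linearly with $\|f\|$, so the loss decreases monotonically as $\|f\|$ grows.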

By adding the negative bias $-\alpha(t)\,\lambda_j / \|f\|_2$ to the logit, FBL compensates for this imbalance. The network is incentivized to increase tail-class feature norms to overcome the bias, equalizing feature-space geometry across classes. This mechanism encourages robust clustering of tail-class features, stabilizing mean feature norms and mitigating mode collapse (Li et al., 2023).

3. Curriculum Scheduling Strategies

A key aspect of FBL is the curriculum-based scaling of the bias via $\alpha(t)$, where $t$ is the current epoch and $T$ the total number of epochs. If the stimulus is applied too early or too strongly, gradients on head classes are insufficient, impairing their accuracy. Thus, $\alpha(t)$ is set to grow gradually, with several schedules evaluated:

  • Linear increase: $\alpha(t) = t/T$
  • Sinusoidal: $\alpha(t) = \sin\!\left(\frac{\pi}{2}\frac{t}{T}\right)$
  • Cosine: $\alpha(t) = 1 - \cos\!\left(\frac{\pi}{2}\frac{t}{T}\right)$
  • Parabolic (default and best-performing): $\alpha(t) = (t/T)^2$

Early epochs approximate standard cross-entropy, prioritizing initial learning of head classes. Later, the effect increases, primarily benefiting tail-class representation.
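The schedules above are simple functions of training progress; a minimal sketch is given below (the function name and signature are illustrative, not taken from the reference code).

```python
import math

def alpha_schedule(t, T, kind="parabolic"):
    """Curriculum strength alpha(t) in [0, 1] for epoch t of T total epochs."""
    progress = t / T
    if kind == "linear":
        return progress
    if kind == "sinusoidal":
        return math.sin(0.5 * math.pi * progress)
    if kind == "cosine":
        return 1.0 - math.cos(0.5 * math.pi * progress)
    if kind == "parabolic":  # default and best-performing schedule
        return progress ** 2
    raise ValueError(f"unknown schedule: {kind}")
```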

4. Implementation and Training Considerations

Implementing FBL requires minimal modification to standard neural network training:

  • Compute $\|f\|_2$ after the embedding layer; add a small $\epsilon$ if necessary for numerical stability.
  • Precompute $\lambda_j$ for all classes from the training class counts.
  • At each training step, determine $\alpha(t)$ using the selected schedule.
  • Compute the bias $\alpha(t)\,\lambda_j / \|f\|_2$ and subtract it from the logits before applying the softmax cross-entropy.
  • FBL is compatible with standard weight decay, data augmentation, and other loss terms.
  • No additional normalization beyond standard practice for weights or features is required.

SGD with momentum and conventional learning-rate schedules are used for optimization. No hyperparameters beyond the curriculum schedule and the precomputed $\lambda_j$ are required.
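Putting the pieces together, a training step might look like the following sketch. The `model` (assumed to return both logits and penultimate features), `train_loader`, `class_counts`, `total_epochs`, and the `FeatureBalancingLoss` / `alpha_schedule` helpers sketched earlier are illustrative assumptions, not the authors' released code; hyperparameter values are placeholders.

```python
import torch

# Assumed available: model(x) -> (logits, features), train_loader yielding (x, y),
# class_counts with n_j per class, total_epochs, and the helpers sketched above.
criterion = FeatureBalancingLoss(class_counts)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=2e-4)

for epoch in range(total_epochs):
    alpha = alpha_schedule(epoch, total_epochs, kind="parabolic")
    for x, y in train_loader:
        logits, features = model(x)          # W^T f and penultimate features
        loss = criterion(logits, features, y, alpha)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```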

5. Empirical Benchmarks and Effectiveness

FBL has been evaluated on standard long-tailed recognition benchmarks including CIFAR-10-LT, CIFAR-100-LT, ImageNet-LT, Places-LT, and iNaturalist 2018. The table below summarizes representative top-1 accuracy results:

| Dataset (IF = imbalance factor) | Baseline (CE) | LDAM-DRW | LA | FBL |
|---|---|---|---|---|
| CIFAR-10-LT (IF=100) | 71.07 | 77.03 | 80.92 | 82.46 |
| CIFAR-100-LT (IF=100) | 39.43 | 42.04 | – | 45.22 |
| ImageNet-LT | 44.51 | 48.80 | 50.44 | 50.70 |
| iNaturalist 2018 | 63.80 | – | – | 69.90 |
| Places-LT | 27.13 | – | – | 38.66 |

Ablation studies of the $\alpha(t)$ schedule on CIFAR-10-LT (IF=100) show that the parabolic schedule outperforms the linear, sinusoidal, and cosine alternatives. Feature-norm visualizations indicate that FBL consistently enlarges tail-class norms while maintaining head-class norms. Per-class accuracies confirm substantial gains for the rarest classes (e.g., up to 81.9% accuracy for a tail class vs. 44.6% for cross-entropy), with negligible cost on head classes (Li et al., 2023).

6. Relation to Distributionally Robust Feature Learning

Complementary approaches such as Distributional Robustness Loss (DRO-LT) (Samuel et al., 2021) apply robustness theory directly in the feature space. DRO-LT defines an empirical centroid for each class and constructs an uncertainty ball of radius $\varepsilon_c$ around each centroid. The training loss optimizes against the worst-case shift within this ball, guarding against centroid estimation error in rare classes. Empirically tuning or learning the $\varepsilon_c$ parameters further improves tail robustness. Both methods aim to improve tail-class feature discriminability while preserving head-class performance; empirical results show DRO-LT obtains increased accuracy for tail classes and competitive top-1 results on CIFAR-100-LT, ImageNet-LT, and iNaturalist 2018 (Samuel et al., 2021). This reflects a growing recognition that long-tailed learning requires not just classifier weight balancing but direct intervention in feature-space learning.
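Schematically, and only as a paraphrase of the description above (the exact loss of Samuel et al., 2021 differs in its details), the worst-case-within-ball objective can be written as

$$\min_{\theta} \sum_{i} \; \max_{\tilde{\mu}\,:\,\|\tilde{\mu} - \hat{\mu}_{y_i}\| \le \varepsilon_{y_i}} \ell\bigl(f_\theta(x_i), \tilde{\mu}\bigr),$$

where $\hat{\mu}_c$ is the empirical centroid of class $c$, $f_\theta$ the feature extractor, and $\ell$ a feature-to-centroid loss; the inner maximization guards against centroid estimation error, which is largest for rare classes.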

7. Broader Significance in Long-Tailed Recognition

Feature Balancing Loss represents an efficient and easily integrated bias correction for long-tailed learning scenarios. By focusing on correcting the feature-space geometry, FBL circumvents the need for explicit class re-weighting or elaborate classifier manipulations. It demonstrates empirically that a calibrated, curriculum-tuned logit bias can yield state-of-the-art results across standard benchmarks, offering a plug-and-play approach for neural network practitioners confronted with real-world heavy-tailed data distributions (Li et al., 2023). The method’s success has influenced related research into feature-level robustness, as exemplified by DRO-LT, underscoring a central trend in modern visual recognition: robust, class-agnostic feature learning is as vital as classifier construction for narrowing head-to-tail accuracy gaps.
