
Feature Balancing Loss (FBL) Overview

Updated 16 December 2025
  • Feature Balancing Loss (FBL) is a method for mitigating representation bias in deep neural networks by applying a class-specific, curriculum-controlled negative logit bias.
  • It scales the bias based on class frequency and feature norm, encouraging robust feature learning and improving tail-class discriminability.
  • Empirical benchmarks on CIFAR-10-LT, CIFAR-100-LT, ImageNet-LT, Places-LT, and iNaturalist 2018 demonstrate that FBL significantly narrows accuracy gaps between head and tail classes.

Feature Balancing Loss (FBL) addresses representation bias in deep neural networks exposed to long-tailed class distributions, where a few classes (head) dominate and many others (tail) are underrepresented. The core principle of FBL is to counteract the smaller feature norms and weaker representation learning typical for tail classes. It achieves this by imposing a calibrated negative bias on the logits of each class during training, scaling the stimulus based on class frequency and gradually increasing its influence through a curriculum schedule. This mechanism encourages neural networks to develop more robust and discriminative features for tail classes, thereby reducing accuracy gaps without sacrificing head-class performance (Li et al., 2023).

1. Mathematical Formulation and Objective

Define a dataset $T$ of $N$ samples across $C$ classes. Each sample $x$ is mapped by the neural network to a feature vector $f \in \mathbb{R}^D$ from the penultimate layer. The classifier weights are $W = [w_1, \dots, w_C] \in \mathbb{R}^{D \times C}$, and the standard logit for class $j$ is $z_j = w_j^T f$. Denote $n_j$ as the number of training samples in class $j$, and $n_{\max} = \max_j n_j$.

FBL introduces a class-wise negative stimulus $\lambda_j = \log n_{\max} - \log n_j$, ensuring $\lambda_j \ge 0$, with the largest value for the rarest (tail) classes and zero for the most frequent class. The scheduled strength $\alpha(t)$ modulates the effect over training epochs.

The feature-balanced logit for class $j$ is

$$z^b_j = z_j - \alpha(t)\,\frac{\lambda_j}{\|f\|_2}.$$

The FBL loss for a mini-batch of $B$ samples is

$$L_{\text{FBL}} = -\frac{1}{B} \sum_{i=1}^{B} \log \frac{\exp(z^b_{y_i})}{\sum_{j=1}^{C} \exp(z^b_j)},$$

where $y_i$ is the ground-truth class of sample $i$. The formulation is plug-and-play, replacing the standard logits in the cross-entropy with $z^b_j$.
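A minimal PyTorch-style sketch of this loss is shown below. The class name `FeatureBalancingLoss`, its constructor arguments, and the per-call `alpha` argument are illustrative assumptions rather than the authors' released implementation; the arithmetic follows the definitions above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureBalancingLoss(nn.Module):
    """Sketch of FBL: softmax cross-entropy over feature-balanced logits."""

    def __init__(self, class_counts, eps=1e-12):
        super().__init__()
        counts = torch.as_tensor(class_counts, dtype=torch.float)
        # lambda_j = log n_max - log n_j  (zero for the most frequent class)
        self.register_buffer("lam", torch.log(counts.max()) - torch.log(counts))
        self.eps = eps

    def forward(self, logits, features, targets, alpha):
        # logits: (B, C) = W^T f, features: (B, D), targets: (B,), alpha: scalar
        feat_norm = features.norm(p=2, dim=1, keepdim=True).clamp_min(self.eps)
        # z^b_j = z_j - alpha(t) * lambda_j / ||f||_2, broadcast to (B, C)
        balanced_logits = logits - alpha * self.lam / feat_norm
        return F.cross_entropy(balanced_logits, targets)
```

In practice `alpha` would be computed once per epoch from one of the curriculum schedules described in Section 3.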

2. Feature Norm Dynamics and Rationale

Under standard softmax cross-entropy, classifier weight norms $\|w_j\|$ increase with class frequency $n_j$. Tail-class feature norms $\|f\|$ also tend to collapse, yielding smaller logits and degraded tail performance. For a fixed classifier, increasing $\|f\|$ monotonically decreases the softmax loss for correctly classified samples, effectively improving classification margins.
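To make this claim concrete, write $f = \|f\|\,\hat{f}$ with unit direction $\hat{f}$, so that $z_j = \|f\|\, w_j^\top \hat{f}$; the short derivation below is a standard argument under the fixed-classifier assumption, not an excerpt from the source. For a correctly classified sample with label $y$, i.e. $w_y^\top \hat{f} > w_j^\top \hat{f}$ for all $j \neq y$, the cross-entropy loss is

$$L_{\mathrm{CE}} = \log\Bigl(1 + \sum_{j \neq y} \exp\bigl(\|f\|\,(w_j^\top \hat{f} - w_y^\top \hat{f})\bigr)\Bigr),$$

in which every exponent is negative and scales linearly with $\|f\|$, so the loss decreases monotonically as $\|f\|$ grows.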

By adding the negative bias $-\alpha(t)\,\lambda_j / \|f\|_2$ to the logit, FBL compensates for this imbalance. The network is incentivized to increase tail-class feature norms to overcome the bias, equalizing feature-space geometry across classes. This mechanism encourages robust clustering of tail-class features, stabilizing mean feature norms and mitigating mode collapse (Li et al., 2023).

3. Curriculum Scheduling Strategies

A key aspect of FBL is the curriculum-based scaling of the bias via $\alpha(t)$, where $t$ is the current epoch and $T$ the total number of epochs. If the stimulus is applied too early or too strongly, gradients on head classes are insufficient, impairing their accuracy. Thus, $\alpha(t)$ is set to grow gradually, with several schedules evaluated:

  • Linear increase: $\alpha(t) = t/T$
  • Sinusoidal: $\alpha(t) = \sin\!\left(\frac{\pi}{2}\frac{t}{T}\right)$
  • Cosine: $\alpha(t) = 1 - \cos\!\left(\frac{\pi}{2}\frac{t}{T}\right)$
  • Parabolic (default and best-performing): $\alpha(t) = (t/T)^2$

Early epochs approximate standard cross-entropy, prioritizing initial learning of head classes. Later, the effect increases, primarily benefiting tail-class representation.
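The schedules above are simple functions of training progress; a minimal sketch is given below (the function name and signature are illustrative, not taken from the reference code).

```python
import math

def alpha_schedule(t, T, kind="parabolic"):
    """Curriculum strength alpha(t) in [0, 1] for epoch t of T total epochs."""
    progress = t / T
    if kind == "linear":
        return progress
    if kind == "sinusoidal":
        return math.sin(0.5 * math.pi * progress)
    if kind == "cosine":
        return 1.0 - math.cos(0.5 * math.pi * progress)
    if kind == "parabolic":  # default and best-performing schedule
        return progress ** 2
    raise ValueError(f"unknown schedule: {kind}")
```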

4. Implementation and Training Considerations

Implementing FBL requires minimal modification to standard neural network training:

  • Compute $\|f\|_2$ after the embedding layer; add a small $\epsilon$ if necessary for numerical stability.
  • Precompute $\lambda_j$ for all classes from the training class counts.
  • At each training step, determine $\alpha(t)$ using the selected schedule.
  • Compute the bias $\alpha(t)\,\lambda_j / \|f\|_2$ and subtract it from the logits before applying the softmax cross-entropy.
  • FBL is compatible with standard weight decay, data augmentation, and other loss terms.
  • No additional normalization beyond standard practice for weights or features is required.

SGD with momentum and conventional learning-rate schedules are used for optimization. No hyperparameters beyond the curriculum schedule and the precomputed $\lambda_j$ are required.
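Putting the pieces together, a training step might look like the following sketch. The `model` (assumed to return both logits and penultimate features), `train_loader`, `class_counts`, `total_epochs`, and the `FeatureBalancingLoss` / `alpha_schedule` helpers sketched earlier are illustrative assumptions, not the authors' released code; hyperparameter values are placeholders.

```python
import torch

# Assumed available: model(x) -> (logits, features), train_loader yielding (x, y),
# class_counts with n_j per class, total_epochs, and the helpers sketched above.
criterion = FeatureBalancingLoss(class_counts)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=2e-4)

for epoch in range(total_epochs):
    alpha = alpha_schedule(epoch, total_epochs, kind="parabolic")
    for x, y in train_loader:
        logits, features = model(x)          # W^T f and penultimate features
        loss = criterion(logits, features, y, alpha)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```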

5. Empirical Benchmarks and Effectiveness

FBL has been evaluated on standard long-tailed recognition benchmarks including CIFAR-10-LT, CIFAR-100-LT, ImageNet-LT, Places-LT, and iNaturalist 2018. The table below summarizes representative top-1 accuracy results:

| Dataset (IF = imbalance factor) | Baseline (CE) | LDAM-DRW | LA | FBL |
|---|---|---|---|---|
| CIFAR-10-LT (IF=100) | 71.07 | 77.03 | 80.92 | 82.46 |
| CIFAR-100-LT (IF=100) | 39.43 | 42.04 | – | 45.22 |
| ImageNet-LT | 44.51 | 48.80 | 50.44 | 50.70 |
| iNaturalist 2018 | 63.80 | – | – | 69.90 |
| Places-LT | 27.13 | – | – | 38.66 |

Ablation studies of the $\alpha(t)$ schedule on CIFAR-10-LT (IF=100) show that the parabolic schedule outperforms the linear, sinusoidal, and cosine alternatives. Feature-norm visualizations indicate that FBL consistently enlarges tail-class norms while maintaining head-class norms. Per-class accuracies confirm substantial gains for the rarest classes (e.g., up to 81.9% accuracy for a tail class vs. 44.6% for cross-entropy), with negligible cost on head classes (Li et al., 2023).

6. Relation to Distributionally Robust Feature Learning

Complementary approaches such as Distributional Robustness Loss (DRO-LT) (Samuel et al., 2021) apply robustness theory directly in the feature space. DRO-LT defines an empirical centroid for each class and constructs an uncertainty ball of radius $\varepsilon_c$ around each centroid. The training loss optimizes against the worst-case shift within this ball, guarding against centroid estimation error in rare classes. Empirically tuning or learning the $\varepsilon_c$ parameters further improves tail robustness. Both methods aim to improve tail-class feature discriminability while preserving head-class performance; empirical results show DRO-LT obtains increased accuracy for tail classes and competitive top-1 results on CIFAR-100-LT, ImageNet-LT, and iNaturalist 2018 (Samuel et al., 2021). This reflects a growing recognition that long-tailed learning requires not just classifier weight balancing but direct intervention in feature-space learning.
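Schematically, and only as a paraphrase of the description above (the exact loss of Samuel et al., 2021 differs in its details), the worst-case-within-ball objective can be written as

$$\min_{\theta} \sum_{i} \; \max_{\tilde{\mu}\,:\,\|\tilde{\mu} - \hat{\mu}_{y_i}\| \le \varepsilon_{y_i}} \ell\bigl(f_\theta(x_i), \tilde{\mu}\bigr),$$

where $\hat{\mu}_c$ is the empirical centroid of class $c$, $f_\theta$ the feature extractor, and $\ell$ a feature-to-centroid loss; the inner maximization guards against centroid estimation error, which is largest for rare classes.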

7. Broader Significance in Long-Tailed Recognition

Feature Balancing Loss represents an efficient and easily integrated bias correction for long-tailed learning scenarios. By focusing on correcting the feature-space geometry, FBL circumvents the need for explicit class re-weighting or elaborate classifier manipulations. It demonstrates empirically that a calibrated, curriculum-tuned logit bias can yield state-of-the-art results across standard benchmarks, offering a plug-and-play approach for neural network practitioners confronted with real-world heavy-tailed data distributions (Li et al., 2023). The method’s success has influenced related research into feature-level robustness, as exemplified by DRO-LT, underscoring a central trend in modern visual recognition: robust, class-agnostic feature learning is as vital as classifier construction for narrowing head-to-tail accuracy gaps.
