MarginMSE-Loss: Imbalanced Classification

Updated 6 May 2026
  • MarginMSE-Loss is a classification loss that modifies MSE by introducing an outlying label to create class-dependent margins for rare classes.
  • The method ensures balanced gradient updates across all output neurons by scaling one-hot targets using computed class ranks and a tunable margin hyperparameter α.
  • Empirical results show notable improvements in metrics like Top-1 accuracy and F-measure on datasets such as CIFAR-10 and CIFAR-100 compared to standard cross-entropy.

MarginMSE-Loss is a classification loss function designed for imbalanced data scenarios, where standard cross-entropy objectives underperform on minority classes. It modifies the conventional mean squared error (MSE) criterion by introducing an "outlying label," which scales the one-hot teacher label according to class rarity. This creates class-dependent margins in the output space, explicitly pushing rare classes farther from the origin than common classes, and ensures that each output neuron receives gradient updates irrespective of the ground-truth class. The approach was introduced and thoroughly analyzed in "MSE Loss with Outlying Label for Imbalanced Classification" (Kato et al., 2021).

1. Outlying Teacher Label Construction

Given $K$ classes with training-set counts $C_k$, the classes are ranked in descending order of frequency, establishing a unique rank $r(k) \in \{1, \dots, K\}$ for each class $k$. The teacher label for ground-truth class $y$ is the standard $K$-dimensional one-hot vector, multiplied by a scale determined by the class rank and a global margin hyperparameter. Formally,

$$\hat y_j = \begin{cases} \alpha\, r(y), & j = y, \\ 0, & j \neq y, \end{cases}$$

where $\alpha > 0$ is a margin hyperparameter. Rarer classes (larger $r(y)$) are explicitly assigned higher target logits, moving their representations farther from the origin. This mechanism produces "outlying margins" for rare classes and "inlying" status for common classes.
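As a concrete illustration, here is a minimal sketch of the label construction in PyTorch (the helper names `class_ranks` and `outlying_labels` are ours, not from the paper):

```python
import torch
import torch.nn.functional as F

def class_ranks(counts: torch.Tensor) -> torch.Tensor:
    """Rank classes by frequency in descending order: the most
    common class receives rank 1, the rarest receives rank K."""
    order = torch.argsort(counts, descending=True)  # class indices, most frequent first
    ranks = torch.empty_like(order)
    ranks[order] = torch.arange(1, counts.numel() + 1)
    return ranks

def outlying_labels(targets: torch.Tensor, ranks: torch.Tensor, alpha: float) -> torch.Tensor:
    """Scaled one-hot targets: alpha * r(y) at the true class, 0 elsewhere."""
    one_hot = F.one_hot(targets, num_classes=ranks.numel()).float()
    return one_hot * (alpha * ranks[targets].float()).unsqueeze(1)
```

For example, `counts = torch.tensor([5000, 500, 50])` yields ranks $(1, 2, 3)$, so with `alpha = 0.5` the rarest class's target logit is $0.5 \times 3 = 1.5$, while the most common class's is $0.5$.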

2. MarginMSE-Loss Definition and Gradient Properties

Let the model output (e.g., classifier logits) for a given input be $a \in \mathbb{R}^K$. The MarginMSE-Loss is the quadratic error between the output and the outlying target,

$$\mathcal{L}(a, \hat y) = \sum_{j=1}^{K} \left(a_j - \hat y_j\right)^2.$$

The corresponding gradient w.r.t. each output logit is

$$\frac{\partial \mathcal{L}}{\partial a_j} = 2\left(a_j - \hat y_j\right), \qquad j = 1, \dots, K.$$

All $K$ output logits enter the loss, so every output neuron receives a nonzero gradient for every sample. This differs from cross-entropy, where the one-hot label zeroes out all non-true-class terms of the loss, so the supervision signal is carried only by the true-class output for each sample.
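A matching sketch of the loss under the sum-of-squares form reconstructed above (whether the paper normalizes by $K$ is a detail not preserved here; the constant does not change the argument):

```python
import torch

def margin_mse_loss(logits: torch.Tensor, hat_y: torch.Tensor) -> torch.Tensor:
    """Quadratic loss between raw logits and the outlying targets.
    Every logit enters the sum, so d(loss)/d(a_j) = 2 * (a_j - hat_y_j)
    is nonzero for all K outputs of every sample."""
    return ((logits - hat_y) ** 2).sum(dim=1).mean()
```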

3. Effects on Feature Space and Backpropagation

The explicit scaling of the target by $\alpha\, r(y)$ yields a margin effect in the output and, by extension, in the penultimate-layer feature space. Rare classes are encouraged to be "outliers," separated from the decision boundary, while frequent classes remain close to the origin. This spatial effect is visualized using t-SNE: features of rare classes under cross-entropy collapse towards the global centroid, whereas under MarginMSE-Loss, class clusters exhibit ring-like separation proportional to their ranks $r(k)$. Every output dimension receives balanced parameter updates, equalizing the number of gradient steps regardless of class balance in the training data.
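A minimal sketch of this kind of feature-space inspection, assuming penultimate-layer features and labels have already been extracted into NumPy arrays (scikit-learn and matplotlib are our own tooling choices, not the paper's):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_feature_space(features: np.ndarray, labels: np.ndarray) -> None:
    """Project penultimate-layer features to 2-D with t-SNE and color
    by class, to check whether rare-class clusters separate outward."""
    embedding = TSNE(n_components=2, init="pca", random_state=0).fit_transform(features)
    plt.scatter(embedding[:, 0], embedding[:, 1], c=labels, s=4, cmap="tab20")
    plt.colorbar(label="class index")
    plt.show()
```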

4. Comparison to Cross-Entropy and Margin-Aware Losses

The standard cross-entropy (CE) loss with one-hot targets and softmax normalization focuses gradient updates only along the true-class branch. It does not provide direct suppression of false-class outputs toward zero, since the one-hot target makes every loss term $-\hat y_j \log p_j$ vanish for $j \neq y$. Weighted-CE and margin-aware variants (e.g., LDAM) introduce class-dependent weighting or subtractive margins inside the softmax. MarginMSE-Loss instead sets the margin directly at the target level within a regression-style quadratic loss and uses all-class updates, fundamentally shifting the supervision signal and backpropagation statistics.
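For concreteness, the two supervision signals side by side, in our own rendering of the standard definitions ($p_j$ denotes the softmax probability of class $j$):

```latex
% Cross-entropy with a one-hot target: only the true-class term survives.
\mathcal{L}_{\mathrm{CE}} = -\sum_{j=1}^{K} \mathbb{1}[j = y] \log p_j = -\log p_y

% MarginMSE: every logit carries its own quadratic penalty toward its target.
\mathcal{L}(a, \hat y) = \sum_{j=1}^{K} (a_j - \hat y_j)^2, \qquad
\frac{\partial \mathcal{L}}{\partial a_j} = 2\,(a_j - \hat y_j)
```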

5. Implementation and Hyperparameterization

MarginMSE-Loss requires precomputing the class ranks $r(k)$ from dataset statistics (the training-set counts $C_k$). Each minibatch constructs outlying labels as per the definition above, against which the model's logits are compared using the quadratic loss. The margin-scale hyperparameter $\alpha$ determines the extent of target separation and must be tuned on a validation split. The protocol is as follows (a code sketch appears at the end of this section):

  1. Precompute the counts $C_k$ and ranks $r(k)$ for all classes.
  2. Build the outlying label $\hat y$ for each batch as above.
  3. Compute outputs $a$ and evaluate $\mathcal{L}(a, \hat y)$.
  4. Backpropagate the loss and update parameters.
  5. Tune $\alpha$ using validation performance.

Dataset- and architecture-specific values of $\alpha$ were used in the empirical evaluations, tuned separately for CIFAR-10 (ResNet-34 trained from scratch), CIFAR-100 (ResNet-34 trained from scratch), Food-101 (pre-trained ResNet-50), and CamVid (FastFCN). Optimization regimes and batch sizes followed standard practice for these datasets.
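Putting the steps together, a minimal training-loop sketch reusing the helpers sketched in Sections 1 and 2 (`model`, `train_loader`, and per-class `class_counts` are assumed to exist, and the `alpha` value is illustrative, not a default from the paper):

```python
import torch

# Assumed context: `model`, `train_loader`, and `class_counts` exist;
# `class_ranks`, `outlying_labels`, and `margin_mse_loss` are the
# helpers sketched above.
ranks = class_ranks(class_counts)                   # step 1: ranks from dataset statistics
alpha = 0.5                                         # step 5: illustrative value, tune on validation
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

for images, targets in train_loader:
    hat_y = outlying_labels(targets, ranks, alpha)  # step 2: outlying targets
    logits = model(images)                          # step 3: raw logits, no softmax
    loss = margin_mse_loss(logits, hat_y)           # step 3: quadratic loss
    optimizer.zero_grad()
    loss.backward()                                 # step 4: all-class gradient updates
    optimizer.step()
```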

6. Empirical Results and Geometric Effects

Experimental evaluation demonstrates that MarginMSE-Loss provides consistent improvements over standard and weighted cross-entropy objectives in both classification and semantic segmentation on long-tailed benchmarks. For imbalanced CIFAR-10 (ResNet-34, imbalance ratio 100), the method increased Top-1 accuracy from 44.50% (CE) to 46.99% (+2.49 points) and improved F-measure from 39.25% to 42.89%. On CIFAR-100, Top-1 accuracy increased by 4.87 points over vanilla CE. In Food-101 and CamVid segmentation, improvements of 1.43 points and 5.20 points, respectively, were observed over weighted-CE. Feature-space visualizations consistently show improved separation of rare versus common classes, with rare-class clusters pushed outward in proportion to their outlying margins, mitigating feature collapse and class overlap (Kato et al., 2021).

7. Implementation Workflow and Application Notes

A concise implementation recipe is provided:

  1. Compute class ranks.
  2. For each training batch, construct outlying target labels.
  3. Compute network logits and apply MarginMSE-Loss.
  4. Apply backpropagation with all-class gradient updates.
  5. Tune margin-scale r(k){1,,K}r(k)\in\{1,\dots,K\}9 on a validation set.

Applications to image classification and semantic segmentation demonstrate the ease of integrating this loss by changing only the loss and label construction, without architectural modifications or sophisticated reweighting. The method is particularly effective in scenarios suffering from severe class imbalance, establishing a new feature-space geometry that protects rare classes from domination by majority ones.


For in-depth methodological, empirical, and geometric details, see "MSE Loss with Outlying Label for Imbalanced Classification" (Kato et al., 2021).

References

  1. Kato et al., 2021. "MSE Loss with Outlying Label for Imbalanced Classification."
