
Learning Prototype Classifiers for Long-Tailed Recognition

Published 1 Feb 2023 in cs.CV and cs.LG (arXiv:2302.00491v3)

Abstract: The problem of long-tailed recognition (LTR) has received attention in recent years due to the fundamental power-law distribution of objects in the real-world. Most recent works in LTR use softmax classifiers that are biased in that they correlate classifier norm with the amount of training data for a given class. In this work, we show that learning prototype classifiers addresses the biased softmax problem in LTR. Prototype classifiers can deliver promising results simply using Nearest-Class-Mean (NCM), a special case where prototypes are empirical centroids. We go one step further and propose to jointly learn prototypes by using distances to prototypes in representation space as the logit scores for classification. Further, we theoretically analyze the properties of Euclidean distance based prototype classifiers that lead to stable gradient-based optimization which is robust to outliers. To enable independent distance scales along each channel, we enhance Prototype classifiers by learning channel-dependent temperature parameters. Our analysis shows that prototypes learned by Prototype classifiers are better separated than empirical centroids. Results on four LTR benchmarks show that Prototype classifier outperforms or is comparable to state-of-the-art methods. Our code is made available at https://github.com/saurabhsharma1993/prototype-classifier-ltr.


Summary

  • The paper presents prototype classifiers that mitigate softmax bias by using Euclidean distances instead of dot products with classifier weights.
  • It introduces channel-dependent temperature scaling and logit adjustment to enhance performance, particularly for minority classes.
  • Extensive experiments show the method's robustness, rapid convergence, and superior tail class accuracy across various benchmarks.

Prototype Classifiers for Long-Tailed Recognition: Technical Summary

Background and Motivation

Long-tailed recognition (LTR) presents significant challenges due to power-law distributions prevalent in real-world visual datasets, resulting in severe class imbalance. Conventional classifiers utilizing softmax activation suffer from biased decision boundaries, predominantly favoring head classes. This bias arises because the softmax classifier norm is strongly correlated with class size, encoding class priors through weight magnitudes. Recent research has sought to mitigate these effects with data resampling, loss reshaping, ensemble models, and knowledge transfer—all with varying degrees of success.

The paper, "Learning Prototype Classifiers for Long-Tailed Recognition" (2302.00491), introduces an alternative paradigm: prototype classifiers, which assign classes based on distances in representation space rather than dot products with classifier weights. Prototype classifiers, especially those utilizing learnable prototypes, decouple classifier norm from class size, thereby countering the head class bias inherent in softmax approaches. Figure 1

Figure 1: Softmax classifiers encode class priors via weight magnitude, yielding biased boundaries; prototype classifiers leverage Euclidean distances to learnable prototypes, producing fairer decision boundaries.

Prototype Classifier Formulation

The proposed model operates in two stages. Feature representations are learned with auxiliary softmax classifiers on imbalanced data. Subsequently, these representations are fixed, and prototypes are initialized as empirical centroids. A learnable prototype $c_y$ per class $y$ is used, with class probabilities determined via negative Euclidean distance:

$$\log p(y \mid g(x)) \propto -\tfrac{1}{2}\, d(g(x), c_y)$$

where $d(g(x), c_y)$ is the Euclidean distance in representation space.
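To make this concrete, here is a minimal PyTorch sketch of a prototype classifier under the formulation above. It is an illustrative interface, not the authors' released code; `PrototypeClassifier` and its argument names are hypothetical.

```python
import torch
import torch.nn as nn

class PrototypeClassifier(nn.Module):
    """Sketch: one learnable prototype per class; logits are negative
    halved Euclidean distances, matching the equation above."""
    def __init__(self, feat_dim: int, num_classes: int, centroids=None):
        super().__init__()
        # Initialize prototypes from empirical centroids when available
        # (the paper's initialization), otherwise randomly.
        init = centroids if centroids is not None else torch.randn(num_classes, feat_dim)
        self.prototypes = nn.Parameter(init.clone())

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, d); prototypes: (C, d) -> pairwise distances (B, C)
        dists = torch.cdist(feats, self.prototypes)
        return -0.5 * dists  # logit(y) proportional to -1/2 * d(g(x), c_y)
```

Training these logits with standard cross-entropy then recovers the negative log-likelihood objective above.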

Gradient Analysis

A salient theoretical result demonstrates that the gradient of the negative log-likelihood objective with respect to a prototype $c_y$ is proportional to the misclassification probability. The gradient direction aligns with $g(x) - c_y$ for positive samples and reverses for negatives. Critically, the gradient norm is independent of the distance itself, yielding stable optimization that is robust to outlying samples. By contrast, squared Euclidean distance amplifies outlier influence, a weakness corroborated by substantial empirical performance drops.
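A quick numerical illustration of this property (our own check, not code from the paper): for Euclidean distance, the gradient of $d(g(x), c_y)$ with respect to the prototype is a unit vector no matter how far the sample lies, whereas for squared distance the gradient norm grows with the distance.

```python
import torch

x = torch.zeros(8)  # a fixed feature vector
for scale in [1.0, 10.0, 1000.0]:  # move the prototype farther away
    c = torch.full((8,), scale, requires_grad=True)
    torch.linalg.norm(x - c).backward()      # Euclidean distance
    print(f"scale={scale}: grad norm = {c.grad.norm().item():.3f}")  # ~1.0 each time

    c2 = torch.full((8,), scale, requires_grad=True)
    (x - c2).pow(2).sum().backward()         # squared Euclidean distance
    print(f"   squared: grad norm = {c2.grad.norm().item():.1f}")    # grows with distance
```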

Enhancements: Channel-Dependent Temperatures and Logit Adjustment

Recognizing that feature channels may have distinct variance scales, the model introduces channel-dependent temperature parameters $T_i$:

$$d_{CDT}(g(x), y) = \sqrt{\sum_{i=1}^{d} \frac{\left(g(x) - c_y\right)_i^2}{T_i}}$$

This equates to learning a diagonal Mahalanobis metric in the representation space, offering feature scaling and selection capabilities.
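A sketch of how CDT could be implemented, assuming an exponential parameterization to keep the temperatures positive (a detail the paper does not specify):

```python
import torch
import torch.nn as nn

class CDTPrototypeClassifier(nn.Module):
    """Sketch of prototypes with channel-dependent temperatures (CDT)."""
    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(num_classes, feat_dim))
        # log_temp = 0 gives T_i = 1, i.e. plain Euclidean distance;
        # exp(.) keeps every temperature strictly positive (our assumption).
        self.log_temp = nn.Parameter(torch.zeros(feat_dim))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        temps = torch.exp(self.log_temp)                           # (d,)
        diff = feats.unsqueeze(1) - self.prototypes.unsqueeze(0)   # (B, C, d)
        d_cdt = torch.sqrt((diff.pow(2) / temps).sum(-1) + 1e-12)  # (B, C)
        return -0.5 * d_cdt
```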

Logit adjustment further penalizes head classes during prototype learning:

$$L_{xy} = -\log \frac{e^{-\frac{1}{2} d(g(x), c_y) + \tau \log N_y}}{\sum_{y'} e^{-\frac{1}{2} d(g(x), c_{y'}) + \tau \log N_{y'}}}$$

where $N_y$ is the sample count of class $y$ and $\tau$ is the adjustment weight. This improves tail-class recall, as validated by ablation studies.
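A hedged sketch of the adjusted training loss; `class_counts` (the per-class $N_y$) and `tau` are assumed inputs, and the function name is hypothetical:

```python
import torch
import torch.nn.functional as F

def logit_adjusted_loss(logits, targets, class_counts, tau=1.0):
    # logits: (B, C), already equal to -1/2 * d(g(x), c_y);
    # adding tau * log N_y penalizes head classes during prototype learning.
    adjustment = tau * torch.log(class_counts.float())  # (C,)
    return F.cross_entropy(logits + adjustment, targets)
```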

Empirical Evaluation

Prototype classifiers are evaluated on CIFAR100-LT, ImageNet-LT, iNaturalist18, and CIFAR10-LT. Models use fixed feature backbones (e.g., ResNet-32, ResNeXt-50, ResNet-50), initialized via end-to-end training with strong weight decay and balanced sampling.

Ablation Studies

  • Nearest-Class-Mean (NCM): Using empirical centroids with no learning already surpasses softmax accuracy, because centroid norms do not encode class size (see the sketch after this list).
  • Learnable Prototypes: Further boosts accuracy, particularly on tail classes.
  • Channel-Dependent Temperatures and Logit Adjustment: Combined, these refinements attain maximal accuracy, especially for minority classes.
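As referenced in the NCM item above, here is a minimal sketch of the NCM baseline on frozen features (illustrative helper names; assumes every class has at least one training sample):

```python
import torch

def ncm_centroids(feats, labels, num_classes):
    # feats: (N, d) frozen backbone features; labels: (N,) integer classes
    return torch.stack([feats[labels == y].mean(0) for y in range(num_classes)])

def ncm_predict(feats, centroids):
    # Predict the class of the nearest empirical centroid.
    return torch.cdist(feats, centroids).argmin(dim=1)
```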

Norm Visualization

Figure 2: The prototype classifier achieves equinorm prototype vectors, eliminating the correlation between classifier norm and class size that affects both softmax weights and empirical centroids.

Prototype Evolution

Figure 3: Training increases prototype separation in both Euclidean and angular terms. Minority classes exhibit reduced separation due to the Minority Collapse effect but still improve substantially over empirical centroids.

Distance Metric Comparison

Empirical comparisons confirm that Euclidean distance performs best for prototype learning: squared Euclidean distance renders models highly sensitive to outliers, while cosine distance underperforms on tail classes.

Backbone Agnosticism

Prototype classifiers consistently outperform alternate classifier-learning schemes across all tested feature backbones, confirming their robustness and transferability.

Benchmark Results

Extensive evaluation across benchmarks shows that prototype classifiers outperform or are comparable to state-of-the-art long-tailed recognition methods (e.g., cRT, Logit Adjustment, DRO-LT, WD + MaxNorm, PaCo) without relying on self-supervised pretraining or ensembling. The method is particularly effective for tail-class accuracy, addressing a key limitation of conventional approaches.

Practical and Theoretical Implications

The proposed prototype classifier framework establishes a principled geometric alternative to the softmax paradigm for imbalanced data. Its stability and resistance to bias suggest broader applicability wherever the coupling between classifier norm and class size is detrimental. Prototype learning converges in a single epoch, making the method fast to deploy.

Channel-wise temperature scaling hints at further integration with metric learning, feature selection, and domain adaptation strategies. As prototype classifiers are a layer-level design, they are compatible with contemporary backbone advances, ensembling, and self-supervised representation learning.

Future research might extend prototype learning to more general metric spaces or combine it with ensemble architectures for enhanced robustness.

Conclusion

Prototype classifiers, formulated as learnable centroids in representation space with Euclidean distances and channel-dependent temperature scaling, demonstrate superior performance in long-tailed recognition tasks. They circumvent the biased classifier norm problem inherent in softmax models, exhibit stable optimization dynamics robust to outliers, and produce better-separated class representations. These properties are validated both theoretically and empirically across multiple benchmark datasets. The approach offers substantial practical advantages and lays the foundation for further innovation in classifier design for imbalanced data regimes.


Knowledge Gaps

Below is a concise list of what remains missing, uncertain, or unexplored in the paper; each item is framed to suggest concrete directions for future work.

  • Assumption of unimodal class-conditional Gaussians: The method assumes a single prototype per class and unit-variance Gaussian likelihoods, which may be unrealistic for multi-modal classes; effectiveness with multiple prototypes per class or mixture models is not explored.
  • Prototype count per class: The paper does not investigate learning K>1 prototypes per class (e.g., clustering-based initialization or learnable K) or selection criteria for K.
  • Theoretical analysis limited to Euclidean distance without CDT/LA: Stability proofs cover Euclidean distance only, but not the actual training objective with channel-dependent temperatures and logit adjustment; formal guarantees under CDT and LA are absent.
  • Squared Euclidean analysis vs practice mismatch: The “bias negation via coupling” argument uses squared Euclidean distance, whereas training uses Euclidean distance; there is no theoretical bridge showing the same bias behavior under the Euclidean metric used in practice.
  • Convergence and generalization guarantees: No formal convergence rates, sample complexity, or generalization bounds are provided for the prototype learning dynamics under long-tailed distributions.
  • Robustness to label noise and heavy-tailed feature noise: Although the gradient is distance-invariant for Euclidean distance, the method’s sensitivity to mislabeled samples and systematic noise is untested; robust distance losses (Huber/Tukey) or robust prototype regularizers are not studied.
  • CDT positivity and regularization: The paper does not specify how T_i are constrained to be positive nor how they are regularized (e.g., softplus parameterization, priors/penalties), risking invalid or degenerate temperatures.
  • Overfitting risks of CDT: There is no study of overfitting when learning channel-wise temperatures on scarce tail-class data; regularization strategies (e.g., sparsity, shrinkage to 1, layer-wise grouping) are untested.
  • Scope of metric learning: CDT implements only a diagonal Mahalanobis metric; full (low-rank or structured) Mahalanobis learning, class-wise or prototype-wise metrics, or input-conditional metrics are not explored.
  • Calibration and uncertainty: The paper reports accuracy only; calibration metrics (ECE, Brier), class-wise calibration under imbalance, and the effect of LA on calibration are not measured.
  • Inference-time prior shift: Logit adjustment is used during training only; strategies for handling changing class priors at inference (e.g., posterior recalibration, Bayes correction) are not evaluated.
  • Computational and memory scalability: With thousands of classes (e.g., iNat18) the O(C·d) cost of per-sample distance computation may be significant; techniques like approximate nearest prototypes, hierarchical search, or prototype pruning are not analyzed.
  • Prototype initialization strategies: Only centroid initialization is considered; sensitivity to initialization and alternatives (e.g., K-means, medoids, density-weighted centroids) are not studied.
  • Training schedule minimalism: Prototype learning uses only one epoch; the impact of longer training, different optimizers, or annealing schedules (for CDT/LA) is not investigated.
  • Sensitivity analyses: No systematic ablation for the logit-adjustment strength τ, prototype learning rate, CDT learning rate, momentum, or data-augmentation choices is provided.
  • Data augmentation with frozen features: Prototype training applies augmentations while freezing the backbone; it is unclear if features are recomputed online or cached; the trade-offs and consistency of this setup are not examined.
  • End-to-end vs decoupled training: The paper adopts two-stage training with frozen backbones; whether joint end-to-end training of features and prototypes (possibly with prototype/metric regularization) improves performance is untested.
  • Comparison breadth: Methods closely related to normalized/margin-based softmax (e.g., NormFace/CosFace/ArcFace variants) and strong long-tail baselines beyond LDAM/LA are not comprehensively compared under identical backbones and tuning.
  • Backbone diversity: Experiments use specific CNN backbones; generality to modern transformers, larger models, or self-supervised backbones is not evaluated.
  • Feature normalization effects: The role of feature normalization (e.g., L2 normalization, whitening) on prototype learning and CDT behavior is not investigated.
  • Handling multi-label or hierarchical labels: The method targets single-label classification; extension to multi-label or hierarchical long-tailed recognition (e.g., leveraging class taxonomies) is unexplored.
  • Real-world test distributions: Results are reported on balanced validation/test sets; performance under naturally imbalanced test distributions, cost-sensitive metrics, or class-conditional risk is not studied.
  • Tail-class minority collapse: While prototype separation increases on average, Tail–Tail separation remains smaller than Head–Head; mechanisms explicitly countering minority collapse (e.g., margin constraints between tail prototypes) are not proposed.
  • OOD detection and rejection: Prototype-based models often lend themselves to distance-based rejection; the method’s OOD detection or abstention capabilities are not evaluated.
  • Fairness and “more fair decision boundaries” claim: The claim of fairer boundaries is not validated with fairness metrics or analyses of per-class decision margins under imbalance.
  • Effect of feature scale shifts: Because distances are sensitive to representation scaling, robustness to backbone scale shifts (e.g., due to different normalization/weight decay) is not assessed.
  • Prototype and CDT regularization: There is no exploration of explicit prototype norm constraints, inter-prototype margin regularizers, or CDT sparsity/entropy penalties to improve separation and generalization.
  • Large-class regimes: Feasibility and performance for very large label spaces (e.g., 100k classes) in terms of search, memory, and prototype sharing/tying are not examined.
  • Beyond accuracy: No analysis of error types (confusion matrices, per-class recall/precision), long-tail macro/micro-F1, or tail vs head calibration trade-offs is provided.
  • Multi-task or incremental learning: Prototypes are a natural fit for class-incremental or few-shot tail updates; incremental addition of classes, prototype replay, or continual learning settings are not tested.
  • Domain shift robustness: The approach is evaluated in in-domain settings only; robustness to domain shift, corruption, and distribution drift is not measured.
  • Adversarial robustness: The sensitivity of distance-based prototypes to adversarial perturbations and potential defenses (e.g., adversarially robust metrics) remain open.
  • Theoretical link to Bayes-optimality under imbalance: A formal treatment connecting prototype learning with Bayes-optimal classifiers under imbalanced priors (with/without LA) is missing.
  • Practical constraints on T_i: No discussion of numerical stability (e.g., division by very small T_i), parameterization tricks, or bounds to ensure stable training and inference.
