Long-Tailed Learning for Generalized Category Discovery (2506.06965v1)

Published 8 Jun 2025 in cs.AI and cs.CV

Abstract: Generalized Category Discovery (GCD) utilizes labeled samples of known classes to discover novel classes in unlabeled samples. Existing methods show effective performance on artificial datasets with balanced distributions. However, real-world datasets are always imbalanced, significantly affecting the effectiveness of these methods. To solve this problem, we propose a novel framework that performs generalized category discovery in long-tailed distributions. We first present a self-guided labeling technique that uses a learnable distribution to generate pseudo-labels, resulting in less biased classifiers. We then introduce a representation balancing process to derive discriminative representations. By mining sample neighborhoods, this process encourages the model to focus more on tail classes. We conduct experiments on public datasets to demonstrate the effectiveness of the proposed framework. The results show that our model exceeds previous state-of-the-art methods.

Summary

  • The paper proposes a framework for generalized category discovery on imbalanced datasets, with two components designed for long-tailed distributions.
  • It combines a self-guided labeling method, which uses long-tailed clustering to estimate class distributions for pseudo-labeling, with a representation balancing process that shifts feature learning toward tail classes.
  • Experiments on long-tailed benchmarks such as CIFAR-100-LT report overall accuracy of up to 74.4%, with gains that hold across different imbalance ratios.

Long-Tailed Learning for Generalized Category Discovery

The paper addresses Generalized Category Discovery (GCD) under long-tailed data distributions. GCD uses labeled samples from known classes to identify novel classes in unlabeled data. Existing methods mostly assume balanced class distributions, an assumption that rarely holds for real-world datasets, which often follow a long-tailed distribution. Under such imbalance, a few head classes dominate training while tail classes have very few samples, which makes novel categories harder to discover.

To tackle this challenge, the authors propose a framework designed specifically for long-tailed GCD. It has two components: one targets classifier bias through pseudo-labeling, and the other targets representation learning under imbalanced data.

Framework Overview

  1. Self-Guided Labeling Technique
    • The framework introduces a self-guided labeling technique that employs a learnable data distribution to create pseudo-labels. This approach aims to address the biases in classifiers caused by imbalanced data, resulting in improved label accuracy.
    • This method relies on the estimation of class distributions through an efficient long-tailed clustering technique, followed by the refinement of these distributions using a learnable adjustment strategy guided by the Sinkhorn-Knopp algorithm.
  2. Representation Balancing Process
    • The framework includes a representation balancing phase, which focuses the model’s attention on tail classes by evaluating and leveraging the neighborhood density of samples in the feature space.
    • This process aims to generate more discriminative representations by emphasizing underrepresented classes, thereby improving the model's ability to classify both known and novel categories.
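
The two components above can be illustrated with a small NumPy sketch. It is not the authors' implementation. The first function uses the standard Sinkhorn-Knopp iteration to turn classifier logits into soft pseudo-labels whose class proportions match a given prior; in the paper that distribution is estimated by clustering and then refined during training, whereas here it is a fixed input. The second function is one plausible reading of "neighborhood density": it assumes that samples in sparse neighborhoods should receive larger weights. The function names, `temperature`, `n_iters`, and `k` are all hypothetical.

```python
import numpy as np


def sinkhorn_pseudo_labels(logits, class_prior, n_iters=50, temperature=1.0):
    """Soft pseudo-labels whose average class proportions match class_prior.

    Assumption: class_prior is given. The paper learns and refines it.
    """
    n = logits.shape[0]
    # Per-row max subtraction keeps the exponentials numerically stable.
    q = np.exp((logits - logits.max(axis=1, keepdims=True)) / temperature)
    q /= q.sum()
    row_target = np.full(n, 1.0 / n)             # every sample has equal mass
    col_target = class_prior / class_prior.sum()  # desired class proportions
    for _ in range(n_iters):
        q *= (col_target / q.sum(axis=0))[None, :]  # match class marginals
        q *= (row_target / q.sum(axis=1))[:, None]  # match sample marginals
    # Rescale so that each row is a probability vector over classes.
    return q / q.sum(axis=1, keepdims=True)


def neighborhood_density_weights(features, k=5):
    """Per-sample weights that grow as the local neighborhood gets sparser.

    Assumption: sparse neighborhoods are used as a proxy for tail classes.
    """
    z = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = z @ z.T                     # pairwise cosine similarity
    np.fill_diagonal(sim, -np.inf)    # a sample is not its own neighbor
    topk = np.sort(sim, axis=1)[:, -k:]
    density = topk.mean(axis=1)       # mean similarity to k nearest neighbors
    weights = np.maximum(1.0 - density, 1e-8)
    return weights / weights.mean()   # normalize to mean 1


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    logits = rng.normal(size=(200, 4))
    prior = np.array([0.6, 0.25, 0.1, 0.05])  # illustrative long-tailed prior
    labels = sinkhorn_pseudo_labels(logits, prior)
    print("average pseudo-label mass per class:", labels.mean(axis=0))
```

With a uniform prior, the first function reduces to the balanced assignment used in many self-labeling methods. Passing a long-tailed prior lets tail classes keep a small but nonzero share of the pseudo-labels rather than forcing every class to the same size.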

Experimental Evaluation

The proposed framework is empirically validated on several public long-tailed datasets: CIFAR-10-LT, CIFAR-100-LT, ImageNet-100-LT, and Places-365-LT. The experimental results demonstrate that the framework outperforms existing state-of-the-art methods in all evaluated scenarios, highlighting improvements in the detection and classification of novel classes in imbalanced datasets.

  • On datasets like CIFAR-100-LT, the framework reaches an overall accuracy of 74.4%, which the paper reports as significantly higher than previous methods. Self-guided labeling reduces the classifier's bias toward head classes, and representation balancing increases the model's focus on tail classes.
  • The framework's improved performance is consistent across varying imbalance ratios, demonstrating robustness and adaptability to different levels of data imbalance.
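
To make "imbalance ratio" concrete: it is commonly defined as the size of the largest class divided by the size of the smallest. The sketch below generates per-class sample counts with an exponential decay, a profile often used to build long-tailed CIFAR variants. Whether the paper constructs its splits exactly this way is an assumption, and the function name and parameters are hypothetical.

```python
def long_tailed_class_counts(n_max, num_classes, imbalance_ratio):
    """Per-class sample counts decaying exponentially from n_max.

    The last class ends up with roughly n_max / imbalance_ratio samples.
    """
    if num_classes < 2:
        return [n_max] * num_classes
    counts = []
    for i in range(num_classes):
        frac = i / (num_classes - 1)  # 0 for the head class, 1 for the last
        counts.append(max(1, round(n_max * imbalance_ratio ** (-frac))))
    return counts


if __name__ == "__main__":
    # Illustrative values: 100 classes, 500 samples in the largest class.
    for ratio in (10, 50, 100):
        counts = long_tailed_class_counts(500, 100, ratio)
        print(ratio, counts[0], counts[-1], sum(counts))
```

A larger ratio leaves the head class unchanged and shrinks the tail, so the same method can be evaluated under progressively harsher imbalance.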

Implications and Future Directions

The proposed framework advances generalized category discovery by handling long-tailed data distributions, a common characteristic of real-world datasets. This could enable more accurate class discovery in applications ranging from autonomous driving to healthcare.

The research suggests several avenues for future exploration. Extending the framework to operate without any annotated data could further broaden its applicability. Moreover, integrating advanced representation learning techniques may enhance the discovery of subtle and complex class structures within the data.

Overall, the framework improves on existing methods and provides a foundation for future research on generalized category discovery with long-tailed data.
