Textual Knowledge Matters: Cross-Modality Co-Teaching for Generalized Visual Class Discovery (2403.07369v2)

Published 12 Mar 2024 in cs.CV

Abstract: In this paper, we study the problem of Generalized Category Discovery (GCD), which aims to cluster unlabeled data from both known and unknown categories using the knowledge of labeled data from known categories. Current GCD methods rely only on visual cues, neglecting the multi-modality perceptive nature of human cognitive processes in discovering novel visual categories. To address this, we propose a two-phase TextGCD framework that accomplishes multi-modality GCD by exploiting powerful Visual-Language Models (VLMs). TextGCD comprises a retrieval-based text generation (RTG) phase and a cross-modality co-teaching (CCT) phase. First, RTG constructs a visual lexicon from category tags of diverse datasets and attributes produced by LLMs, and generates descriptive texts for images in a retrieval manner. Second, CCT leverages disparities between the textual and visual modalities to foster mutual learning, thereby enhancing visual GCD. In addition, we design an adaptive class aligning strategy to keep category perceptions aligned between modalities, as well as a soft-voting mechanism to integrate multi-modality cues. Experiments on eight datasets demonstrate the clear superiority of our approach over state-of-the-art methods. Notably, our approach outperforms the best competitor by 7.7% and 10.8% in All accuracy on ImageNet-1k and CUB, respectively.
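The abstract only outlines the two phases, so the following is a minimal, hypothetical sketch of the two mechanisms it names: retrieval-based text generation (selecting lexicon tags and attributes most similar to an image embedding) and soft-voting fusion of the visual and textual class predictions. It is not the authors' implementation; it assumes embeddings already produced by a CLIP-style encoder, and all names, the top-k values, and the fusion weight `alpha` are illustrative placeholders.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Normalize embeddings so dot products act as cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def retrieve_description(img_emb, tag_embs, tag_names, attr_embs, attr_names,
                         k_tags=3, k_attrs=5):
    """RTG-style retrieval (sketch): pick the lexicon tags and attributes whose
    text embeddings are most similar to the image embedding, and join them
    into a descriptive text for the image."""
    img = l2_normalize(img_emb)
    tag_idx = np.argsort(l2_normalize(tag_embs) @ img)[::-1][:k_tags]
    attr_idx = np.argsort(l2_normalize(attr_embs) @ img)[::-1][:k_attrs]
    tags = [tag_names[i] for i in tag_idx]
    attrs = [attr_names[i] for i in attr_idx]
    return "a photo of " + ", ".join(tags) + " with " + ", ".join(attrs)

def soft_vote(p_visual, p_textual, alpha=0.5):
    """Soft-voting fusion (sketch): convex combination of per-class
    probabilities from the visual and textual branches."""
    return alpha * p_visual + (1.0 - alpha) * p_textual

# Toy usage with random placeholder embeddings standing in for CLIP features.
rng = np.random.default_rng(0)
d, n_tags, n_attrs, n_classes = 512, 1000, 4000, 10
img_emb = rng.normal(size=d)
tag_embs = rng.normal(size=(n_tags, d))
attr_embs = rng.normal(size=(n_attrs, d))
tag_names = [f"tag_{i}" for i in range(n_tags)]
attr_names = [f"attr_{i}" for i in range(n_attrs)]

print(retrieve_description(img_emb, tag_embs, tag_names, attr_embs, attr_names))

p_vis = rng.dirichlet(np.ones(n_classes))   # visual-branch class probabilities
p_txt = rng.dirichlet(np.ones(n_classes))   # textual-branch class probabilities
print(soft_vote(p_vis, p_txt).argmax())     # fused prediction
```

In the CCT phase proper, each branch would additionally supply confident pseudo-labels to train the other; the fixed `alpha` above is only a stand-in for the paper's soft-voting scheme.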
