Exploring Vision-Language Models for Imbalanced Learning
The paper "Exploring Vision-Language Models for Imbalanced Learning" examines how vision-language models (VLMs), particularly those built on contrastive language-image pre-training, behave when applied to imbalanced datasets. While VLMs such as CLIP have demonstrated strong zero-shot classification performance, they struggle on datasets with skewed class distributions, a challenge common in practical domains like autonomous driving and medical diagnosis. This article surveys the paper's methods for improving VLM performance under imbalance, with a focus on predicting minority (tail) classes more accurately.
Key Findings and Methodology
- Performance of VLMs on Imbalanced Datasets: The shortcomings of off-the-shelf VLMs are shown empirically: zero-shot CLIP achieves only about 5% accuracy on the iNaturalist18 dataset, underscoring the need for better handling of class imbalance (a minimal zero-shot evaluation sketch follows this list).
- Integration of a Lightweight Decoder: The authors add a lightweight decoder after the Vision Transformer (ViT) image encoder of the VLM. This small head keeps memory overhead low while capturing the subtle features needed to distinguish tail classes (see the second sketch after this list).
- Incorporation of Imbalanced Learning Techniques: The paper investigates strategies to improve VLMs on imbalanced data, including prompt tuning, fine-tuning, and imbalanced classification losses such as Focal Loss, Balanced Softmax, and Distribution Alignment (the second sketch after this list illustrates Balanced Softmax). Notably, these strategies prioritize training efficiency and modest computational cost, keeping them feasible across different computing setups.
- Experimental Validation: The proposed methods are evaluated on the ImageNet-LT, iNaturalist18, and Places-LT benchmarks. Adding the imbalanced learning techniques improves average accuracy by 6.58%, 69.82%, and 6.17% on these datasets, respectively.
- Analysis of VLM Pre-training and Backbone: The paper also analyzes how pre-training data size and model architecture affect performance. Larger pre-training datasets do not consistently improve VLMs' handling of imbalanced data, whereas the choice of backbone, such as ViT-L/14 over ViT-B/16, has a significant impact.
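
For context on the first bullet, zero-shot evaluation with CLIP scores each image against a text prompt for every class name and picks the most similar one; this is the setting in which the roughly 5% iNaturalist18 accuracy is reported. Below is a minimal sketch assuming OpenAI's clip package; the class names, prompt template, and image path are placeholders, not the paper's setup.

```python
# Minimal zero-shot classification sketch with CLIP.
# Class names, prompt template, and image path are illustrative placeholders.
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)

class_names = ["monarch butterfly", "red fox", "sugar maple"]  # placeholder labels
text = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Cosine similarity between the image and every class prompt
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    logits = 100.0 * image_features @ text_features.T
    pred = logits.argmax(dim=-1)  # index of the best-matching class prompt
```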
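
To make the second and third bullets concrete, here is a minimal PyTorch sketch, not the paper's exact design: a small trainable head on top of frozen CLIP image features, trained with a Balanced Softmax loss, which shifts logits by the log of the per-class training frequencies so head classes do not dominate the cross-entropy. The decoder layout, hidden width, and dummy tensors are assumptions for illustration.

```python
# Sketch: lightweight decoder head over frozen ViT features + Balanced Softmax.
# Architecture and hyperparameters are illustrative, not the paper's exact design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LightweightDecoder(nn.Module):
    """Small trainable head mapping frozen ViT features to class logits."""
    def __init__(self, feat_dim: int, num_classes: int, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.LayerNorm(feat_dim),
            nn.Linear(feat_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.net(feats)

def balanced_softmax_loss(logits, targets, class_counts):
    """Balanced Softmax (Ren et al., 2020): add log class priors to the logits
    before cross-entropy, compensating for the skewed label distribution."""
    log_prior = torch.log(class_counts.float() / class_counts.sum())
    return F.cross_entropy(logits + log_prior, targets)

# Usage (shapes only): feats stand in for frozen CLIP image-encoder outputs.
num_classes, feat_dim = 8142, 512            # e.g. iNaturalist18 with ViT-B/16
decoder = LightweightDecoder(feat_dim, num_classes)
feats = torch.randn(4, feat_dim)             # dummy encoder features
targets = torch.randint(0, num_classes, (4,))
class_counts = torch.randint(1, 1000, (num_classes,))  # per-class train counts
loss = balanced_softmax_loss(decoder(feats), targets, class_counts)
loss.backward()
```

Because only the small head receives gradients, this style of setup keeps memory and compute costs low, which matches the paper's stated emphasis on training efficiency.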
Implications and Future Directions
The research underscores the role of refined training methodology and architectural modifications in realizing VLMs' potential on imbalanced data. Pairing VLMs with algorithms tailored to imbalanced settings can greatly broaden their applicability in safety-critical and domain-specific tasks that demand reliable accuracy across the full class distribution, including long tails of rare classes.
Moving forward, the paper opens several avenues of research: extending the exploration to a wider range of VLM architectures, studying their behavior on more diverse datasets, and making fuller use of the text encoder. Extending the approach to semi-supervised or self-supervised learning could further improve VLMs' robustness in low-resource settings.
In conclusion, this research contributes to the broader effort of adapting large pre-trained models to practical constraints, showing that the promise of VLMs can be realized efficiently and accurately even in challenging imbalanced contexts.