Delving into Multimodal Prompting for Fine-grained Visual Classification (2309.08912v2)
Abstract: Fine-grained visual classification (FGVC) involves categorizing fine subdivisions within a broader category, which poses challenges due to subtle inter-class discrepancies and large intra-class variations. However, prevailing approaches primarily focus on uni-modal visual concepts. Recent advancements in pre-trained vision-language models have demonstrated remarkable performance in various high-level vision tasks, yet the applicability of such models to FGVC tasks remains uncertain. In this paper, we aim to fully exploit the capabilities of cross-modal description to tackle FGVC tasks and propose a novel multimodal prompting solution, denoted as MP-FGVC, based on the contrastive language-image pre-training (CLIP) model. Our MP-FGVC comprises a multimodal prompts scheme and a multimodal adaptation scheme. The former includes a Subcategory-specific Vision Prompt (SsVP) and a Discrepancy-aware Text Prompt (DaTP), which explicitly highlight the subcategory-specific discrepancies from the perspectives of both vision and language. The latter aligns the vision and text prompting elements in a common semantic space, facilitating cross-modal collaborative reasoning through a Vision-Language Fusion Module (VLFM) for further improvement on FGVC. Moreover, we tailor a two-stage optimization strategy for MP-FGVC to fully leverage the pre-trained CLIP model and enable efficient adaptation for FGVC. Extensive experiments conducted on four FGVC datasets demonstrate the effectiveness of our MP-FGVC.
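For readers unfamiliar with how a CLIP-style model turns image and text embeddings into class predictions, the following is a minimal sketch of the generic mechanism only: cosine similarity between one image embedding and one text embedding per subcategory, scaled by a temperature and passed through a softmax. It is not the MP-FGVC implementation; SsVP, DaTP, and VLFM are not modeled here. The function names, the toy three-dimensional embeddings, and the temperature value of 0.07 are all illustrative assumptions.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_logits(image_emb, text_embs, temperature=0.07):
    """Return one temperature-scaled cosine similarity per class text embedding.

    The temperature is an assumed, illustrative value.
    """
    img = l2_normalize(image_emb)
    logits = []
    for text_emb in text_embs:
        txt = l2_normalize(text_emb)
        similarity = sum(a * b for a, b in zip(img, txt))
        logits.append(similarity / temperature)
    return logits


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(image_emb, text_embs, temperature=0.07):
    """Pick the class whose text embedding is most similar to the image embedding."""
    probs = softmax(cosine_logits(image_emb, text_embs, temperature))
    predicted = max(range(len(probs)), key=probs.__getitem__)
    return predicted, probs


# Toy embeddings (hypothetical): one image vector and three class text vectors.
image_emb = [0.9, 0.1, 0.2]
text_embs = [
    [1.0, 0.0, 0.0],  # hypothetical subcategory 0
    [0.0, 1.0, 0.0],  # hypothetical subcategory 1
    [0.0, 0.0, 1.0],  # hypothetical subcategory 2
]
pred, probs = classify(image_emb, text_embs)
```

In a real pipeline the embeddings would come from the pre-trained image and text encoders rather than hand-written vectors; the similarity-then-softmax step is the part this sketch is meant to make concrete.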