
Delving into Multimodal Prompting for Fine-grained Visual Classification (2309.08912v2)

Published 16 Sep 2023 in cs.CV and cs.MM

Abstract: Fine-grained visual classification (FGVC) involves categorizing fine subdivisions within a broader category, which poses challenges due to subtle inter-class discrepancies and large intra-class variations. However, prevailing approaches primarily focus on uni-modal visual concepts. Recent advancements in pre-trained vision-language models have demonstrated remarkable performance in various high-level vision tasks, yet the applicability of such models to FGVC tasks remains uncertain. In this paper, we aim to fully exploit the capabilities of cross-modal description to tackle FGVC tasks and propose a novel multimodal prompting solution, denoted as MP-FGVC, based on the contrastive language-image pre-training (CLIP) model. Our MP-FGVC comprises a multimodal prompt scheme and a multimodal adaptation scheme. The former includes a Subcategory-specific Vision Prompt (SsVP) and a Discrepancy-aware Text Prompt (DaTP), which explicitly highlight subcategory-specific discrepancies from the perspectives of both vision and language. The latter aligns the vision and text prompting elements in a common semantic space, facilitating cross-modal collaborative reasoning through a Vision-Language Fusion Module (VLFM) for further improvement on FGVC. Moreover, we tailor a two-stage optimization strategy for MP-FGVC to fully leverage the pre-trained CLIP model and expedite efficient adaptation to FGVC. Extensive experiments conducted on four FGVC datasets demonstrate the effectiveness of our MP-FGVC.
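The abstract describes the MP-FGVC architecture at a high level: learnable vision prompts and class-aware text prompts placed on top of a frozen CLIP backbone, with a Vision-Language Fusion Module combining the two streams before classification. The sketch below illustrates this general multimodal-prompting pattern in PyTorch; the module names, prompt counts, dimensions, and the cross-attention fusion design are assumptions for illustration and do not reproduce the authors' exact implementation or their two-stage optimization strategy.

```python
# Minimal sketch of a CLIP-style multimodal prompting classifier in the spirit of
# MP-FGVC. All names, shapes, and the fusion design are illustrative assumptions.
import torch
import torch.nn as nn

class VisionLanguageFusion(nn.Module):
    """Cross-attention block: vision tokens attend to (class-wise) text prompt tokens."""
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, vis_tokens: torch.Tensor, txt_tokens: torch.Tensor) -> torch.Tensor:
        fused, _ = self.attn(query=vis_tokens, key=txt_tokens, value=txt_tokens)
        return self.norm(vis_tokens + fused)  # residual fusion

class MPFGVCSketch(nn.Module):
    def __init__(self, clip_vis_dim=768, embed_dim=512, n_vis_prompts=4,
                 n_txt_prompts=8, n_classes=200):
        super().__init__()
        # Subcategory-specific vision prompts: learnable tokens prepended to patch tokens.
        self.vision_prompts = nn.Parameter(torch.randn(n_vis_prompts, clip_vis_dim) * 0.02)
        # Discrepancy-aware text prompts: learnable context vectors, one set per class.
        self.text_prompts = nn.Parameter(torch.randn(n_classes, n_txt_prompts, embed_dim) * 0.02)
        self.vis_proj = nn.Linear(clip_vis_dim, embed_dim)   # map vision tokens to the shared space
        self.fusion = VisionLanguageFusion(embed_dim)
        self.classifier = nn.Linear(embed_dim, n_classes)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (B, N, clip_vis_dim) from a frozen CLIP image encoder.
        b = patch_tokens.size(0)
        prompts = self.vision_prompts.unsqueeze(0).expand(b, -1, -1)
        vis = self.vis_proj(torch.cat([prompts, patch_tokens], dim=1))     # (B, P+N, D)
        # Pool the class-wise text prompts into one token per class for fusion.
        txt = self.text_prompts.mean(dim=1).unsqueeze(0).expand(b, -1, -1) # (B, C, D)
        fused = self.fusion(vis, txt)
        return self.classifier(fused.mean(dim=1))                          # (B, C) logits

# Usage with dummy ViT-B/16-sized patch tokens (196 patches, CLS token dropped).
model = MPFGVCSketch()
logits = model(torch.randn(2, 196, 768))
print(logits.shape)  # torch.Size([2, 200])
```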
