GraphAdapter: Tuning Vision-Language Models With Dual Knowledge Graph (2309.13625v1)

Published 24 Sep 2023 in cs.CV and cs.AI

Abstract: Adapter-style efficient transfer learning (ETL) has shown excellent performance in the tuning of vision-language models (VLMs) under the low-data regime, where only a few additional parameters are introduced to excavate the task-specific knowledge based on the general and powerful representation of VLMs. However, most adapter-style works face two limitations: (i) modeling task-specific knowledge with a single modality only; and (ii) overlooking the exploitation of the inter-class relationships in downstream tasks, thereby leading to sub-optimal solutions. To mitigate that, we propose an effective adapter-style tuning strategy, dubbed GraphAdapter, which performs the textual adapter by explicitly modeling the dual-modality structure knowledge (i.e., the correlation of different semantics/classes in textual and visual modalities) with a dual knowledge graph. In particular, the dual knowledge graph is established with two sub-graphs, i.e., a textual knowledge sub-graph, and a visual knowledge sub-graph, where the nodes and edges represent the semantics/classes and their correlations in two modalities, respectively. This enables the textual feature of each prompt to leverage the task-specific structure knowledge from both textual and visual modalities, yielding a more effective classifier for downstream tasks. Extensive experimental results on 11 benchmark datasets reveal that our GraphAdapter significantly outperforms previous adapter-based methods. The code will be released at https://github.com/lixinustc/GraphAdapter

Authors (6)
  1. Xin Li
  2. Dongze Lian
  3. Zhihe Lu
  4. Jiawang Bai
  5. Zhibo Chen
  6. Xinchao Wang
Citations (41)

Summary

An Overview of "GraphAdapter: Tuning Vision-Language Models With Dual Knowledge Graph"

This paper introduces GraphAdapter, a framework for adapter-style efficient transfer learning (ETL) that enhances the tuning of vision-language models (VLMs) by leveraging a dual knowledge graph spanning the textual and visual modalities. The main objective is to address two limitations of existing adapter-style approaches: they typically model task-specific knowledge with a single modality, and they overlook the inter-class relationships inherent in downstream tasks. GraphAdapter overcomes these issues by explicitly modeling structure knowledge, yielding more effective classifiers for vision-language tasks in low-data scenarios.
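For context, the generic adapter-style recipe that these methods share attaches a small trainable module on top of frozen VLM features. Below is a minimal sketch in the spirit of CLIP-Adapter, not GraphAdapter itself; the class name FeatureAdapter, the bottleneck shape, and the residual ratio are illustrative assumptions.

```python
# Minimal sketch of generic adapter-style tuning on frozen VLM features
# (in the spirit of CLIP-Adapter). All names and hyperparameters here are
# illustrative assumptions, not GraphAdapter's released design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureAdapter(nn.Module):
    """Small bottleneck MLP added on top of frozen VLM features."""
    def __init__(self, dim: int = 512, reduction: int = 4, ratio: float = 0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim // reduction), nn.ReLU(),
            nn.Linear(dim // reduction, dim), nn.ReLU(),
        )
        self.ratio = ratio  # residual blend between adapted and frozen features

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = F.normalize(x, dim=-1)
        return self.ratio * self.net(x) + (1 - self.ratio) * x

# Usage: only the adapter's few parameters are trained; the VLM stays frozen.
adapter = FeatureAdapter(dim=512)
image_feats = torch.randn(8, 512)           # a batch of frozen CLIP image features
class_weights = torch.randn(512, 10)        # e.g., a text-derived classifier
logits = F.normalize(adapter(image_feats), dim=-1) @ class_weights
```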

Key Contributions

  1. Dual Knowledge Graph: GraphAdapter employs a dual knowledge graph composed of two sub-graphs—textual and visual—where nodes correspond to the semantics or classes and edges represent the correlations between different classes in their respective modality spaces.
  2. Graph Learning: Graph learning techniques, particularly graph convolutional networks (GCNs), are used to propagate task-specific structure knowledge through the features of both modalities, enabling a tuning process that accounts for inter-class relationships not captured by single-modality adaptation (see the sketch after this list).
  3. Integration of Multimodal Knowledge: The paper proposes integrating both intra-modality and cross-modality structure knowledge within the adapter framework, enhancing the adaptation of VLMs by enriching the embedded knowledge drawn from textual and visual domains.
  4. Empirical Results: Extensive experiments conducted across 11 benchmark datasets demonstrate that GraphAdapter significantly outperforms prior methods in adapter-style tuning, showcasing its potential for yielding superior classification results with minimal data.
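To make the dual knowledge graph concrete, here is a minimal sketch inferred only from the abstract: each sub-graph's adjacency is built from pairwise class-feature similarity, and one graph-convolution step lets each textual node aggregate structure knowledge from both the textual and visual sub-graphs. The function names, the cosine-similarity adjacency, and the fusion/residual weights are assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the dual knowledge graph idea, inferred from the
# abstract; adjacency construction and fusion weights are assumptions.
import torch
import torch.nn.functional as F

def build_subgraph(class_feats: torch.Tensor) -> torch.Tensor:
    """Adjacency from pairwise cosine similarity of per-class features.

    class_feats: (C, D) class embeddings in one modality
    returns:     (C, C) row-normalized adjacency encoding inter-class relations
    """
    feats = F.normalize(class_feats, dim=-1)
    adj = feats @ feats.t()               # cosine similarity as edge weight
    return torch.softmax(adj, dim=-1)     # row-normalize for propagation

def gcn_layer(nodes, adj, weight):
    """One graph-convolution step: aggregate neighbors, then transform."""
    return F.relu(adj @ nodes @ weight)

# Toy usage with C classes and D-dim CLIP-style embeddings.
C, D = 10, 512
text_feats = torch.randn(C, D)            # e.g., prompt embeddings per class
image_feats = torch.randn(C, D)           # e.g., few-shot visual prototypes

adj_text = build_subgraph(text_feats)     # textual knowledge sub-graph
adj_img = build_subgraph(image_feats)     # visual knowledge sub-graph
W = torch.randn(D, D) * 0.02              # learnable GCN weight (toy init)

# Each textual node draws structure knowledge from both sub-graphs, then is
# fused with its original feature via a residual mix (alpha is assumed).
refined = 0.5 * gcn_layer(text_feats, adj_text, W) \
        + 0.5 * gcn_layer(text_feats, adj_img, W)
alpha = 0.7
classifier = F.normalize(alpha * text_feats + (1 - alpha) * refined, dim=-1)
```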

Strong Numerical Results

The authors report that GraphAdapter surpasses previous methods in average performance under few-shot settings, which are critical for scenarios where data availability is sparse. GraphAdapter also achieves improved results on challenging fine-grained classification tasks such as FGVCAircraft, illustrating the efficacy of modeling dual-modality structure knowledge.

Implications and Future Directions

GraphAdapter positions itself as a strong contender in the field of ETL, particularly for vision-language models constrained by data limitations. The insights provided by leveraging structured graphs point towards promising future investigations into more complex graph-learning paradigms, deeper integration techniques for knowledge graphs, and expanded evaluation across varied VLM architectures. The dual-modality approach also sets a precedent for further exploration of multimodal learning frameworks, which could enhance the applicability and robustness of vision-language applications in realistic and varied environments.

The findings and methodologies proposed in this paper could substantially influence how researchers approach the tuning of VLMs, potentially extending to broader applications wherein multimodal data plays a critical role. By advancing the discourse surrounding the representation and utilization of structured knowledge within ETL frameworks, GraphAdapter paves the way for innovative approaches to effectively leverage large-scale, pre-trained models while minimizing the reliance on data and computing resources.
