
RCA: Region Conditioned Adaptation for Visual Abductive Reasoning (2303.10428v5)

Published 18 Mar 2023 in cs.CV

Abstract: Visual abductive reasoning aims to infer likely explanations for visual observations. We propose Region Conditioned Adaptation (RCA), a simple yet effective hybrid parameter-efficient fine-tuning method that equips a frozen CLIP model with the ability to infer explanations from local visual cues. We encode "local hints" and "global contexts" into visual prompts of the CLIP model separately, at fine- and coarse-grained levels. Adapters are commonly used to fine-tune CLIP for downstream tasks; we design a new attention adapter that directly steers the focus of the attention map through trainable query and key projections added to the frozen CLIP model. Finally, we train the model with a modified contrastive loss that regresses the visual feature simultaneously toward the features of the literal description and the plausible explanation, enabling CLIP to maintain both perception and reasoning abilities. Experiments on the Sherlock visual abductive reasoning benchmark show that RCA significantly outperforms previous state-of-the-art methods, ranking 1st on the leaderboards (e.g., Human Acc: RCA 31.74 vs. CPT-CLIP 29.58; higher is better). We also validate that RCA generalizes to local perception benchmarks such as RefCOCO. We open-source our project at https://github.com/LUNAProject22/RPA.
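To make the two main ideas in the abstract concrete, below is a minimal PyTorch sketch, not the authors' released code: an attention adapter that adds small trainable query/key projections on top of a frozen attention block, and a contrastive loss that pulls each image feature toward both its literal description ("clue") and its plausible explanation ("inference") embeddings. Class and function names, the bottleneck size, and the equal weighting of the two loss terms are illustrative assumptions.

```python
# Sketch (not the authors' code) of (1) an attention adapter with trainable
# query/key projections over a frozen backbone and (2) a dual-target contrastive loss.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionAdapter(nn.Module):
    """Re-weights a frozen attention map via small trainable query/key projections."""

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        # Low-rank trainable projections; the frozen backbone weights stay untouched.
        self.q_adapter = nn.Sequential(nn.Linear(dim, bottleneck), nn.Linear(bottleneck, dim))
        self.k_adapter = nn.Sequential(nn.Linear(dim, bottleneck), nn.Linear(bottleneck, dim))

    def forward(self, q_frozen: torch.Tensor, k_frozen: torch.Tensor) -> torch.Tensor:
        # q_frozen, k_frozen: (batch, tokens, dim) produced by a frozen CLIP layer.
        q = q_frozen + self.q_adapter(q_frozen)  # steer queries
        k = k_frozen + self.k_adapter(k_frozen)  # steer keys
        scale = q.size(-1) ** -0.5
        attn = torch.softmax(q @ k.transpose(-2, -1) * scale, dim=-1)
        return attn  # re-focused attention map


def dual_target_contrastive_loss(img: torch.Tensor,
                                 clue_txt: torch.Tensor,
                                 expl_txt: torch.Tensor,
                                 temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style loss pulling each image feature toward its literal description
    and its plausible explanation simultaneously (equal weighting assumed here)."""
    img = F.normalize(img, dim=-1)
    clue_txt = F.normalize(clue_txt, dim=-1)
    expl_txt = F.normalize(expl_txt, dim=-1)
    labels = torch.arange(img.size(0), device=img.device)
    loss_clue = F.cross_entropy(img @ clue_txt.t() / temperature, labels)
    loss_expl = F.cross_entropy(img @ expl_txt.t() / temperature, labels)
    return 0.5 * (loss_clue + loss_expl)


if __name__ == "__main__":
    # Toy shapes only; real features would come from a frozen CLIP backbone.
    adapter = AttentionAdapter(dim=512)
    q = torch.randn(2, 50, 512)
    k = torch.randn(2, 50, 512)
    print(adapter(q, k).shape)  # (2, 50, 50)
    img, clue, expl = (torch.randn(4, 512) for _ in range(3))
    print(dual_target_contrastive_loss(img, clue, expl).item())
```

The residual form (frozen output plus a small trainable correction) keeps the adapter parameter-efficient while still letting gradients shift where the frozen model attends; the paper's exact adapter placement and loss formulation should be taken from the linked repository.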

Authors (3)
  1. Hao Zhang (947 papers)
  2. Yeo Keat Ee (1 paper)
  3. Basura Fernando (60 papers)
Citations (1)