Multi-modal Large Language Model Enhanced Pseudo 3D Perception Framework for Visual Commonsense Reasoning (2301.13335v2)

Published 30 Jan 2023 in cs.CV and cs.MM

Abstract: The visual commonsense reasoning (VCR) task is to choose an answer and provide a justifying rationale based on a given image and textual question. Representative works first recognize objects in images and then associate them with key words in texts. However, existing approaches do not consider the exact positions of objects in a human-like three-dimensional (3D) manner, leaving them unable to accurately distinguish objects and understand visual relations. Recently, multi-modal LLMs (MLLMs) have been used as powerful tools for several multi-modal tasks, but not yet for VCR, which requires elaborate reasoning on specific visual objects referred to in texts. In light of the above, an MLLM-enhanced pseudo 3D perception framework is designed for VCR. Specifically, we first demonstrate that the relation between objects is relevant to object depths in images, and hence introduce object depth into VCR frameworks to infer the 3D positions of objects in images. Then, a depth-aware Transformer is proposed to encode depth differences between objects into its attention mechanism, so as to discriminatively associate objects with visual scenes guided by depth. To further associate the answer with the depth of the visual scene, each word in the answer is tagged with a pseudo depth, realizing depth-aware association between answer words and objects. In addition, BLIP-2 is employed as the MLLM to process images and texts, and the referring expressions in texts involving specific visual objects are modified with linguistic object labels to serve as comprehensible MLLM inputs. Finally, a parameter optimization technique is devised to fully account for the quality of data batches based on multi-level reasoning confidence. Experiments on the VCR dataset demonstrate the superiority of the proposed framework over state-of-the-art approaches.
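The depth-aware Transformer described in the abstract encodes depth differences between objects into the attention mechanism. As a rough illustration, the sketch below adds a learned bias, computed from pairwise depth differences, to the logits of a single-head attention layer. This is a minimal sketch of one plausible formulation, not the paper's exact layer: the class name DepthAwareAttention, the single-head setup, and the small MLP depth_bias are assumptions introduced here for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthAwareAttention(nn.Module):
    """Single-head attention whose logits are shifted by a learned bias
    derived from pairwise object depth differences (illustrative sketch)."""

    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        # Hypothetical mapping from a scalar depth difference to a scalar bias.
        self.depth_bias = nn.Sequential(nn.Linear(1, dim), nn.ReLU(), nn.Linear(dim, 1))
        self.scale = dim ** -0.5

    def forward(self, obj_feats: torch.Tensor, obj_depths: torch.Tensor) -> torch.Tensor:
        # obj_feats: (batch, num_objects, dim) object features
        # obj_depths: (batch, num_objects) depths, e.g. from a monocular depth estimator
        q, k, v = self.q_proj(obj_feats), self.k_proj(obj_feats), self.v_proj(obj_feats)
        logits = torch.einsum("bid,bjd->bij", q, k) * self.scale
        # Pairwise depth differences, shape (batch, num_objects, num_objects, 1).
        depth_diff = (obj_depths.unsqueeze(2) - obj_depths.unsqueeze(1)).unsqueeze(-1)
        logits = logits + self.depth_bias(depth_diff).squeeze(-1)
        attn = F.softmax(logits, dim=-1)
        return torch.einsum("bij,bjd->bid", attn, v)

# Example usage with random inputs.
layer = DepthAwareAttention(dim=256)
feats = torch.randn(2, 10, 256)   # 10 detected objects per image
depths = torch.rand(2, 10)        # pseudo depths normalized to [0, 1]
out = layer(feats, depths)        # (2, 10, 256)
```

Under this reading, objects at similar (or dissimilar) relative depths can be attended to more or less strongly depending on the learned bias, which matches the abstract's goal of associating objects with visual scenes guided by depth.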

References (38)
  1. C. Shang, H. Li, H. Qiu, Q. Wu, F. Meng, T. Zhao, and K. N. Ngan, “Cross-modal recurrent semantic comprehension for referring image segmentation,” IEEE Trans. Circuits Syst. Video Technol., vol. 33, no. 7, pp. 3229–3242, Jul. 2023.
  2. J. Zhu and H. Wang, “Multi-scale conditional relationship graph network for referring relationships in images,” IEEE Trans. Cogn. Dev. Syst., vol. 14, no. 2, pp. 752–760, Jun. 2022.
  3. W. Xu, Z. Miao, J. Yu, Y. Tian, L. Wan, and Q. Ji, “Bridging video and text: A two-step polishing transformer for video captioning,” IEEE Trans. Circuits Syst. Video Technol., vol. 32, no. 9, pp. 6293–6307, Sep. 2022.
  4. L. Yan, S. Ma, Q. Wang, Y. Chen, X. Zhang, A. Savakis, and D. Liu, “Video captioning using global-local representation,” IEEE Trans. Circuits Syst. Video Technol., vol. 32, no. 10, pp. 6642–6656, Oct. 2022.
  5. L. Zhao, D. Cai, J. Zhang, L. Sheng, D. Xu, R. Zheng, Y. Zhao, L. Wang, and X. Fan, “Toward explainable 3d grounded visual question answering: A new benchmark and strong baseline,” IEEE Trans. Circuits Syst. Video Technol., vol. 33, no. 6, pp. 2935–2949, Jun. 2023.
  6. W. Guo, Y. Zhang, J. Yang, and X. Yuan, “Re-attention for visual question answering,” IEEE Trans. Image Process., vol. 30, pp. 6730–6743, Jul. 2021.
  7. R. Zellers, Y. Bisk, A. Farhadi, and Y. Choi, “From recognition to cognition: Visual commonsense reasoning,” in Proc. CVPR’19, Jun. 2019, pp. 6713–6724.
  8. X. Zhang, F. Zhang, and C. Xu, “Explicit cross-modal representation learning for visual commonsense reasoning,” IEEE Trans. Multimedia, vol. 24, pp. 2986–2997, 2022.
  9. D. Zhu, J. Chen, X. Shen, X. Li, and M. Elhoseiny, “MiniGPT-4: Enhancing vision-language understanding with advanced large language models,” in arXiv:2304.10592, Apr. 2023.
  10. H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual instruction tuning,” in arXiv:2304.08485, Apr. 2023.
  11. W. Yu, J. Zhou, W. Yu, X. Liang, and N. Xiao, “Heterogeneous graph learning for visual commonsense reasoning,” in Proc. NeurIPS’19, Dec. 2019, pp. 2769–2779.
  12. J. Lin, U. Jain, and A. G. Schwing, “TAB-VCR: Tags and attributes based VCR baselines,” in Proc. NeurIPS’19, Dec. 2019, pp. 15615–15628.
  13. X. Zhang, F. Zhang, and C. Xu, “Multi-level counterfactual contrast for visual commonsense reasoning,” in Proc. ACM MM’21, Oct. 2021, pp. 1793–1802.
  14. R. Ranftl, A. Bochkovskiy, and V. Koltun, “Vision transformers for dense prediction,” in Proc. ICCV’21, Oct. 2021, pp. 12179–12188.
  15. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Proc. NeurIPS’17, Dec. 2017, pp. 5998–6008.
  16. J. Li, D. Li, S. Savarese, and S. Hoi, “BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models,” in arXiv:2301.12597, Jan. 2023.
  17. Z. Wen and Y. Peng, “Multi-level knowledge injecting for visual commonsense reasoning,” IEEE Trans. Circuits Syst. Video Technol., vol. 31, no. 3, pp. 1042–1054, Mar. 2021.
  18. W. Su, X. Zhu, Y. Cao, B. Li, L. Lu, F. Wei, and J. Dai, “VL-BERT: Pre-training of generic visual-linguistic representations,” in Proc. ICLR’20, Apr. 2020.
  19. W. Li, C. Gao, G. Niu, X. Xiao, H. Liu, J. Liu, H. Wu, and H. Wang, “UNIMO: Towards unified-modal understanding and generation via cross-modal contrastive learning,” in Proc. ACL’21, Aug. 2021, pp. 2592–2607.
  20. A. Wu, L. Zhu, Y. Han, and Y. Yang, “Connective cognition network for directional visual commonsense reasoning,” in Proc. NeurIPS’19, Dec. 2019, pp. 5669–5679.
  21. J. Zhu and H. Wang, “Multi-modal structure-embedding graph transformer for visual commonsense reasoning,” IEEE Trans. Multimedia (Early access), 2023.
  22. H. Zhang, H. Uchiyama, S. Ono, and H. Kawasaki, “MOTSLAM: Mot-assisted monocular dynamic slam using single-view depth estimation,” in arXiv:2210.02038, Oct. 2022.
  23. Y. Cao, H. Zhang, Y. Li, C. Ren, and C. Lang, “CMAN: Leaning global structure correlation for monocular 3d object detection,” IEEE Trans. Intell. Transp. Syst., vol. 23, no. 12, pp. 24727–24737, Dec. 2022.
  24. M. Ergül and A. Alatan, “Depth is all you need: Single-stage weakly supervised semantic segmentation from image-level supervision,” in Proc. ICIP’22, Oct. 2022, pp. 4233–4237.
  25. A. Pilzer, S. Lathuilière, D. Xu, M. M. Puscas, E. Ricci, and N. Sebe, “Progressive fusion for unsupervised binocular depth estimation using cycled networks,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 42, no. 10, pp. 2380–2395, Oct. 2020.
  26. K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. CVPR’16, Jun. 2016, pp. 770–778.
  27. J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proc. NAACL’19, Jun. 2019, pp. 4171–4186.
  28. D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proc. ICLR’15, May 2015.
  29. Q. Chen, X. Zhu, Z.-H. Ling, S. Wei, H. Jiang, and D. Inkpen, “Enhanced LSTM for natural language inference,” in Proc. ACL’17, Jul. 2017, pp. 1657–1668.
  30. A. Jabri, A. Joulin, and L. Van Der Maaten, “Revisiting visual question answering baselines,” in Proc. ECCV’16, Oct. 2016, pp. 727–739.
  31. P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang, “Bottom-up and top-down attention for image captioning and visual question answering,” in Proc. CVPR’18, Jun. 2018, pp. 6077–6086.
  32. J.-H. Kim, K.-W. On, W. Lim, J. Kim, J.-W. Ha, and B.-T. Zhang, “Hadamard product for low-rank bilinear pooling,” in Proc. ICLR’17, Apr. 2017.
  33. H. Ben-Younes, R. Cadene, M. Cord, and N. Thome, “MUTAN: Multimodal Tucker fusion for visual question answering,” in Proc. ICCV’17, Oct. 2017, pp. 2612–2620.
  34. Z. Li, Y. Guo, K. Wang, Y. Wei, L. Nie, and M. Kankanhalli, “Joint answering and explanation for visual commonsense reasoning,” in arXiv:2202.12626, Feb. 2022.
  35. T. Wang, J. Huang, H. Zhang, and Q. Sun, “Visual commonsense R-CNN,” in Proc. CVPR’20, Jun. 2020, pp. 10757–10767.
  36. K. Ye and A. Kovashka, “A case study of the shortcut effects in visual commonsense reasoning,” in Proc. AAAI’21, Feb. 2021, pp. 3181–3189.
  37. L. H. Li, M. Yatskar, D. Yin, C.-J. Hsieh, and K.-W. Chang, “VisualBERT: A simple and performant baseline for vision and language,” in arXiv:1908.03557, Aug. 2019.
  38. Y.-C. Chen, L. Li, L. Yu, A. El Kholy, F. Ahmed, Z. Gan, Y. Cheng, and J. Liu, “UNITER: Universal image-text representation learning,” in Proc. ECCV’20, Jul. 2020, pp. 104–120.
Authors (3)
  1. Jian Zhu (59 papers)
  2. Hanli Wang (22 papers)
  3. Miaojing Shi (53 papers)
Citations (1)