Self-Bootstrapped Visual-Language Model for Knowledge Selection and Question Answering (2404.13947v3)

Published 22 Apr 2024 in cs.CV

Abstract: While large visual-language models (LVLMs) have shown promising results on traditional visual question answering benchmarks, they still struggle with complex VQA problems that require diverse world knowledge. Motivated by research on retrieval-augmented generation in natural language processing, we use Dense Passage Retrieval (DPR) to retrieve related knowledge to help the model answer questions. However, DPR conducts retrieval in natural-language space, which may not capture image information comprehensively; as a result, the retrieved knowledge may not genuinely help answer the question, degrading the performance of the overall system. To address this issue, we propose a novel framework that leverages the visual-language model both to select the key knowledge retrieved by DPR and to answer questions. The framework consists of two modules, Selector and Answerer, both initialized from the LVLM and parameter-efficiently finetuned by self-bootstrapping: find key knowledge in the retrieved knowledge documents using the Selector, then use it to finetune the Answerer to predict answers; obtain pseudo-labels for key knowledge documents based on the Answerer's predictions and weak supervision labels, then finetune the Selector to select key knowledge; repeat. Our framework significantly improves over the baseline on the challenging open-domain knowledge-based VQA benchmark OK-VQA, achieving a state-of-the-art accuracy of 62.83%. Our code is publicly available at https://github.com/haodongze/Self-KSel-QAns.
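
As a rough illustration of the alternation the abstract describes, here is a minimal structural sketch of the self-bootstrapping loop. It is an outline under stated assumptions, not the authors' implementation: all names (`Example`, `Selector`, `Answerer`, the exact pseudo-labeling rule) are hypothetical stand-ins, and the LVLM initialization and parameter-efficient finetuning steps are elided as no-ops.

```python
from dataclasses import dataclass

@dataclass
class Example:
    question: str
    image_id: str
    retrieved_docs: list  # top-k knowledge documents returned upstream by DPR
    weak_labels: set      # weak supervision: docs whose text contains the gold answer
    gold_answer: str

class Selector:
    """Picks 'key' documents from the DPR candidates (LVLM-backed in the paper)."""

    def select(self, ex: Example, k: int = 3) -> list:
        # Placeholder ranking: the real Selector scores (image, question, document)
        # triples with the visual-language model.
        return ex.retrieved_docs[:k]

    def finetune(self, train_set: list, pseudo_labels: dict) -> None:
        pass  # parameter-efficient finetuning on pseudo-labels (elided)

class Answerer:
    """Generates an answer from the question, image, and selected documents."""

    def predict(self, ex: Example, docs: list) -> str:
        return ""  # LVLM answer generation (elided)

    def finetune(self, train_set: list, selected: dict) -> None:
        pass  # finetune to predict gold answers given the selected knowledge

def bootstrap(selector: Selector, answerer: Answerer, train_set: list, rounds: int = 3):
    for _ in range(rounds):
        # Step 1: Selector finds key knowledge; Answerer is finetuned on it.
        selected = {id(ex): selector.select(ex) for ex in train_set}
        answerer.finetune(train_set, selected)

        # Step 2: pseudo-label a document as "key" when it carries the weak
        # supervision signal and the Answerer answers correctly given it alone;
        # then finetune the Selector on those pseudo-labels, and repeat.
        pseudo_labels = {
            id(ex): [doc for doc in ex.retrieved_docs
                     if doc in ex.weak_labels
                     and answerer.predict(ex, [doc]) == ex.gold_answer]
            for ex in train_set
        }
        selector.finetune(train_set, pseudo_labels)
```

The design point, as described in the abstract, is the EM-style alternation: the Answerer's success on a document becomes the training signal for the Selector, and a better Selector in turn supplies better knowledge for finetuning the Answerer.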

References (52)
  1. Vicuna (2023). https://github.com/lm-sys/FastChat.
  2. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems 35 (2022), 23716–23736.
  3. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (2018), 6077–6086. https://api.semanticscholar.org/CorpusID:3753452
  4. Vqa: Visual question answering. In Proceedings of the IEEE international conference on computer vision. 2425–2433.
  5. Language models are few-shot learners. Advances in neural information processing systems 33 (2020), 1877–1901.
  6. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416 (2022).
  7. InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning. arXiv preprint arXiv:2305.06500 (2023). https://api.semanticscholar.org/CorpusID:258615266
  8. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
  9. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020).
  10. Eva: Exploring the limits of masked visual representation learning at scale. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 19358–19369.
  11. Transform-retrieve-generate: Natural language-centric outside-knowledge visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 5067–5077.
  12. Conceptbert: Concept-aware representation for visual question answering. In Findings of the Association for Computational Linguistics: EMNLP 2020. 489–498.
  13. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition. 6904–6913.
  14. Kat: A knowledge augmented transformer for vision-and-language. arXiv preprint arXiv:2112.08614 (2021).
  15. LoRA: Low-Rank Adaptation of Large Language Models. In International Conference on Learning Representations. https://openreview.net/forum?id=nZeVKeeFYf9
  16. Learning to reason: End-to-end module networks for visual question answering. In Proceedings of the IEEE international conference on computer vision. 804–813.
  17. Promptcap: Prompt-guided task-aware image captioning. arXiv preprint arXiv:2211.09699 (2022).
  18. Drew A Hudson and Christopher D Manning. 2019. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 6700–6709.
  19. In Defense of Grid Features for Visual Question Answering. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2020), 10264–10273. https://api.semanticscholar.org/CorpusID:210156985
  20. Billion-scale similarity search with gpus. IEEE Transactions on Big Data 7, 3 (2019), 535–547.
  21. Dense passage retrieval for open-domain question answering. arXiv preprint arXiv:2004.04906 (2020).
  22. Bilinear attention networks. Advances in neural information processing systems 31 (2018).
  23. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597 (2023).
  24. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International conference on machine learning. PMLR, 12888–12900.
  25. Relation-aware graph attention network for visual question answering. In Proceedings of the IEEE/CVF international conference on computer vision. 10313–10322.
  26. Oscar: Object-semantics aligned pre-training for vision-language tasks. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXX 16. Springer, 121–137.
  27. Weizhe Lin and Bill Byrne. 2022. Retrieval augmented visual question answering with outside knowledge. arXiv preprint arXiv:2210.03809 (2022).
  28. Fine-grained Late-interaction Multi-modal Retrieval for Retrieval Augmented Visual Question Answering. arXiv preprint arXiv:2309.17133 (2023). https://api.semanticscholar.org/CorpusID:263310932
  29. Revive: Regional visual representation matters in knowledge-based visual question answering. Advances in Neural Information Processing Systems 35 (2022), 10560–10571.
  30. Visual instruction tuning. arXiv preprint arXiv:2304.08485 (2023).
  31. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in neural information processing systems 32 (2019).
  32. Weakly-supervised visual-retriever-reader for knowledge-based question answering. arXiv preprint arXiv:2109.04014 (2021).
  33. Krisp: Integrating implicit and symbolic knowledge for open-domain knowledge-based vqa. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 14111–14121.
  34. Ok-vqa: A visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE/cvf conference on computer vision and pattern recognition. 3195–3204.
  35. Learning transferable visual models from natural language supervision. In International conference on machine learning. PMLR, 8748–8763.
  36. A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge. In European Conference on Computer Vision. https://api.semanticscholar.org/CorpusID:249375629
  37. Prompting large language models with answer heuristics for knowledge-based visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 14974–14983.
  38. How Much Can CLIP Benefit Vision-and-Language Tasks?. In International Conference on Learning Representations.
  39. Combo of Thinking and Observing for Outside-Knowledge VQA. arXiv preprint arXiv:2305.06407 (2023).
  40. Conceptnet 5.5: An open multilingual graph of general knowledge. In Proceedings of the AAAI conference on artificial intelligence, Vol. 31.
  41. Hao Tan and Mohit Bansal. 2019. Lxmert: Learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490 (2019).
  42. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023).
  43. Denny Vrandečić and Markus Krötzsch. 2014. Wikidata: a free collaborative knowledgebase. Commun. ACM 57, 10 (2014), 78–85.
  44. Explicit knowledge-based reasoning for visual question answering. In Proceedings of the 26th International Joint Conference on Artificial Intelligence (Melbourne, Australia) (IJCAI’17). AAAI Press, 1290–1296.
  45. Fvqa: Fact-based visual question answering. IEEE transactions on pattern analysis and machine intelligence 40, 10 (2017), 2413–2427.
  46. Explicit knowledge-based reasoning for visual question answering. arXiv preprint arXiv:1511.02570 (2015).
  47. Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. In International Conference on Machine Learning. PMLR, 23318–23340.
  48. Multi-modal answer validation for knowledge-based vqa. In Proceedings of the AAAI conference on artificial intelligence, Vol. 36. 2712–2721.
  49. An empirical study of gpt-3 for few-shot knowledge-based vqa. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 36. 3081–3089.
  50. Deep modular co-attention networks for visual question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 6281–6290.
  51. Vinvl: Revisiting visual representations in vision-language models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 5579–5588.
  52. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068 (2022).
Authors (5)
  1. Dongze Hao (2 papers)
  2. Qunbo Wang (5 papers)
  3. Longteng Guo (31 papers)
  4. Jie Jiang (246 papers)
  5. Jing Liu (525 papers)