SQ-LLaVA: Self-Questioning for Large Vision-Language Assistant (2403.11299v2)
Abstract: Recent advances in vision-language models have shown notable generalization across a broad range of tasks through visual instruction tuning. However, bridging the gap between the pre-trained vision encoder and the large language model remains the bottleneck of the whole network. To improve cross-modality alignment, existing works typically collect more visual instruction data covering a broader range of vision tasks and fine-tune the model for question answering; such data, however, is costly to obtain and does not thoroughly exploit the rich contextual information contained in images. This paper makes a first attempt to harness this overlooked context within visual instruction data by training the model, in a self-supervised manner, to ask high-quality questions. To this end, we introduce a novel framework named SQ-LLaVA: Self-Questioning for Large Vision-Language Assistant. SQ-LLaVA generates flexible and meaningful image-related questions by analyzing visual clues and prior language knowledge, signifying an advanced level of generalized visual understanding. Moreover, fine-tuning SQ-LLaVA on higher-quality instruction data yields a performance improvement over traditional visual instruction tuning methods, highlighting the efficacy of self-questioning in achieving a deeper and more nuanced comprehension of visual content across various contexts.
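To make the self-questioning idea concrete, below is a minimal sketch of how existing visual instruction data could be repurposed for such an objective: in addition to the standard answer-prediction target, each sample also yields a target where the model must generate the question itself from the image. The exact prompt format, the `[vusr]` questioner token, and the data schema here are illustrative assumptions, not the paper's actual implementation.

```python
# Sketch: turning one (image, question, answer) instruction sample into two
# supervised targets, an answering target and a self-questioning target.
# Prompt tokens ("<image>", "USER:", "[vusr]") are hypothetical placeholders.

from dataclasses import dataclass
from typing import Dict, List


@dataclass
class VQASample:
    image_id: str
    question: str
    answer: str


def build_training_texts(sample: VQASample) -> List[Dict[str, str]]:
    """Return two targets per sample: answer prediction and question generation."""
    answering = {
        "image_id": sample.image_id,
        # Standard visual instruction tuning: predict the answer given the question.
        "prompt": f"<image>\nUSER: {sample.question}\nASSISTANT:",
        "target": sample.answer,
    }
    self_questioning = {
        "image_id": sample.image_id,
        # Self-questioning: predict a meaningful question from the image alone,
        # exploiting context already present in the instruction data.
        "prompt": "<image>\n[vusr]:",
        "target": sample.question,
    }
    return [answering, self_questioning]


if __name__ == "__main__":
    s = VQASample("coco_0001", "What color is the bus?", "It is red.")
    for t in build_training_texts(s):
        print(t["prompt"], "->", t["target"])
```

Both targets can then be optimized with the usual next-token prediction loss, so no extra annotation beyond the existing instruction data is required under this sketch.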
Authors: Guohao Sun, Can Qin, Jiamian Wang, Zeyuan Chen, Ran Xu, Zhiqiang Tao