Synthetic Multimodal Question Generation (2407.02233v2)
Abstract: Multimodal Retrieval Augmented Generation (MMRAG) is a powerful approach to question-answering over multimodal documents. A key challenge with evaluating MMRAG is the paucity of high-quality datasets matching the question styles and modalities of interest. In light of this, we propose SMMQG, a synthetic data generation framework. SMMQG leverages interplay between a retriever, a large language model (LLM), and a large multimodal model (LMM) to generate question and answer pairs directly from multimodal documents, with the questions conforming to specified styles and modalities. We use SMMQG to generate an MMRAG dataset of 1024 questions over Wikipedia documents and evaluate state-of-the-art models using it, revealing insights into model performance that are attainable only through style- and modality-specific evaluation data. Next, we measure the quality of data produced by SMMQG via a human study. We find that the quality of SMMQG-generated synthetic data is on par with the quality of the crowdsourced benchmark MMQA and that downstream evaluation results using both datasets strongly concur.
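The abstract describes the framework only at a high level: a retriever selects multimodal sources, and an LLM/LMM then writes a question-answer pair grounded in those sources under a requested style and modality. The sketch below illustrates that flow in Python; all names (`Source`, `retrieve`, `generate_qa`) and the prompt wording are hypothetical assumptions for illustration, not the authors' released implementation, and the model calls are stubbed out.

```python
# Illustrative sketch of an SMMQG-style pipeline (hypothetical names; not the
# authors' code). A retriever gathers candidate multimodal sources, then an
# LLM/LMM is prompted to produce a QA pair conforming to a requested style
# and grounded in a requested modality.

from dataclasses import dataclass
from typing import Literal

Modality = Literal["text", "table", "image"]
Style = Literal["information-seeking", "comparison", "multi-hop"]

@dataclass
class Source:
    modality: Modality
    content: str  # raw text, a linearized table, or an image caption/path

def retrieve(seed: Source, corpus: list[Source], k: int = 4) -> list[Source]:
    """Placeholder retriever: return k candidate sources related to the seed.
    A real system would use a (multimodal) dense retriever here."""
    return corpus[:k]

def generate_qa(sources: list[Source], style: Style, modality: Modality) -> tuple[str, str]:
    """Prompt a generator model to write a question that is answerable only
    from `sources`, requires the requested modality, and matches the requested
    style, then return the question with its answer. The model call is stubbed."""
    prompt = (
        f"Write a {style} question answerable only from the sources below, "
        f"requiring the {modality} source, then give the answer.\n\n"
        + "\n\n".join(f"[{s.modality}] {s.content}" for s in sources)
    )
    # question, answer = call_llm_or_lmm(prompt)  # stub: plug in your model here
    return "<question>", "<answer>"

if __name__ == "__main__":
    corpus = [
        Source("text", "Paragraph about the Eiffel Tower's construction."),
        Source("table", "Year | Visitors\n2019 | 6.2M"),
        Source("image", "Photo of the Eiffel Tower at night (caption)."),
    ]
    sources = retrieve(corpus[0], corpus)
    q, a = generate_qa(sources, style="multi-hop", modality="image")
    print(q, a)
```

The key design point the abstract emphasizes is that style and modality are explicit inputs to generation, which is what allows the resulting dataset to support style- and modality-specific evaluation.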
Authors: Ian Wu, Sravan Jayanthi, Vijay Viswanathan, Simon Rosenberg, Sina Pakazad, Tongshuang Wu, Graham Neubig