
Synthetic Multimodal Question Generation (2407.02233v2)

Published 2 Jul 2024 in cs.CL, cs.AI, and cs.LG

Abstract: Multimodal Retrieval Augmented Generation (MMRAG) is a powerful approach to question-answering over multimodal documents. A key challenge with evaluating MMRAG is the paucity of high-quality datasets matching the question styles and modalities of interest. In light of this, we propose SMMQG, a synthetic data generation framework. SMMQG leverages interplay between a retriever, LLM and large multimodal model (LMM) to generate question and answer pairs directly from multimodal documents, with the questions conforming to specified styles and modalities. We use SMMQG to generate an MMRAG dataset of 1024 questions over Wikipedia documents and evaluate state-of-the-art models using it, revealing insights into model performance that are attainable only through style- and modality-specific evaluation data. Next, we measure the quality of data produced by SMMQG via a human study. We find that the quality of SMMQG-generated synthetic data is on par with the quality of the crowdsourced benchmark MMQA and that downstream evaluation results using both datasets strongly concur.

Authors (7)
  1. Ian Wu (6 papers)
  2. Sravan Jayanthi (4 papers)
  3. Vijay Viswanathan (14 papers)
  4. Simon Rosenberg (2 papers)
  5. Sina Pakazad (3 papers)
  6. Tongshuang Wu (53 papers)
  7. Graham Neubig (342 papers)
Citations (2)

Summary

Synthetic Multimodal Question Generation

The paper "Synthetic Multimodal Question Generation" introduces an innovative framework called SMMQG, designed to address the limitations of current Multimodal Retrieval Augmented Generation (MMRAG) systems for question-answering tasks. The SMMQG framework aims to generate high-quality, style- and modality-specific question-answer pairs from multimodal documents, thereby enabling more nuanced evaluation and benchmarking of retrieval and QA models.

Key Contributions

  1. Synthetic Data Generation Framework: SMMQG leverages the interplay between a retriever, an LLM, and a large multimodal model (LMM) to generate questions and answers grounded in multimodal sources. The framework allows precise control over question styles and modalities, addressing a critical gap left by existing evaluation datasets, which are typically fixed and non-configurable (a minimal sketch of this pipeline follows the list).
  2. Dataset Creation: Using SMMQG, the authors generated a dataset of 1024 questions over Wikipedia documents. The dataset spans several question styles—information extraction, compare-contrast, numerical, compound, and multi-hop—and the text, table, and image modalities.
  3. Evaluation Metrics: The paper demonstrates SMMQG's utility by evaluating state-of-the-art retrievers (such as BM25, E5-Large, and OpenCLIP) and QA models on the generated dataset. Retrieval recall and QA performance are evaluated separately, surfacing insights that are attainable only with style- and modality-specific evaluation data.
  4. Human Study and Concurrence Measurement: A human study compared the quality of the SMMQG-generated dataset with the popular crowdsourced benchmark MMQA. The results showed that SMMQG's questions are statistically significantly more fluent and answerable, while maintaining high correctness. Additionally, the concurrence analysis showed that downstream evaluation results on SMMQG and MMQA strongly agree, validating the utility of the synthetic dataset for model evaluation.
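
The following is a minimal sketch of how such a retriever/LLM/LMM interplay could be wired together, assuming simple `retriever.retrieve`, `lmm.describe`, and `llm.generate_qa` interfaces; these names and the prompt wording are illustrative placeholders, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Source:
    modality: str  # "text", "table", or "image"
    content: str   # raw text, a linearized table, or an image path

def generate_qa_pair(seed: Source, retriever, llm, lmm,
                     style: str, modalities: List[str]) -> Tuple[str, str, List[Source]]:
    """One style- and modality-controlled question generation step (sketch)."""
    # 1. Retrieve candidate supporting sources related to the seed and keep
    #    only those in the requested modalities.
    candidates = retriever.retrieve(seed, top_k=5)
    sources = [seed] + [s for s in candidates if s.modality in modalities]

    # 2. Verbalize image sources with the LMM so the LLM can reason over them.
    verbalized = [
        lmm.describe(s.content) if s.modality == "image" else s.content
        for s in sources
    ]

    # 3. Ask the LLM for a question of the requested style grounded in the
    #    selected sources, together with its answer.
    prompt = (
        f"Write a {style} question that can only be answered using the "
        "sources below, then give its answer.\n\n" + "\n\n".join(verbalized)
    )
    question, answer = llm.generate_qa(prompt)
    return question, answer, sources
```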

Implications

  • Practical Applications:

SMMQG's ability to generate high-quality, style-specific, and modality-specific questions enables more robust and detailed evaluation of MMRAG systems (for instance, the per-style and per-modality recall breakdown sketched after this list). This can significantly enhance the development and deployment of QA systems in real-world applications where multimodal content is prevalent.

  • Theoretical Contributions:

The paper contributes to the growing field of synthetic data generation by introducing methodologies that ensure the generated data's relevance and adherence to specified question styles and modalities. This opens new avenues for research in areas requiring specialized and high-quality synthetic datasets.
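
To make style- and modality-specific evaluation concrete, here is a small, hypothetical sketch of the per-style and per-modality retrieval recall breakdown referenced above; the record layout and field names are assumptions, not the paper's evaluation code.

```python
from collections import defaultdict

def recall_at_k(retrieved_ids, gold_ids, k=5):
    """Fraction of gold supporting sources found among the top-k retrieved."""
    return len(set(retrieved_ids[:k]) & set(gold_ids)) / len(gold_ids)

def breakdown_recall(examples, k=5):
    """Average recall@k grouped by (question style, source modality).

    Each example is a dict with "style", "modality", a ranked
    "retrieved_ids" list, and the "gold_ids" of its supporting sources.
    """
    buckets = defaultdict(list)
    for ex in examples:
        score = recall_at_k(ex["retrieved_ids"], ex["gold_ids"], k)
        buckets[(ex["style"], ex["modality"])].append(score)
    return {key: sum(vals) / len(vals) for key, vals in buckets.items()}

# Toy usage: two questions of different styles and modalities.
examples = [
    {"style": "multi-hop", "modality": "table",
     "retrieved_ids": ["t3", "t1", "t9"], "gold_ids": ["t1", "t2"]},
    {"style": "numerical", "modality": "text",
     "retrieved_ids": ["d5", "d2"], "gold_ids": ["d2"]},
]
print(breakdown_recall(examples, k=3))  # {('multi-hop', 'table'): 0.5, ('numerical', 'text'): 1.0}
```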

Future Directions

  • Expanding Question Styles:

Future work could explore the inclusion of additional question styles that require different types of reasoning and knowledge, thereby broadening the evaluation scope of QA systems.

  • Integration with Diverse Domains:

While the current work focuses on Wikipedia documents, applying SMMQG to other domains, such as medical, legal, or scientific texts, could test its adaptability and robustness across fields.

  • End-to-End Evaluation:

The paper evaluates QA models using correctly retrieved (gold) sources. Future research could also consider end-to-end evaluation, where the retrieval component's impact on overall QA performance is assessed directly; a sketch of such a setup follows below.
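
Below is a hedged sketch of what such an end-to-end evaluation loop could look like, chaining retrieval into answer generation and scoring the final answers; the `retriever`, `qa_model`, and `judge` interfaces are hypothetical stand-ins rather than components described in the paper.

```python
def end_to_end_eval(questions, corpus, retriever, qa_model, judge, k=5):
    """Score a full retrieve-then-answer pipeline.

    Unlike evaluating QA models on gold sources, the QA model here sees only
    what the retriever returns, so retrieval errors propagate into the final
    answer score.
    """
    scores = []
    for q in questions:  # each q: {"question": str, "answer": str}
        retrieved = retriever.retrieve(q["question"], corpus, top_k=k)
        predicted = qa_model.answer(q["question"], retrieved)
        # judge.correct() could be exact match, token F1, or an LLM judge.
        scores.append(float(judge.correct(predicted, q["answer"])))
    return sum(scores) / len(scores)
```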

In conclusion, the introduction of SMMQG marks a significant step towards refined and customizable evaluation of MMRAG systems. The ability to simulate diverse and realistic question-answering scenarios is crucial for advancing the performance and applicability of multimodal QA models. The combination of fine-grained control over question styles and modalities, along with rigorous human validation, underscores the potential of SMMQG to set new standards in synthetic data generation and QA model evaluation.