Multimodal ArXiv: A Dataset for Improving Scientific Comprehension of Large Vision-Language Models (2403.00231v3)
Abstract: Large vision-language models (LVLMs) excel across diverse tasks involving concrete images from natural scenes. However, their ability to interpret abstract figures, such as geometric shapes and scientific plots, remains limited due to a scarcity of training datasets in scientific domains. To fill this gap, we introduce Multimodal ArXiv, consisting of ArXivCap and ArXivQA, for enhancing LVLMs' scientific comprehension. ArXivCap is a figure-caption dataset comprising 6.4M images and 3.9M captions, sourced from 572K ArXiv papers spanning various scientific domains. Drawing from ArXivCap, we introduce ArXivQA, a question-answering dataset generated by prompting GPT-4V based on scientific figures. ArXivQA greatly enhances open-sourced LVLMs' mathematical reasoning capabilities, achieving a 10.4% absolute accuracy gain on a multimodal mathematical reasoning benchmark. Furthermore, employing ArXivCap, we devise four vision-to-text tasks for benchmarking LVLMs. Evaluation results with state-of-the-art LVLMs underscore their struggle with the nuanced semantics of academic figures, while domain-specific training yields substantial performance gains. Our error analysis uncovers misinterpretations of visual context, recognition errors, and the production of overly simplified captions by current LVLMs, shedding light on future improvements.
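The abstract describes building ArXivQA by prompting GPT-4V with scientific figures drawn from ArXivCap. Below is a minimal sketch of what one such generation call might look like, assuming the official OpenAI Python SDK; the model identifier, prompt wording, file path, and output format are illustrative assumptions, not the authors' actual pipeline (which also involves post-filtering not shown here).

```python
import base64
from openai import OpenAI  # assumes the official OpenAI Python SDK (openai>=1.0)

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def generate_qa_for_figure(image_path: str, caption: str) -> str:
    """Ask a GPT-4V-class model to draft a multiple-choice QA pair for one figure.

    The prompt below is illustrative only; the authors' exact prompt,
    answer format, and quality filtering are not reproduced here.
    """
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # placeholder for a GPT-4V-class vision model
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": (
                            "You are given a scientific figure and its caption.\n"
                            f"Caption: {caption}\n"
                            "Write one multiple-choice question (options A-D) that tests "
                            "understanding of the figure, then state the correct option "
                            "and a short rationale."
                        ),
                    },
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                    },
                ],
            }
        ],
        max_tokens=512,
    )
    return response.choices[0].message.content


# Hypothetical usage on one figure-caption pair:
# print(generate_qa_for_figure("figures/plot_001.png", "Accuracy vs. training steps."))
```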
Authors: Lei Li, Yuqi Wang, Runxin Xu, Peiyi Wang, Xiachong Feng, Lingpeng Kong, Qi Liu