VisAidMath: Benchmarking Visual-Aided Mathematical Reasoning (2410.22995v1)
Abstract: Although previous research on large language models (LLMs) and large multi-modal models (LMMs) has systematically explored mathematical problem-solving (MPS) in visual contexts, how these models process visual information during problem-solving remains under-analyzed. To address this gap, we present VisAidMath, a benchmark for evaluating the MPS process with respect to visual information. We follow a rigorous data curation pipeline combining automated processing and manual annotation to ensure data quality and reliability. The resulting benchmark comprises 1,200 challenging problems spanning multiple mathematical branches, visual-aid formulations, and difficulty levels, collected from diverse sources such as textbooks, examination papers, and Olympiad problems. Using this benchmark, we conduct comprehensive evaluations of ten mainstream LLMs and LMMs, highlighting deficiencies in the visual-aided reasoning process. For example, GPT-4V achieves only 45.33% accuracy on the visual-aided reasoning task, and its accuracy even drops by 2 points when it is given golden visual aids. In-depth analysis reveals that these deficiencies stem mainly from hallucination in the implicit visual reasoning process, pointing to future research directions for the visual-aided MPS process.
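The reported figures are exact-match accuracies over the benchmark's problems, compared with and without golden visual aids. Below is a minimal sketch of how such a comparison could be computed; the `normalize` helper, the field layout, and the toy answers are illustrative assumptions, not the paper's released evaluation code.

```python
# Hypothetical sketch of a VisAidMath-style accuracy comparison.
# The toy data and normalization rules below are assumptions for
# illustration; a real run would load the benchmark and model outputs.

def normalize(answer: str) -> str:
    """Canonicalize an answer string for exact-match comparison."""
    return answer.strip().lower().replace(" ", "")

def accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions that exactly match the gold answers."""
    correct = sum(
        normalize(p) == normalize(r) for p, r in zip(predictions, references)
    )
    return correct / len(references)

# Toy records standing in for model outputs on the benchmark.
gold = ["12", "sqrt(2)", "3/4"]
with_aids = ["12", "2", "3/4"]            # answers given golden visual aids
without_aids = ["12", "sqrt(2)", "3/4"]   # answers without visual aids

acc_with = accuracy(with_aids, gold)
acc_without = accuracy(without_aids, gold)
# A negative delta mirrors the paper's observation that golden visual
# aids can *lower* a model's accuracy rather than help it.
print(f"with aids: {acc_with:.2%}, without: {acc_without:.2%}, "
      f"delta: {(acc_with - acc_without) * 100:+.1f} pts")
```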