The paper "Open Eyes, Then Reason: Fine-grained Visual Mathematical Understanding in MLLMs" addresses the challenges faced by multimodal LLMs (MLLMs) in solving mathematical problems that require a nuanced comprehension of visual data. It highlights the limitations of existing MLLMs, such as GPT-4o, which exhibit a high error rate (approximately 70%) when identifying geometric primitives—a critical aspect of visual mathematical reasoning.
The paper attributes this underperformance largely to inadequate perception of geometric primitives, a consequence of image-level contrastive pre-training such as that used by CLIP. Existing attempts to close the gap typically scale up mathematical visual instruction datasets or strengthen the backbone LLM; however, these efforts often neglect persistent visual recognition errors, which correlate strongly and negatively with problem-solving performance.
To address these limitations, the paper proposes a novel approach called SVE-Math (Selective Vision-Enhanced Mathematical MLLM). This framework includes:
- A Geometric-Grounded Vision Encoder: A specialized encoder tasked with recognizing and accurately grounding geometric primitives. This targets the root cause of visual misrecognition by enhancing the model's perception capabilities.
- A Feature Router: A dynamic mechanism that weights the contribution of hierarchical visual feature maps, ensuring that the MLLM receives pertinent visual information without redundant cues. The router converts the weighted features into visual soft prompts tailored to the LLM's needs, improving the reasoning process (see the sketch after this list).
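The summary does not specify how the routing is implemented, so the following is a minimal sketch of one plausible design, assuming the hierarchical feature maps share a common token count and channel width. The class `FeatureRouter` and its parameters `vis_dim`, `llm_dim`, and `num_levels` are illustrative names, not the authors' API. The sketch predicts one soft gate per feature level, mixes the levels with those gates, and projects the result into the LLM's embedding space as soft prompts.

```python
import torch
import torch.nn as nn

class FeatureRouter(nn.Module):
    """Hypothetical feature router: gates hierarchical visual feature maps
    and projects the weighted mixture into visual soft prompts for the LLM.
    Assumes all levels share the same token count and channel width."""

    def __init__(self, vis_dim: int, llm_dim: int, num_levels: int):
        super().__init__()
        # Predicts one gating logit per feature level from pooled features.
        self.gate = nn.Linear(vis_dim * num_levels, num_levels)
        # Maps the mixed visual features into the LLM's embedding space.
        self.proj = nn.Linear(vis_dim, llm_dim)

    def forward(self, feats: list[torch.Tensor]) -> torch.Tensor:
        # feats: num_levels tensors of shape [batch, tokens, vis_dim],
        # e.g. outputs of different stages of a geometry-aware encoder.
        pooled = torch.cat([f.mean(dim=1) for f in feats], dim=-1)   # [B, L*D]
        weights = torch.softmax(self.gate(pooled), dim=-1)           # [B, L]
        stacked = torch.stack(feats, dim=1)                          # [B, L, T, D]
        mixed = (weights[:, :, None, None] * stacked).sum(dim=1)     # [B, T, D]
        return self.proj(mixed)  # visual soft prompts: [B, T, llm_dim]

# Example: three feature levels routed into 4096-dim soft prompts.
router = FeatureRouter(vis_dim=256, llm_dim=4096, num_levels=3)
feats = [torch.randn(2, 64, 256) for _ in range(3)]
prompts = router(feats)  # shape: [2, 64, 4096]
```

A softmax gate keeps the routing differentiable, so the relative weighting of feature levels can be learned jointly with the rest of the model rather than fixed by hand.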
The proposed SVE-Math model demonstrates significant improvements on mathematical visual reasoning tasks. In experiments, SVE-Math-Qwen2.5-7B, a model built on the proposed framework, outperforms other 7B-parameter models by a substantial margin (15% on the MathVerse benchmark) and achieves performance comparable to advanced models such as GPT-4V on the MathVista dataset. Despite using smaller training datasets, SVE-Math also achieves competitive results on GeoQA, approaching the performance of models trained on larger datasets.
This paper underscores the importance of integrating fine-grained visual understanding and adaptive visual cue processing into multimodal LLMs. By shifting the focus toward enhancing visual perception rather than merely scaling data or compute, it offers a promising direction for building capable visual mathematical reasoning systems. The authors provide the implementation of SVE-Math on their GitHub repository for further exploration and development by the community.