- The paper introduces an Interleave Token mechanism that dynamically injects relevant visual tokens into each step of mathematical chain-of-thought reasoning.
- It presents the MINT-CoT dataset of 54K visual reasoning problems and a three-stage training strategy: text-only CoT fine-tuning, interleaved CoT supervised fine-tuning, and interleaved CoT reinforcement learning.
- Experimental results show that MINT-CoT-7B outperforms its baseline by +34.08% on MathVista, +28.78% on GeoQA, and +23.2% on MMStar, substantially improving multimodal reasoning accuracy.
Overview of MINT-CoT: Enabling Interleaved Visual Tokens in Mathematical Chain-of-Thought Reasoning
This paper presents "MINT-CoT," a framework designed to enhance the mathematical reasoning capabilities of Multimodal LLMs (MLLMs) by interleaving visual tokens into textual reasoning processes. Despite the advancements in Chain-of-Thought (CoT) reasoning strategies, applying them to mathematical contexts that involve visual information remains a complex challenge. This paper delineates the development and efficacy of MINT-CoT in addressing these challenges through the introduction of Mathematical INterleaved Tokens for Chain-of-Thought visual reasoning.
Key Insights and Contributions
- Interleave Token Mechanism: MINT-CoT introduces an Interleave Token that dynamically integrates relevant visual tokens into each step of textual reasoning. Whenever it is emitted, this token triggers selection of the visual elements pertinent to the current step, keeping the model's reasoning anchored to the mathematically relevant parts of the figure (a minimal sketch of this selection step follows this list).
- Dataset and Training Strategy: The research introduces the MINT-CoT dataset, comprising 54K visual reasoning problems in which each reasoning step is aligned with token-level visual regions. This dataset is central to training MLLMs to select relevant visual tokens adaptively. The MINT-CoT training strategy unfolds in three stages: text-only CoT fine-tuning, interleaved CoT supervised fine-tuning (SFT), and interleaved CoT reinforcement learning (RL); the second sketch below outlines this staged recipe.
- Experimental Evaluation: Extensive experiments demonstrate the effectiveness of the MINT-CoT methodology. The proposed MINT-CoT-7B model outperforms baseline models with significant improvements: +34.08% on MathVista, +28.78% on GeoQA, and +23.2% on MMStar benchmarks. These results substantiate the utility of the MINT-CoT framework in enhancing problem-solving accuracy by bridging textual and visual data.
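To make the Interleave Token mechanism concrete, here is a minimal sketch of the selection step, assuming similarity-thresholded, token-level selection; the function name `select_visual_tokens`, the cosine-similarity criterion, and the fixed threshold are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def select_visual_tokens(interleave_hidden: torch.Tensor,
                         visual_tokens: torch.Tensor,
                         threshold: float = 0.5):
    """Pick the visual tokens relevant to the current reasoning step.

    interleave_hidden: (d,)  hidden state of the emitted Interleave Token
    visual_tokens:     (n, d) projected vision-encoder tokens for the image
    Returns the selected tokens and the boolean selection mask.
    """
    # Score every visual token against the Interleave Token's hidden state.
    sims = F.cosine_similarity(visual_tokens,
                               interleave_hidden.unsqueeze(0), dim=-1)  # (n,)
    # Token-level (not bounding-box) selection: keep tokens above threshold.
    mask = sims > threshold
    return visual_tokens[mask], mask
```

During decoding, each time the model emits the Interleave Token, the selected tokens would be spliced into the context before generation resumes, so subsequent reasoning attends to them directly.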
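The three-stage training strategy can be summarized as a pipeline. The sketch below is a hypothetical outline: each stage consumes the supervision named in the paper, but the stage names and the `Stage`/`train` helpers are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    data: str       # supervision the stage consumes
    objective: str  # training signal used at this stage

# Hypothetical outline of the three-stage MINT-CoT recipe described above.
PIPELINE = [
    Stage("text_cot_ft",
          data="text-only CoT traces",
          objective="next-token cross-entropy"),
    Stage("interleaved_cot_sft",
          data="MINT-CoT steps aligned to token-level visual regions",
          objective="cross-entropy plus visual-token selection supervision"),
    Stage("interleaved_cot_rl",
          data="problems with verifiable final answers",
          objective="policy-gradient reward on answer correctness"),
]

def train(model):
    for stage in PIPELINE:
        print(f"[{stage.name}] data={stage.data!r}, objective={stage.objective!r}")
        # ... the stage-specific optimizer loop would run here ...
    return model
```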
Technical Examination of the Approach
The paper identifies why existing methods struggle with multimodal mathematical reasoning: coarse-grained selection strategies (e.g., cropping bounding boxes around figures) and the limited perceptual capacity of standard vision encoders on mathematical diagrams. By selecting and incorporating visual tokens at the granularity of individual image patches, MINT-CoT both improves reasoning accuracy and keeps each reasoning step tied to its visual context; the sketch below illustrates how a figure region maps onto token-level labels.
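As an illustration of token-level grounding, the sketch below maps a pixel region of a figure to the indices of the ViT-style patch tokens that cover it. The 448-pixel image size, 14-pixel patch size, and the helper name are assumptions for the example, not values from the paper.

```python
def region_to_token_indices(box, image_size=448, patch=14):
    """Map an image region to the vision-token indices covering it.

    box: (x0, y0, x1, y1) in pixels. A ViT-style encoder splits the image
    into an (image_size // patch) x (image_size // patch) grid with one
    visual token per cell, so relevance can be labeled per token rather
    than per coarse crop.
    """
    g = image_size // patch                         # tokens per grid side
    x0, y0, x1, y1 = (int(v) // patch for v in box)
    rows = range(max(y0, 0), min(y1, g - 1) + 1)
    cols = range(max(x0, 0), min(x1, g - 1) + 1)
    # Flatten (row, col) grid coordinates into sequence positions.
    return [r * g + c for r in rows for c in cols]
```

For example, `region_to_token_indices((70, 70, 140, 140))` returns the 36 indices of the 6 x 6 patch block covering that square, which could then serve as step-level selection labels.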
Implications and Future Directions
The introduction of MINT-CoT is notable for its implications in AI research, particularly for models that must tightly integrate and understand multimodal inputs. Practically, it can benefit fields where mathematical problems are coupled with visual information, such as educational technology and scientific computing.
The paper also sets the stage for further exploration into more sophisticated token selection mechanisms and integration methods, potentially using advanced reinforcement learning techniques. Future developments might explore deeper layers of integration at various stages of the reasoning process, leveraging more complex structural understandings that MLLMs might develop.
In conclusion, MINT-CoT presents a substantial advance in the multimodal reasoning landscape, contributing both a novel methodological framework and a valuable dataset to the field. The results chart a promising path for future work on integrating visual and textual reasoning, advancing the state of artificial intelligence in mathematically intensive domains.