Theorem-Validated Reverse Chain-of-Thought Problem Generation for Geometric Reasoning (2410.17885v4)
Abstract: Large Multimodal Models (LMMs) face limitations in geometric reasoning due to insufficient Chain of Thought (CoT) image-text training data. While existing approaches leverage template-based or LLM-assisted methods for geometric CoT data creation, they often struggle to achieve both diversity and precision. To bridge this gap, we introduce a two-stage Theorem-Validated Reverse Chain-of-Thought Reasoning Synthesis (TR-CoT) framework. The first stage, TR-Engine, synthesizes theorem-grounded geometric diagrams with structured descriptions and properties. The second stage, TR-Reasoner, employs reverse reasoning to iteratively refine question-answer pairs by cross-validating geometric properties against description fragments. Our approach expands theorem-type coverage, corrects long-standing misunderstandings, and enhances geometric reasoning. Fine-grained CoT improves theorem understanding and increases logical consistency by 24.5%. Our best models surpass the baselines on MathVista and GeoQA by 10.1% and 4.7%, outperforming advanced closed-source models such as GPT-4o.
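The two-stage pipeline can be sketched in miniature: a TR-Engine-style record pairs a structured description with known properties, and a TR-Reasoner-style loop keeps only question-answer pairs whose answers can be traced back to the description. This is a hedged toy illustration; `Diagram`, `generate_candidates`, `reverse_validate`, and `refine` are hypothetical names for exposition, not the paper's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Diagram:
    """Toy stand-in for a TR-Engine output: a structured description
    plus the geometric properties the diagram is known to satisfy."""
    description: list  # description fragments (strings)
    properties: dict   # property name -> value (strings)

def generate_candidates(diagram):
    """Forward step: turn each known property into a QA pair."""
    return [(f"What is the {name}?", value)
            for name, value in diagram.properties.items()]

def reverse_validate(diagram, qa_pairs):
    """Reverse step: keep a QA pair only if its answer can be traced
    back to some description fragment (cross-validation)."""
    return [(q, a) for q, a in qa_pairs
            if any(a in fragment for fragment in diagram.description)]

def refine(diagram, max_rounds=3):
    """Iteratively regenerate and filter until the QA set is stable."""
    qa = generate_candidates(diagram)
    for _ in range(max_rounds):
        filtered = reverse_validate(diagram, qa)
        if filtered == qa:
            break
        qa = filtered
    return qa

d = Diagram(
    description=["triangle ABC with angle A = 60 degrees",
                 "side AB has length 5"],
    properties={"measure of angle A": "60 degrees",
                "length of AB": "5",
                "area": "10.8"},  # unsupported by the description
)
# The "area" pair cannot be traced to a description fragment,
# so the reverse-validation loop filters it out.
print(refine(d))
```

In the real framework, the cross-validation step is performed by an LLM against theorem-grounded properties rather than by substring matching, but the control flow, generate, reverse-check, and iterate, follows the same shape.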
- Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966, 2023.
- An augmented benchmark dataset for geometric question answering through dual parallel text encoding. In Proceedings of the 29th International Conference on Computational Linguistics, pp. 1511–1520, 2022.
- ALLaVA: Harnessing GPT4V-synthesized data for a lite vision-language model. arXiv preprint arXiv:2402.11684, 2024a.
- GeoQA: A geometric question answering benchmark towards multimodal numerical reasoning. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pp. 513–523, 2021.
- UniGeo: Unifying geometry logical reasoning via reformulating mathematical expression. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 3313–3323, 2022.
- ShareGPT4V: Improving large multi-modal models with better captions. CoRR, 2023.
- Premise order matters in reasoning with large language models. In Forty-first International Conference on Machine Learning, 2024b.
- How far are we to GPT-4V? Closing the gap to commercial multimodal models with open-source suites. arXiv preprint arXiv:2404.16821, 2024c.
- InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 24185–24198, 2024d.
- Muffin or chihuahua? Challenging multimodal large language models with multipanel VQA. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 6845–6863, 2024.
- G-LLaVA: Solving geometric problem with multi-modal large language model. arXiv preprint arXiv:2312.11370, 2023.
- Visual sketchpad: Sketching as a visual chain of thought for multimodal language models. arXiv preprint arXiv:2406.09403, 2024.
- Mini-Monkey: Alleviate the sawtooth effect by multi-scale adaptive cropping. arXiv preprint arXiv:2408.02034, 2024.
- GPT-4o: The cutting-edge advancement in multimodal LLM. Authorea Preprints, 2024.
- GeomVerse: A systematic evaluation of large models for geometric reasoning. arXiv preprint arXiv:2312.12241, 2023.
- SAAS: Solving ability amplification strategy for enhanced mathematical reasoning in large language models. arXiv preprint arXiv:2404.03887, 2024.
- Monkey: Image resolution and text label are important things for large multi-modal models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 26763–26773, 2024a.
- Synthesize step-by-step: Tools, templates and LLMs as data generators for reasoning-based chart VQA. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13613–13623, 2024b.
- Look before you leap: Problem elaboration prompting improves mathematical reasoning in large language models. arXiv preprint arXiv:2402.15764, 2024.
- Visual instruction tuning. Advances in neural information processing systems, 36, 2024.
- DeepSeek-VL: Towards real-world vision-language understanding. CoRR, 2024.
- Inter-GPS: Interpretable geometry problem solving with formal language and symbolic reasoning. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 6774–6786, 2021.
- MathVista: Evaluating mathematical reasoning of foundation models in visual contexts. In The 3rd Workshop on Mathematical Reasoning and AI at NeurIPS'23, 2023a.
- A survey of deep learning for mathematical reasoning. In The 61st Annual Meeting of the Association for Computational Linguistics, 2023b.
- Beyond lines and circles: Unveiling the geometric reasoning gap in large language models. arXiv preprint arXiv:2402.03877, 2024.
- OpenAI. OpenAI o1 system card. Preprint, 2024.
- Solving geometry problems: Combining text and diagram interpretation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 1466–1476, 2015.
- Math-LLaVA: Bootstrapping mathematical reasoning for multimodal large language models. arXiv preprint arXiv:2406.17294, 2024.
- Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
- Sequence to general tree: Knowledge-guided geometry word problem solving. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pp. 964–972, 2021.
- Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022.
- mPLUG-Owl2: Revolutionizing multi-modal large language model with modality collaboration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13040–13051, 2024.
- A multi-modal neural geometric solver with textual clauses parsed from diagram. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, pp. 3374–3382, 2023a.
- InternLM-XComposer: A vision-language large model for advanced text-image comprehension and composition. arXiv preprint arXiv:2309.15112, 2023b.
- MAVIS: Mathematical visual instruction tuning. arXiv preprint arXiv:2407.08739, 2024.
- Evaluating the performance of large language models on GAOKAO benchmark. arXiv preprint arXiv:2305.12474, 2023c.
- BBA: Bi-modal behavioral alignment for reasoning with large vision-language models. arXiv preprint arXiv:2402.13577, 2024a.
- Stepwise self-consistent mathematical reasoning with large language models. arXiv preprint arXiv:2402.17786, 2024b.
- Dual instruction tuning with large language models for mathematical reasoning. arXiv preprint arXiv:2403.18295, 2024.
- Paraphrase and solve: Exploring and exploiting the impact of surface form on mathematical reasoning in large language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 2793–2804, 2024.