G-LLaVA: Solving Geometric Problem with Multi-Modal Large Language Model (2312.11370v1)

Published 18 Dec 2023 in cs.CL

Abstract: LLMs have shown remarkable proficiency in human-level reasoning and generation, which has encouraged extensive research on their application to mathematical problem solving. However, current work has largely focused on text-based mathematical problems, with limited investigation of problems involving geometric information. Addressing this gap, we aim to enable LLMs to solve geometric problems by understanding image input. We first analyze the limitations of current Multimodal LLMs (MLLMs) in this area: they struggle to accurately comprehend basic geometric elements and their relationships. To overcome these challenges, we take advantage of the unique characteristics of geometric problems (such as unique geometric logical forms and geometric scalability) and the capacity of textual LLMs to build an enriched multimodal geometry dataset based on existing data. The augmented dataset, Geo170K, contains more than 170K geometric image-caption and question-answer pairs. Utilizing our constructed Geo170K dataset, we develop G-LLaVA, which demonstrates exceptional performance in solving geometric problems, significantly outperforming GPT-4V on the MathVista benchmark with only 7B parameters.

Introduction

LLMs have shown impressive, human-like performance on complex reasoning tasks, which has led to extensive research on their application to mathematical problem solving. Despite their success on text-based mathematical problems, handling problems that involve geometric information, especially those requiring an understanding of visual elements, remains challenging for current Multimodal LLMs (MLLMs).

Limitations of Current MLLMs

Existing MLLMs often fall short in comprehending basic geometric elements and their relationships. This limitation arises partly because most MLLMs are trained on images and descriptions from general domains rather than on the specific semantics needed for geometric reasoning. To address this, researchers have built multimodal datasets that enrich training with high-quality descriptions of geometric information. However, geometric problems pose unique challenges, such as accurately interpreting figures and applying geometric principles, which current datasets do not fully address. Recognizing this, the authors created the Geo170K dataset, which augments the largest public geometric problem dataset to more than 170,000 geometric image-caption and question-answer pairs.
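As a rough illustration of this augmentation strategy, the sketch below shows how a text-only LLM could expand seed geometry problems into image-caption pairs and extra question-answer pairs. The prompt templates, record fields (annotation, question, solution, image), the query_llm helper, and the file names are hypothetical placeholders, not the paper's actual prompts or pipeline.

```python
# Hypothetical sketch of LLM-based augmentation over an existing geometry QA
# dataset. A text-only LLM rewrites existing annotations into captions (for
# alignment) and generates additional QA pairs (for instruction tuning).
import json

CAPTION_PROMPT = (
    "You are given the textual annotation of a geometry diagram.\n"
    "Annotation: {annotation}\n"
    "Write a precise caption describing the geometric elements "
    "(points, segments, angles, circles) and their relationships."
)

QA_PROMPT = (
    "Given this geometry problem and its solution, generate a new "
    "question-answer pair about a related quantity or with rephrased values.\n"
    "Problem: {question}\nSolution: {solution}"
)


def query_llm(prompt: str) -> str:
    """Placeholder for a call to a text-only LLM."""
    raise NotImplementedError


def augment_record(record: dict) -> dict:
    """Expand one seed problem into alignment and instruction-tuning data."""
    caption = query_llm(CAPTION_PROMPT.format(annotation=record["annotation"]))
    new_qa = query_llm(QA_PROMPT.format(question=record["question"],
                                        solution=record["solution"]))
    return {
        "image": record["image"],   # reuse the original diagram
        "caption": caption,         # image-caption pair for alignment
        "generated_qa": new_qa,     # new QA pair for instruction tuning
    }


if __name__ == "__main__":
    with open("geoqa_seed.json") as f:          # hypothetical seed file
        records = json.load(f)
    augmented = [augment_record(r) for r in records]
    with open("geo_augmented.json", "w") as f:
        json.dump(augmented, f, indent=2)
```

Because each generated caption and QA pair is anchored to an existing diagram and its known solution, this kind of augmentation can scale the dataset without requiring any new figures.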

Geo170K and G-LLaVA

Geo170K provides rich multimodal data aimed at equipping LLMs with a deeper understanding of geometry. Using Geo170K, the paper introduces G-LLaVA, a model developed to solve geometric problems by better comprehending images and integrating text with visual information. G-LLaVA is trained in two phases: geometric cross-modal alignment followed by geometric instruction tuning. With only 7 billion parameters, the model significantly surpasses previous state-of-the-art MLLMs, including GPT-4V.
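A minimal sketch of such a two-phase schedule is shown below, assuming a LLaVA-style model that exposes mm_projector and language_model submodules and returns a next-token loss when called on a batch. These module names, batch sizes, and learning rates are illustrative assumptions, not the paper's exact configuration.

```python
# Two-phase schedule: (1) align the vision projector on image-caption pairs
# while the rest of the model is frozen; (2) instruction-tune the LLM on
# geometric QA data. Attribute names and hyperparameters are assumptions.
import torch
from torch.utils.data import DataLoader


def train_epoch(model, loader, trainable_params, lr):
    optimizer = torch.optim.AdamW(trainable_params, lr=lr)
    for batch in loader:
        loss = model(**batch).loss   # language-modeling loss over the answer
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()


def train_g_llava_style(model, caption_data, qa_data):
    # Phase 1: geometric cross-modal alignment.
    # Freeze the vision encoder and the LLM; only the projection layer learns
    # to map geometric diagram features into the LLM's embedding space.
    for p in model.parameters():
        p.requires_grad = False
    for p in model.mm_projector.parameters():
        p.requires_grad = True
    train_epoch(model, DataLoader(caption_data, batch_size=32, shuffle=True),
                list(model.mm_projector.parameters()), lr=1e-3)

    # Phase 2: geometric instruction tuning.
    # Unfreeze the LLM (the projector stays trainable) and fit on QA pairs.
    for p in model.language_model.parameters():
        p.requires_grad = True
    trainable = [p for p in model.parameters() if p.requires_grad]
    train_epoch(model, DataLoader(qa_data, batch_size=16, shuffle=True),
                trainable, lr=2e-5)
```

Freezing everything except the projector in the first phase lets the alignment data do one job, teaching the model to describe geometric elements, before the second phase spends capacity on full problem solving.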

Observations and Conclusions

Observations highlight that while state-of-the-art MLLMs handle everyday visual scenes adequately, they struggle with geometric figures. G-LLaVA addresses these issues by using an alignment dataset that provides basic geometric knowledge and an instruction-tuning dataset that refines problem-solving skills. Comparisons with conventional methods, across difficulty levels and question types, demonstrate G-LLaVA's superior ability to understand and solve geometric problems. The work argues for continued development of multimodal LLMs for geometric reasoning and offers insights into building models that can solve geometry problems reliably.

Authors (11)
  1. Jiahui Gao (25 papers)
  2. Renjie Pi (37 papers)
  3. Jipeng Zhang (46 papers)
  4. Jiacheng Ye (21 papers)
  5. Wanjun Zhong (49 papers)
  6. Yufei Wang (141 papers)
  7. Lanqing Hong (72 papers)
  8. Jianhua Han (49 papers)
  9. Hang Xu (204 papers)
  10. Zhenguo Li (195 papers)
  11. Lingpeng Kong (134 papers)