Introduction
Large language models (LLMs) have shown impressive, human-like performance on complex reasoning tasks, which has prompted extensive research into their use for mathematical problem solving. Despite this success on text-based mathematical problems, problems that involve geometric information, especially those that require understanding visual elements, remain challenging for current multimodal LLMs (MLLMs).
Limitations of Current MLLMs
Existing MLLMs often fail to comprehend basic geometric elements and the relationships between them. This limitation arises partly because most MLLMs are trained on images and descriptions from general domains, which lack the specific semantics needed for geometric reasoning. To address this, researchers have developed multimodal datasets that enrich training with high-quality descriptions of geometric information. However, geometric problems pose distinct challenges, such as accurately interpreting figures and applying geometric principles, that current datasets do not fully address. To fill this gap, the Geo170K dataset was created by augmenting the largest public geometric problem dataset; it contains more than 170,000 geometric image-caption and question-answer pairs.
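To make the two kinds of pairs concrete, the following is a minimal Python sketch of how such records might be represented and separated. The class names, field names, file names, and toy records are all hypothetical and chosen for illustration; they are not the dataset's actual schema or contents.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class CaptionPair:
    """A geometric figure with a text description (hypothetical schema)."""
    image_path: str
    caption: str


@dataclass(frozen=True)
class QAPair:
    """A geometric figure with a question and its answer (hypothetical schema)."""
    image_path: str
    question: str
    answer: str


Record = Union[CaptionPair, QAPair]


def split_by_kind(records: List[Record]) -> Tuple[List[CaptionPair], List[QAPair]]:
    """Separate image-caption records from question-answer records."""
    captions = [r for r in records if isinstance(r, CaptionPair)]
    qa_pairs = [r for r in records if isinstance(r, QAPair)]
    return captions, qa_pairs


# Invented toy records, for illustration only (not taken from the dataset).
toy_records: List[Record] = [
    CaptionPair("fig_001.png", "Triangle ABC with a right angle at C."),
    QAPair("fig_001.png", "If AC = 3 and BC = 4, what is AB?", "5"),
]
captions, qa_pairs = split_by_kind(toy_records)
print(len(captions), len(qa_pairs))
```

Keeping the two record types distinct makes it straightforward to route each kind of pair to a different stage of training, if a pipeline is organized that way.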
Geo170K and G-LLaVA
Geo170K contains rich multimodal data intended to give LLMs a deeper understanding of geometry. Using Geo170K, the paper introduces G-LLaVA, a model that solves geometric problems by better comprehending figures and integrating text with visual information. G-LLaVA is built in two phases: geometric cross-modal alignment, followed by geometric instruction tuning. With only 7 billion parameters, the model significantly outperforms previous state-of-the-art MLLMs, including GPT-4V, on geometric problem solving.
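The two phases named above can be sketched as a simple training schedule. This is an illustrative outline, not the authors' implementation. It assumes that the alignment phase uses image-caption pairs and the instruction-tuning phase uses question-answer pairs. It also assumes a LLaVA-style split of trainable modules, where only a projection layer is updated first and the language model is unfrozen afterwards. The text above does not confirm either assumption, and the module names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical module names for a LLaVA-style model.
ALL_MODULES: Tuple[str, ...] = ("vision_encoder", "projector", "language_model")


@dataclass(frozen=True)
class Phase:
    name: str
    data: str                    # which kind of pairs the phase consumes (assumed)
    trainable: Tuple[str, ...]   # modules updated in this phase (assumed)


def build_schedule() -> List[Phase]:
    """Return the two phases in the order they would be run."""
    return [
        Phase(
            name="geometric cross-modal alignment",
            data="image-caption pairs",
            trainable=("projector",),  # assumption: LLaVA-style convention
        ),
        Phase(
            name="geometric instruction tuning",
            data="question-answer pairs",
            trainable=("projector", "language_model"),  # assumption
        ),
    ]


def frozen_modules(phase: Phase) -> Tuple[str, ...]:
    """Modules that stay fixed during the given phase."""
    return tuple(m for m in ALL_MODULES if m not in phase.trainable)


schedule = build_schedule()
for phase in schedule:
    print(f"{phase.name}: train {phase.trainable}, freeze {frozen_modules(phase)}")
```

In a real training script, each phase would correspond to one training run, with the frozen modules excluded from gradient updates before the optimizer is created.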
Observations and Conclusions
The paper observes that while state-of-the-art MLLMs handle everyday visual scenes adequately, they struggle with geometric figures. G-LLaVA addresses this weakness with an alignment dataset that provides basic geometric knowledge and an instruction-tuning dataset that refines problem-solving skills. Comparisons with conventional methods, across difficulty levels and question types, show that G-LLaVA is better at understanding and solving geometric problems. The work argues for the continued development of multimodal LLMs for geometric reasoning and offers insights into building models that can solve geometry problems.