G-LLaVA: Solving Geometric Problem with Multi-Modal Large Language Model (2312.11370v1)

Published 18 Dec 2023 in cs.CL

Abstract: LLMs have shown remarkable proficiency in human-level reasoning and generation capabilities, which encourages extensive research on their application in mathematical problem solving. However, current work has been largely focused on text-based mathematical problems, with limited investigation in problems involving geometric information. Addressing this gap, we aim to enable LLMs to solve geometric problems by understanding image input. We first analyze the limitations of current Multimodal LLMs (MLLMs) in this area: they struggle to accurately comprehend basic geometric elements and their relationships. To overcome these challenges, we take advantage of the unique characteristics of geometric problems (such as unique geometric logical form, and geometric scalability) and the capacity of the textual LLMs to build an enriched multimodal geometry dataset based on existing data. The augmented dataset, Geo170K, contains more than 170K geometric image-caption and question-answer pairs. Utilizing our constructed Geo170K dataset, we develop G-LLaVA, which demonstrates exceptional performance in solving geometric problems, significantly outperforming GPT-4-V on the MathVista benchmark with only 7B parameters.


Introduction

LLMs have shown impressive, human-like performance on complex reasoning tasks, prompting extensive research into their application to mathematical problem solving. Despite this success on text-based mathematical problems, problems that involve geometric information, especially those that require understanding visual elements, remain challenging for current Multimodal LLMs (MLLMs).

Limitations of Current MLLMs

Existing MLLMs often fall short in comprehending basic geometric elements and their relationships. This limitation arises partly because most MLLMs are trained on images and descriptions from general domains rather than on data carrying the specific semantics needed for geometric reasoning. Researchers have developed multimodal datasets that enrich training with high-quality descriptions of geometric information, but geometric problems pose challenges that these datasets do not fully meet, such as accurately interpreting figures and applying geometric principles. To close this gap, the authors created Geo170K, which augments the largest public geometric problem dataset and contains more than 170,000 geometric image-caption and question-answer pairs.

Geo170K and G-LLaVA

Geo170K contains rich multimodal data intended to give LLMs a deeper understanding of geometry. Using Geo170K, the paper introduces G-LLaVA, a model built to solve geometric problems by better comprehending images and integrating text with visual information. G-LLaVA is trained in two phases: geometric cross-modal alignment, followed by geometric instruction tuning. With only 7 billion parameters, the model significantly outperforms previous state-of-the-art MLLMs, including GPT-4V, on the MathVista benchmark.
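The two-phase recipe can be pictured as a small training schedule. The Python sketch below is illustrative only and is not the authors' code: the module names (`vision_encoder`, `projector`, `language_model`), the subset labels, the prompt templates, and the choice of which modules are updated in each phase are all assumptions, modeled on common LLaVA-style practice rather than taken from the summary above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    name: str         # training phase named in the paper
    data: str         # which kind of Geo170K pair feeds this phase (label is hypothetical)
    trainable: tuple  # modules updated in this phase (assumed, not confirmed by the source)


# Hypothetical module names for a LLaVA-style model.
MODULES = ("vision_encoder", "projector", "language_model")

# Assumed freezing pattern: align with the projector only, then also tune the LLM.
SCHEDULE = (
    Phase("geometric_cross_modal_alignment", "image_caption_pairs", ("projector",)),
    Phase("geometric_instruction_tuning", "question_answer_pairs",
          ("projector", "language_model")),
)


def frozen_modules(phase: Phase) -> tuple:
    """Return the modules that stay fixed during the given phase."""
    return tuple(m for m in MODULES if m not in phase.trainable)


def build_prompt(phase: Phase, sample: dict) -> str:
    """Format one sample as text; the templates are placeholders."""
    if phase.data == "image_caption_pairs":
        return f"<image>\nDescribe the geometric figure.\n{sample['caption']}"
    return f"<image>\nQuestion: {sample['question']}\nAnswer: {sample['answer']}"


if __name__ == "__main__":
    for phase in SCHEDULE:
        print(phase.name, "| frozen:", frozen_modules(phase))
```

Under these assumptions, the alignment phase consumes caption-style samples to teach the model basic geometric vocabulary, while the instruction-tuning phase consumes question-answer samples to train problem solving.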

Observations and Conclusions

The paper observes that while state-of-the-art MLLMs handle everyday visual scenes adequately, they struggle with geometric figures. G-LLaVA addresses this with an alignment dataset that provides basic geometric knowledge and an instruction-tuning dataset that refines problem-solving skills. Comparisons with conventional methods, and across difficulty levels and question types, demonstrate G-LLaVA's superior performance in understanding and solving geometric problems. The work argues for continued development of multimodal LLMs for geometric reasoning and offers insights into building models that solve geometry problems well.