
GeoQA: A Geometric Question Answering Benchmark Towards Multimodal Numerical Reasoning (2105.14517v3)

Published 30 May 2021 in cs.AI

Abstract: Automatic math problem solving has recently attracted increasing attention as a long-standing AI benchmark. In this paper, we focus on solving geometric problems, which requires a comprehensive understanding of textual descriptions, visual diagrams, and theorem knowledge. However, the existing methods were highly dependent on handcraft rules and were merely evaluated on small-scale datasets. Therefore, we propose a Geometric Question Answering dataset GeoQA, containing 4,998 geometric problems with corresponding annotated programs, which illustrate the solving process of the given problems. Compared with another publicly available dataset GeoS, GeoQA is 25 times larger, in which the program annotations can provide a practical testbed for future research on explicit and explainable numerical reasoning. Moreover, we introduce a Neural Geometric Solver (NGS) to address geometric problems by comprehensively parsing multimodal information and generating interpretable programs. We further add multiple self-supervised auxiliary tasks on NGS to enhance cross-modal semantic representation. Extensive experiments on GeoQA validate the effectiveness of our proposed NGS and auxiliary tasks. However, the results are still significantly lower than human performance, which leaves large room for future research. Our benchmark and code are released at https://github.com/chen-judge/GeoQA .

Authors (7)
  1. Jiaqi Chen (89 papers)
  2. Jianheng Tang (31 papers)
  3. Jinghui Qin (27 papers)
  4. Xiaodan Liang (319 papers)
  5. Lingbo Liu (40 papers)
  6. Eric P. Xing (192 papers)
  7. Liang Lin (319 papers)
Citations (110)

Summary

  • The paper presents GeoQA, a large dataset of 4,998 geometric problems with detailed annotations to support multimodal numerical reasoning.
  • The proposed Neural Geometric Solver (NGS) integrates text and diagram inputs using a co-attention mechanism to generate symbolic programs.
  • Empirical results demonstrate NGS's superiority over baselines, highlighting the importance of cross-modal reasoning in tackling complex geometric challenges.

An Expert Overview of the "GeoQA: A Geometric Question Answering Benchmark Towards Multimodal Numerical Reasoning" Paper

This paper proposes GeoQA, a dataset for automatic geometry problem solving, which it frames within the broader area of math word problems (MWPs). The dataset contains 4,998 geometric questions, each annotated with a program that spells out the solving process. Solving a GeoQA problem requires combining the textual description, the diagram, and theorem knowledge, which makes the dataset a testbed for multimodal numerical reasoning. It is also considerably larger than the earlier GeoS dataset.

The paper also introduces the Neural Geometric Solver (NGS), a model that parses the multimodal inputs and generates symbolic programs, so that the reasoning behind an answer can be inspected. Self-supervised auxiliary tasks are added to NGS to strengthen its cross-modal semantic representation.

Dataset and Contribution

GeoQA is 25 times larger than GeoS and introduces domain-specific programs that record the operational steps needed to solve each geometry problem. The problems are collected from real-world examinations in China, which keeps them relevant and varied. The program annotations are the dataset's distinguishing feature: they give models an explicit, checkable target for numerical reasoning rather than only a final answer. The empirical evaluations show that the task is hard, largely because a solver must reconcile several modalities to reach a coherent solution.
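To make the idea of an annotated program concrete, here is a minimal sketch of how a flat sequence of operators and operands could be executed to obtain a numeric answer. The operator names (`add`, `minus`, `half`, `double`, `hypot`) and the operand prefixes (`N_` for numbers taken from the problem text, `V_` for intermediate results, `C_` for constants) are assumptions made for this illustration and are not claimed to match the dataset's actual annotation vocabulary.

```python
import math

# Hypothetical operator table: name -> (arity, function).
# These names are illustrative only, not GeoQA's real operator set.
OPS = {
    "add": (2, lambda a, b: a + b),
    "minus": (2, lambda a, b: a - b),
    "half": (1, lambda a: a / 2),
    "double": (1, lambda a: a * 2),
    "hypot": (2, lambda a, b: math.hypot(a, b)),
}


def resolve(token, numbers, results):
    """Turn an operand token into a value.

    N_i -> i-th number from the problem text
    V_i -> i-th intermediate result
    C_x -> the literal constant x
    """
    kind, _, value = token.partition("_")
    if kind == "N":
        return numbers[int(value)]
    if kind == "V":
        return results[int(value)]
    if kind == "C":
        return float(value)
    raise ValueError(f"unknown operand token: {token!r}")


def run_program(tokens, numbers):
    """Execute operators left to right; return the last intermediate result."""
    results = []
    i = 0
    while i < len(tokens):
        arity, fn = OPS[tokens[i]]
        args = [resolve(t, numbers, results) for t in tokens[i + 1 : i + 1 + arity]]
        results.append(fn(*args))
        i += 1 + arity
    return results[-1]


# Third angle of a triangle whose other two angles are 50 and 60 degrees.
third_angle = run_program(["add", "N_0", "N_1", "minus", "C_180", "V_0"], [50, 60])
```

Because every step is explicit, a wrong answer can be traced back to the step that produced it, which is the kind of explainability the program annotations are meant to support.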

Methodology and Implementation

NGS consists of several connected modules that combine the text and diagram inputs. A text encoder and a diagram encoder produce per-modality features, and a co-attention mechanism fuses them so that each modality can draw on the other. The decoder is an LSTM that generates the solution program one token at a time, which makes the solving process interpretable.
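The fusion step can be pictured with a simplified cross-attention sketch. This is plain scaled dot-product attention applied in both directions, written with Python lists for readability; it is an assumption-laden illustration and not the co-attention module used in NGS. The function names and the toy feature vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def attend(queries, keys):
    """For each query, return a softmax-weighted average of the key vectors.

    Keys double as values here to keep the sketch short.
    """
    dim = len(keys[0])
    scale = math.sqrt(dim)
    attended = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys])
        attended.append(
            [sum(w * k[d] for w, k in zip(weights, keys)) for d in range(dim)]
        )
    return attended


def co_attention(text_feats, diagram_feats):
    """Let text tokens attend to diagram regions and vice versa."""
    text_context = attend(text_feats, diagram_feats)
    diagram_context = attend(diagram_feats, text_feats)
    return text_context, diagram_context


# Toy features: 3 text tokens and 2 diagram regions, each of dimension 2.
text = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
diagram = [[2.0, 0.0], [0.0, 2.0]]
text_ctx, diagram_ctx = co_attention(text, diagram)
```

Each text token ends up with a diagram-aware context vector and each diagram region with a text-aware one; a real model would add learned projections and feed the fused features to the decoder.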

To strengthen NGS, two auxiliary tasks, jigsaw location prediction and geometry elements prediction, are used to pretrain the diagram encoder. Both push the encoder to understand the individual components of a diagram, which in turn helps the model generate the correct sequence of operations. In addition, predicting the knowledge points relevant to each problem serves as a training objective that is optimized jointly with program generation.
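As a rough illustration of how a jigsaw-style pretraining sample might be built, the sketch below cuts a diagram into a grid, removes one cell, and uses the cell's index as the classification label. The exact formulation in the paper may differ; the function name, the zero-filled masking, and the list-of-lists image representation are all assumptions chosen to keep the example dependency-free.

```python
import random


def make_jigsaw_sample(image, grid, rng):
    """Build one (masked_image, patch, label) training triple.

    image : 2D list of pixel values, height and width divisible by `grid`
    grid  : number of cells per side
    rng   : a random.Random instance, passed in for reproducibility
    """
    height, width = len(image), len(image[0])
    if height % grid or width % grid:
        raise ValueError("image size must be divisible by grid")
    cell_h, cell_w = height // grid, width // grid

    # The label is the row-major index of the removed cell.
    label = rng.randrange(grid * grid)
    row, col = divmod(label, grid)
    top, left = row * cell_h, col * cell_w

    patch = [r[left : left + cell_w] for r in image[top : top + cell_h]]

    # Copy the image and blank out the removed cell.
    masked = [list(r) for r in image]
    for y in range(top, top + cell_h):
        for x in range(left, left + cell_w):
            masked[y][x] = 0
    return masked, patch, label


# Toy 4x4 "diagram" with distinct non-zero pixel values.
diagram = [[4 * r + c + 1 for c in range(4)] for r in range(4)]
masked, patch, label = make_jigsaw_sample(diagram, grid=2, rng=random.Random(0))
```

An encoder trained to recover `label` from `masked` and `patch` has to learn where lines, angles, and labels sit relative to each other, which is the intuition behind using such a task for diagram pretraining.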

Experimental Evaluation

In the reported experiments, NGS outperforms several baselines on GeoQA. A substantial gap to human performance remains, which the authors present as room for future research. The comparison between text-only and text-plus-diagram models further indicates that multimodal reasoning is needed to answer geometry questions effectively.

Implications and Future Directions

GeoQA and NGS have implications for AI-driven education technology. The explicit focus on multimodal reasoning fits the goal of building AI systems that can support learners, for example in digital learning environments. Beyond such applications, GeoQA as a benchmark could inform work in other domains that require similar cross-modal reasoning.

In sum, GeoQA provides a benchmark for geometric problem solving and NGS provides a first multimodal baseline for it. Together they open a line of further work on how models combine and interpret multimodal inputs in step-by-step reasoning tasks.