LANS: A Layout-Aware Neural Solver for Plane Geometry Problem (2311.16476v2)
Abstract: Geometry problem solving (GPS) is a challenging mathematical reasoning task requiring multi-modal understanding, fusion, and reasoning. Existing neural solvers treat GPS as a vision-language task but fall short in representing geometry diagrams, which carry rich and complex layout information. In this paper, we propose a layout-aware neural solver named LANS, which integrates two new modules: a multimodal layout-aware pre-trained language module (MLA-PLM) and layout-aware fusion attention (LA-FA). MLA-PLM adopts structural-semantic pre-training (SSP) to model global relationships and point-match pre-training (PMP) to align visual points with textual points. LA-FA employs a layout-aware attention mask to realize point-guided cross-modal fusion, further boosting the layout awareness of LANS. Extensive experiments on the Geometry3K and PGPS9K datasets validate the effectiveness of the layout-aware modules and the superior problem-solving performance of our LANS solver over existing symbolic and neural solvers. The code will be made publicly available soon.