
LANS: A Layout-Aware Neural Solver for Plane Geometry Problem (2311.16476v2)

Published 25 Nov 2023 in cs.CV and cs.AI

Abstract: Geometry problem solving (GPS) is a challenging mathematical reasoning task requiring multi-modal understanding, fusion, and reasoning. Existing neural solvers treat GPS as a vision-language task but fall short in representing geometry diagrams, which carry rich and complex layout information. In this paper, we propose a layout-aware neural solver named LANS, integrated with two new modules: a multimodal layout-aware pre-trained language module (MLA-PLM) and layout-aware fusion attention (LA-FA). MLA-PLM adopts structural-semantic pre-training (SSP) to model global relationships, and point-match pre-training (PMP) to align visual points with textual points. LA-FA employs a layout-aware attention mask to realize point-guided cross-modal fusion, further boosting the layout awareness of LANS. Extensive experiments on the Geometry3K and PGPS9K datasets validate the effectiveness of the layout-aware modules and the superior problem-solving performance of our LANS solver over existing symbolic and neural solvers. The code will be made publicly available soon.
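
The abstract does not spell out how the layout-aware attention mask is built, so the following is only a generic illustration of mask-restricted cross-modal attention, not the LA-FA module itself. It assumes, purely for the sake of the sketch, that each diagram patch may attend only to the point tokens whose coordinates fall inside that patch; the helper names (`point_in_box`, `build_mask`, `masked_softmax`) and the toy coordinates are hypothetical.

```python
import math

def point_in_box(point, box):
    """Return True if (x, y) lies inside the half-open box (x0, y0, x1, y1)."""
    x, y = point
    x0, y0, x1, y1 = box
    return x0 <= x < x1 and y0 <= y < y1

def build_mask(patch_boxes, points):
    """mask[i][j] is True when point j lies inside patch i (assumed rule)."""
    return [[point_in_box(p, b) for p in points] for b in patch_boxes]

def masked_softmax(scores, mask):
    """Row-wise softmax that gives zero weight to masked-out positions."""
    weights = []
    for row_scores, row_mask in zip(scores, mask):
        allowed = [s for s, m in zip(row_scores, row_mask) if m]
        if not allowed:
            # A patch containing no point attends to nothing.
            weights.append([0.0] * len(row_scores))
            continue
        peak = max(allowed)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) if m else 0.0
                for s, m in zip(row_scores, row_mask)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Toy setup: a 2x2 grid of patches over a 2x2 diagram and three points.
patch_boxes = [(0, 0, 1, 1), (1, 0, 2, 1), (0, 1, 1, 2), (1, 1, 2, 2)]
points = [(0.5, 0.5), (1.5, 0.2), (1.2, 0.8)]

mask = build_mask(patch_boxes, points)
scores = [[0.0] * len(points) for _ in patch_boxes]  # uniform raw scores
weights = masked_softmax(scores, mask)
```

With uniform scores, the second patch splits its attention evenly between the two points it contains, while patches that contain no point receive all-zero weights. In a real model the raw scores would come from query-key products rather than constants.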
