GeomVerse: A Systematic Evaluation of Large Models for Geometric Reasoning (2312.12241v1)
Abstract: LLMs have shown impressive results for multi-hop mathematical reasoning when the input question is only textual. Many mathematical reasoning problems, however, contain both text and image. With the ever-increasing adoption of vision-language models (VLMs), understanding their reasoning abilities for such problems is crucial. In this paper, we evaluate the reasoning capabilities of VLMs along various axes through the lens of geometry problems. We procedurally create a synthetic dataset of geometry questions with controllable difficulty levels along multiple axes, thus enabling a systematic evaluation. The empirical results obtained using our benchmark for state-of-the-art VLMs indicate that these models are not as capable in subjects like geometry (and, by generalization, other topics requiring similar reasoning) as suggested by previous benchmarks. This is made especially clear by the construction of our benchmark at various depth levels, since solving higher-depth problems requires long chains of reasoning rather than additional memorized knowledge. We release the dataset for further research in this area.
- Mehran Kazemi
- Hamidreza Alvari
- Ankit Anand
- Jialin Wu
- Xi Chen
- Radu Soricut