Reframing Spatial Reasoning Evaluation in Language Models: A Real-World Simulation Benchmark for Qualitative Reasoning (2405.15064v1)
Abstract: Spatial reasoning plays a vital role in both human cognition and machine intelligence, prompting new research into the capabilities of language models (LMs) in this regard. However, existing benchmarks fall short in evaluating qualitative spatial reasoning (QSR): they typically present oversimplified scenarios or unclear natural language descriptions, hindering effective evaluation. We present a novel benchmark for assessing QSR in LMs that is grounded in realistic 3D simulation data, offering diverse room layouts with various objects and their spatial relationships. This approach provides a more detailed and context-rich narrative for spatial reasoning evaluation, diverging from traditional, toy-task-oriented scenarios. Our benchmark encompasses a broad spectrum of qualitative spatial relationships, including topological, directional, and distance relations. These are presented from different viewpoints, at varied granularities, and with varying densities of relation constraints to mimic real-world complexities. A key contribution is our logic-based consistency-checking tool, which enables the assessment of multiple plausible solutions, aligning with real-world scenarios where spatial relationships are often open to interpretation. Our evaluation of advanced LMs on this benchmark reveals their strengths and limitations in spatial reasoning: they struggle with multi-hop spatial reasoning and with interpreting descriptions that mix different views, pointing to areas for future improvement.
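To make the idea of logic-based consistency checking concrete, the sketch below shows one way such a check could work in principle for directional relations: an LM's answer is accepted if at least one placement of the objects satisfies the scene constraints together with the answer, so multiple plausible answers can all count as correct. This is an illustrative assumption, not the authors' actual tool; the names `scene`, `holds`, and `is_consistent`, the grid discretisation, and the "front = smaller y" convention are all hypothetical.

```python
# Minimal sketch (not the paper's tool) of logic-based consistency checking
# for directional relations, assuming objects are placed on a small 2D grid.
from itertools import permutations, product

# Coarse grid of candidate positions; x grows to the right, y grows away from the viewer.
CELLS = list(product(range(4), range(4)))

def holds(rel, a, b):
    """True if point a stands in directional relation `rel` to point b."""
    (ax, ay), (bx, by) = a, b
    return {
        "left_of":     ax < bx,
        "right_of":    ax > bx,
        "in_front_of": ay < by,   # assumption: "front" means smaller y
        "behind":      ay > by,
    }[rel]

def is_consistent(objects, constraints):
    """Return True if some placement of the objects satisfies every constraint.

    constraints: list of (relation, object_a, object_b) triples.
    Brute force over distinct grid cells; adequate for a handful of objects.
    """
    for cells in permutations(CELLS, len(objects)):
        pos = dict(zip(objects, cells))
        if all(holds(r, pos[a], pos[b]) for r, a, b in constraints):
            return True
    return False

# Hypothetical scene constraints extracted from a room description.
objs = ["lamp", "sofa", "table"]
scene = [("left_of", "lamp", "sofa"), ("behind", "sofa", "table")]

# A plausible answer: consistent with the scene, so it would be accepted.
print(is_consistent(objs, scene + [("left_of", "lamp", "table")]))  # True

# A contradictory answer: no placement satisfies both it and the scene.
print(is_consistent(objs, scene + [("left_of", "sofa", "lamp")]))   # False
```

In practice a qualitative spatial reasoner would use composition tables or a constraint solver rather than brute-force enumeration, but the acceptance criterion is the same: an answer is judged against satisfiability with the described scene, not against a single gold relation.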
- Fangjun Li
- David C. Hogg
- Anthony G. Cohn