NEWTON: Are Large Language Models Capable of Physical Reasoning? (2310.07018v1)

Published 10 Oct 2023 in cs.CL, cs.AI, and cs.RO

Abstract: LLMs, through their contextualized representations, have been empirically proven to encapsulate syntactic, semantic, word sense, and common-sense knowledge. However, there has been limited exploration of their physical reasoning abilities, specifically concerning the crucial attributes for comprehending everyday objects. To address this gap, we introduce NEWTON, a repository and benchmark for evaluating the physics reasoning skills of LLMs. Further, to enable domain-specific adaptation of this benchmark, we present a pipeline to enable researchers to generate a variant of this benchmark that has been customized to the objects and attributes relevant for their application. The NEWTON repository comprises a collection of 2800 object-attribute pairs, providing the foundation for generating infinite-scale assessment templates. The NEWTON benchmark consists of 160K QA questions, curated using the NEWTON repository to investigate the physical reasoning capabilities of several mainstream LLMs across foundational, explicit, and implicit reasoning tasks. Through extensive empirical analysis, our results highlight the capabilities of LLMs for physical reasoning. We find that LLMs like GPT-4 demonstrate strong reasoning capabilities in scenario-based tasks but exhibit less consistency in object-attribute reasoning compared to humans (50% vs. 84%). Furthermore, the NEWTON platform demonstrates its potential for evaluating and enhancing LLMs, paving the way for their integration into physically grounded settings, such as robotic manipulation. Project site: https://newtonreasoning.github.io
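The abstract describes a pipeline that instantiates question templates over the repository's object-attribute pairs to produce assessment items at scale. Below is a minimal sketch of how such template-based generation could work; the objects, attribute values, and template string are hypothetical placeholders for illustration, not the paper's actual data or released code.

```python
# Hypothetical sketch of NEWTON-style template instantiation.
# The objects, attributes, and template below are illustrative stand-ins,
# not the paper's actual repository contents.
import itertools

# Tiny stand-in for the repository's object-attribute annotations.
OBJECT_ATTRIBUTES = {
    "ceramic mug": {"brittleness": "high", "elasticity": "low"},
    "rubber ball": {"brittleness": "low", "elasticity": "high"},
    "steel spoon": {"brittleness": "low", "elasticity": "low"},
}

# An assumed template for a foundational object-attribute comparison task.
TEMPLATE = "Which object has higher {attribute}: {obj_a} or {obj_b}?"


def generate_questions(attribute: str):
    """Yield (question, answer) pairs for every object pairing that
    differs on the given attribute."""
    for obj_a, obj_b in itertools.combinations(OBJECT_ATTRIBUTES, 2):
        val_a = OBJECT_ATTRIBUTES[obj_a][attribute]
        val_b = OBJECT_ATTRIBUTES[obj_b][attribute]
        if val_a == val_b:
            continue  # skip ties: no unambiguous answer exists
        question = TEMPLATE.format(attribute=attribute, obj_a=obj_a, obj_b=obj_b)
        answer = obj_a if val_a == "high" else obj_b
        yield question, answer


if __name__ == "__main__":
    for q, a in generate_questions("elasticity"):
        print(f"Q: {q}\nA: {a}\n")
```

Because every compatible (template, object pair, attribute) combination yields a distinct item, even a modest repository of annotated pairs can be expanded into a very large question set, which is how a 2800-pair repository can support a 160K-question benchmark and domain-specific variants.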

Authors (4)
  1. Yi Ru Wang
  2. Jiafei Duan
  3. Dieter Fox
  4. Siddhartha Srinivasa
Citations (15)