PIQA: Reasoning about Physical Commonsense in Natural Language (1911.11641v1)

Published 26 Nov 2019 in cs.CL, cs.AI, and cs.LG

Abstract: To apply eyeshadow without a brush, should I use a cotton swab or a toothpick? Questions requiring this kind of physical commonsense pose a challenge to today's natural language understanding systems. While recent pretrained models (such as BERT) have made progress on question answering over more abstract domains - such as news articles and encyclopedia entries, where text is plentiful - in more physical domains, text is inherently limited due to reporting bias. Can AI systems learn to reliably answer physical common-sense questions without experiencing the physical world? In this paper, we introduce the task of physical commonsense reasoning and a corresponding benchmark dataset Physical Interaction: Question Answering or PIQA. Though humans find the dataset easy (95% accuracy), large pretrained models struggle (77%). We provide analysis about the dimensions of knowledge that existing models lack, which offers significant opportunities for future research.

PIQA: Reasoning about Physical Commonsense in Natural Language

The paper "PIQA: Reasoning about Physical Commonsense in Natural Language" by Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi introduces a novel benchmark designed to evaluate the physical commonsense reasoning capabilities of natural LLMs (NLMs). This benchmark, known as Physical Interaction: Question Answering (PIQA), focuses on scenarios where understanding physical properties and affordances of everyday objects is crucial. Unlike traditional question-answering (QA) tasks which lean heavily on abstract or encyclopedic knowledge, PIQA emphasizes more tangible, real-world interactions for which LLM training data is typically sparse.

Benchmark Description

PIQA tasks involve multiple-choice questions where one must select the more appropriate solution to a given physical scenario. For example, given the task of piercing ears, the more reasonable solution involves going to a professional rather than using a half-inch-thick needle. The dataset comprises over 16,000 training pairs, along with separate validation and test sets, each pairing a goal with candidate solutions inspired by instructables.com. The benchmark minimizes stylistic and lexical biases using the AFLite algorithm, ensuring the focus remains on genuinely understanding physical commonsense rather than exploiting spurious patterns in language.
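
For readers who want to inspect the data directly, the minimal sketch below loads PIQA and prints one goal with its two candidate solutions. It assumes the dataset is mirrored on the Hugging Face hub under the id "piqa" and uses the goal/sol1/sol2/label field names from the publicly distributed JSONL files; those names come from that release rather than from the paper itself.

```python
# Minimal sketch for inspecting PIQA, assuming the Hugging Face hub id "piqa"
# and the goal/sol1/sol2/label field names of the public JSONL release
# (an assumption, not something specified in the paper).
from datasets import load_dataset

piqa = load_dataset("piqa")            # splits: train / validation / test
example = piqa["train"][0]

print(example["goal"])                 # a physical goal, e.g. an everyday task
print(example["sol1"])                 # candidate solution 1
print(example["sol2"])                 # candidate solution 2
print(example["label"])                # 0 or 1: index of the more sensible solution
```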

Key Findings

One of the most notable outcomes of the paper is the significant performance gap between large pretrained language models and human participants. Human accuracy on PIQA is reported at 95%, whereas state-of-the-art models such as BERT, GPT, and RoBERTa achieve only around 66.8%, 69.2%, and 77.1% respectively. Even the best model therefore trails human performance by roughly 18 points, highlighting the challenges NLMs face in reasoning about physical interactions purely from textual data.
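
To make the evaluation protocol concrete, the sketch below shows the standard multiple-choice formulation such baselines typically use: each (goal, solution) pair is encoded by a pretrained transformer and a classification head picks the higher-scoring pair. The specific checkpoint, the transformers API calls, and the absence of fine-tuning here are illustrative assumptions rather than the authors' exact configuration; matching the reported accuracies requires fine-tuning the head on PIQA's training split.

```python
# Hedged sketch of the usual multiple-choice setup: encode each (goal, solution)
# pair with a pretrained transformer and let a classification head pick the
# higher-scoring pair. The checkpoint and untuned head are illustrative only.
import torch
from transformers import AutoTokenizer, AutoModelForMultipleChoice

tokenizer = AutoTokenizer.from_pretrained("roberta-large")
model = AutoModelForMultipleChoice.from_pretrained("roberta-large")
model.eval()

goal = "To apply eyeshadow without a brush,"
solutions = ["use a cotton swab.", "use a toothpick."]

# Tokenize both (goal, solution) pairs, then reshape to
# (batch_size=1, num_choices=2, seq_len) as the multiple-choice head expects.
enc = tokenizer([goal] * len(solutions), solutions,
                return_tensors="pt", padding=True)
enc = {k: v.unsqueeze(0) for k, v in enc.items()}

with torch.no_grad():
    logits = model(**enc).logits       # shape: (1, num_choices)

print(solutions[logits.argmax(dim=-1).item()])
```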

Analysis and Insights

The performance disparities observed on PIQA can be partially attributed to the nuanced understanding required for physical interactions, something that pretrained models typically lack because they are trained on predominantly text-based corpora. For instance, RoBERTa, despite its robustness across many NLP tasks, struggles with simple but critical physical distinctions, such as differentiating "top" from "bottom" or judging what everyday objects and materials, such as water or basic tools, are useful for.

Further qualitative analysis indicates that models often fail in non-prototypical scenarios where the correct answers require innovative or less common uses of objects. For example, humans easily discern that a hair net is more suitable than a solid seal when trying to find something lost on a carpet, whereas models are easily misled by such subtleties.

Implications and Future Research

The discrepancies highlighted by PIQA indicate that current NLMs need to be augmented with more comprehensive contextual understanding that extends beyond textual representations to incorporate physical and interactive experiences. This calls for future strategies integrating grounded learning where models interact with and learn from the physical world, thereby acquiring a richer, more practical understanding of the interactions and affordances of objects.

The development of PIQA thus serves as a pivotal step toward creating more holistic AI systems. The key lies in advancing methods that encompass multimodal learning (combining textual and visual data), reinforcement learning, and leveraging sophisticated simulations that mimic real-world interactions.

Conclusion

The introduction of PIQA as a benchmark has established an essential framework for assessing the limits of current NLMs concerning physical commonsense reasoning. The current performance gap between state-of-the-art models and human-level understanding emphasizes the need for innovative approaches that transcend traditional text-based training paradigms. Future research, augmented by physical interaction and enhanced multimodal learning, may bridge this gap, enabling the development of AI systems with a deeper, more intuitive understanding of the physical world. The PIQA benchmark will undoubtedly play a crucial role in tracking this progress and guiding future AI research directions.
