
Evaluating Commonsense in Pre-trained Language Models (1911.11931v2)

Published 27 Nov 2019 in cs.CL and cs.AI

Abstract: Contextualized representations trained over large raw text data have given remarkable improvements for NLP tasks including question answering and reading comprehension. There have been works showing that syntactic, semantic and word sense knowledge are contained in such representations, which explains why they benefit such tasks. However, relatively little work has been done investigating commonsense knowledge contained in contextualized representations, which is crucial for human question answering and reading comprehension. We study the commonsense ability of GPT, BERT, XLNet, and RoBERTa by testing them on seven challenging benchmarks, finding that language modeling and its variants are effective objectives for promoting models' commonsense ability, while bi-directional context and a larger training set are bonuses. We additionally find that current models do poorly on tasks requiring more inference steps. Finally, we test the robustness of models by making dual test cases, which are correlated so that the correct prediction of one sample should lead to correct prediction of the other. Interestingly, the models show confusion on these test cases, which suggests that they learn commonsense at the surface rather than the deep level. We release a test set, named CATs, publicly for future research.

Authors (4)
  1. Xuhui Zhou (33 papers)
  2. Yue Zhang (620 papers)
  3. Leyang Cui (50 papers)
  4. Dandan Huang (8 papers)
Citations (174)

Summary

Evaluating Commonsense in Pre-trained Language Models

The paper "Evaluating Commonsense in Pre-trained LLMs," authored by Xuhui Zhou, Yue Zhang, Leyang Cui, and Dandan Huang, systematically examines the capacity of leading pre-trained LLMs to reason with commonsense knowledge. In particular, it provides an in-depth evaluation of models such as GPT, BERT, XLNet, and RoBERTa across a series of rigorous commonsense benchmarks to assess their ability beyond syntactic and semantic understanding, focusing on commonsense cognition.

Objectives and Methodology

The authors evaluate these models against seven commonsense-focused benchmarks: Conjunction Acceptability, Sense Making, Winograd Schema Challenge, SWAG, HellaSwag, Sense Making with Reasoning, and Argument Reasoning Comprehension. By reframing each benchmark as a sentence-scoring task, testing both word- and sentence-level reasoning, they aim to reveal the depth of commonsense understanding encoded in these contextualized representations. They also probe model robustness by carefully modifying test instances and measuring whether predictions remain consistent under the variations.
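
As a rough illustration of the sentence-scoring setup, the minimal sketch below recasts an instance as a choice between two candidate sentences and prefers the one to which a causal language model assigns the higher total log-probability. This is not the authors' released code; it assumes the HuggingFace transformers library and the public gpt2 checkpoint, and the example sentences are written in the spirit of the Sense Making task rather than drawn from CATs.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def sentence_log_likelihood(sentence: str) -> float:
    """Total log-probability of the sentence under the causal LM."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels == input_ids, the model returns the mean cross-entropy
        # over the predicted tokens; negate and rescale to a summed log-prob.
        loss = model(ids, labels=ids).loss
    return -loss.item() * (ids.size(1) - 1)

def pick_more_plausible(sent_a: str, sent_b: str) -> str:
    """Prefer the candidate sentence with the higher log-probability."""
    return sent_a if sentence_log_likelihood(sent_a) >= sentence_log_likelihood(sent_b) else sent_b

# An example in the spirit of the Sense Making benchmark (not a CATs instance):
print(pick_more_plausible("He put a turkey into the fridge.",
                          "He put an elephant into the fridge."))
```

For bi-directional models such as BERT, a sentence score can be obtained analogously from masked-token probabilities rather than a left-to-right factorization.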

Key Findings

  1. Effectiveness of Language Modeling: The research demonstrates that existing pre-trained models exceed random baselines across the majority of commonsense tasks, suggesting that language modeling objectives capture a significant amount of commonsense knowledge.
  2. Bi-directional Context Superiority: Models with bi-directional context, namely BERT, XLNet, and RoBERTa, consistently outperform uni-directional models such as GPT and GPT-2, suggesting that attending to context on both sides of a position gives bi-directional models greater representational capability for commonsense reasoning.
  3. Training Data Scale: Larger datasets confer an advantage, as evidenced by RoBERTa's superior performance. However, the benefit of extensive data is moderated by model capacity, indicating that parameter size and design are critical in leveraging data effectively for commonsense acquisition.
  4. Inference Step Challenges: Performance declines with increasing inference steps required for reasoning, indicating that pre-trained models struggle with complex logic chains, a gap delineating current systems from human-level understanding.
  5. Robustness Testing: The models exhibit limitations in maintaining consistency when test cases are minimally modified, reflecting surface-level commonsense knowledge rather than deep comprehension. This is particularly evident in tasks involving word addition, deletion, and substitution (see the consistency-check sketch after this list).
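
The dual-test-case check behind finding 5 can be sketched as follows, reusing sentence_log_likelihood from the scoring example above. This is an illustration rather than the paper's released evaluation code, and the Winograd-style pair below is an illustrative example, not an instance from the released CATs set; a pair counts as consistent only when both of its members are resolved correctly.

```python
def resolve(template: str, candidates: tuple) -> str:
    """Pick the candidate whose substitution yields the higher-scoring sentence."""
    scores = {c: sentence_log_likelihood(template.format(c)) for c in candidates}
    return max(scores, key=scores.get)

# The two templates differ by a single word, which flips the correct referent.
dual_pair = [
    ("The trophy didn't fit in the suitcase because {} was too big.",
     ("the trophy", "the suitcase"), "the trophy"),
    ("The trophy didn't fit in the suitcase because {} was too small.",
     ("the trophy", "the suitcase"), "the suitcase"),
]
# A dual pair counts as consistent only if both members are resolved correctly.
consistent = all(resolve(t, cands) == gold for t, cands, gold in dual_pair)
print("Consistent on this dual pair:", consistent)
```

Aggregating this consistency rate over all dual pairs, rather than per-instance accuracy alone, is what exposes the surface-level nature of the models' commonsense knowledge.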

Implications and Future Directions

While the paper shows that pre-trained language models possess commendable commonsense abilities, there remains a notable disparity compared to human reasoning, mainly due to challenges with multi-step inference and robustness. Future work should aim at improving model architectures and training methods to bridge this gap. Innovations may focus on more intricate modeling of context or the integration of dedicated commonsense reasoning components. Additionally, expanding the scope of training data to encompass more diverse cultural and contextual information may further enhance performance.

Overall, this paper contributes valuable insights into the commendable progress and persistent challenges of commonsense reasoning in AI models, guiding future research directions toward achieving a more robust, human-comparable understanding in computational contexts.
