What BERT is not: Lessons from a new suite of psycholinguistic diagnostics for language models (1907.13528v2)

Published 31 Jul 2019 in cs.CL and cs.AI

Abstract: Pre-training by language modeling has become a popular and successful approach to NLP tasks, but we have yet to understand exactly what linguistic capacities these pre-training processes confer upon models. In this paper we introduce a suite of diagnostics drawn from human language experiments, which allow us to ask targeted questions about the information used by language models for generating predictions in context. As a case study, we apply these diagnostics to the popular BERT model, finding that it can generally distinguish good from bad completions involving shared category or role reversal, albeit with less sensitivity than humans, and it robustly retrieves noun hypernyms, but it struggles with challenging inferences and role-based event prediction -- and in particular, it shows clear insensitivity to the contextual impacts of negation.

Citations (575)

Summary

  • The paper introduces a suite of psycholinguistic diagnostics to assess BERT’s linguistic abilities beyond syntax.
  • Findings show that BERT can differentiate completions but struggles with commonsense reasoning, pragmatic inference, and negation.
  • Results underscore the need for advanced training methods to improve context integration and robust semantic role understanding.

Overview of "What BERT is not: Lessons from a new suite of psycholinguistic diagnostics for language models"

This paper by Allyson Ettinger examines the linguistic capacities of pre-trained language models, specifically BERT, using a new suite of diagnostics derived from psycholinguistic experiments. The paper investigates the linguistic knowledge conferred on models during pre-training and seeks to uncover what information BERT uses when predicting words in context.

Diagnostic Approach

The paper introduces a set of carefully controlled diagnostic tests drawn from human psycholinguistic studies. These diagnostics are designed to evaluate models on linguistic capabilities beyond syntactic understanding, encompassing semantic roles, commonsense reasoning, pragmatic inference, category membership, and negation. Each item presents a context and examines the model's word predictions for it, comparing completions that humans judge appropriate against inappropriate ones.
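To make this cloze-style setup concrete, the sketch below scores candidate completions at a masked position using BERT's masked-language-model head. It is a minimal sketch, assuming the HuggingFace transformers library and the bert-base-uncased checkpoint (not prescribed by the paper, which used the original BERT release); the example context is modeled on the CPRAG-102 item discussed in the paper.

```python
# Cloze-style scoring sketch (assumes HuggingFace transformers + bert-base-uncased;
# the example context and candidate completions are illustrative).
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def completion_prob(context: str, completion: str) -> float:
    """Probability BERT assigns to a single-token `completion` at the [MASK] slot."""
    inputs = tokenizer(context, return_tensors="pt")
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()
    with torch.no_grad():
        logits = model(**inputs).logits  # shape: [1, seq_len, vocab_size]
    probs = torch.softmax(logits[0, mask_pos], dim=-1)
    return probs[tokenizer.convert_tokens_to_ids(completion)].item()

context = ("He caught the pass and scored another touchdown. "
           "There was nothing he enjoyed more than a good game of [MASK].")
for candidate in ("football", "baseball", "monopoly"):
    print(candidate, completion_prob(context, candidate))
```

Comparing the probability of the expected completion against inappropriate alternatives approximates the completion-sensitivity comparisons used across the diagnostics.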

Key Findings

  1. Differentiating Completions:
    • BERT shows some ability to distinguish good from bad completions involving shared category membership or role reversal, but with less sensitivity than humans exhibit.
  2. Commonsense and Pragmatic Inference:
    • On the CPRAG-102 test, BERT exhibits weaknesses in commonsense and pragmatic inference, often failing to integrate contextual information effectively when choosing completions.
  3. Role Reversal Sensitivity:
    • On the ROLE-88 test, BERT shows some sensitivity to semantic role reversals produced by swapping noun positions, but it does not reliably use role information to predict plausible events.
  4. Understanding Negation:
    • The NEG-136 diagnostic reveals a significant limitation: BERT does not appropriately handle negation, frequently missing the contextual implications of negated statements (a concrete probe is sketched after this list).
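The following sketch illustrates the negation finding by comparing BERT's top predictions for an affirmative sentence and its negated counterpart, following the "A robin is (not) a ___" pattern of the NEG-136 items. The HuggingFace transformers library and bert-base-uncased are assumptions, not the paper's exact setup.

```python
# Negation probe sketch (assumes HuggingFace transformers + bert-base-uncased;
# the sentence pair follows the pattern of the NEG-136 items).
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def top_fillers(sentence: str, k: int = 5):
    """Return the k most probable tokens for the [MASK] slot in `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()
    with torch.no_grad():
        logits = model(**inputs).logits
    top_ids = torch.topk(logits[0, mask_pos], k).indices.tolist()
    return tokenizer.convert_ids_to_tokens(top_ids)

print(top_fillers("A robin is a [MASK]."))      # the paper reports "bird"-like completions here
print(top_fillers("A robin is not a [MASK]."))  # and largely the same completions here, despite the negation
```

If the two lists look largely the same, that mirrors the paper's observation that BERT's predictions are insensitive to the truth-reversing effect of "not".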

Practical and Theoretical Implications

The findings highlight critical limitations in current language models, particularly concerning context integration and inference. These insights suggest directions for improving model training toward a deeper grasp of semantic nuance and more robust inference. The established diagnostics provide a foundation for assessing and comparing future model architectures.

Future Directions

The paper suggests expanding diagnostic evaluations to cover additional aspects of language processing. Future research could focus on refining models to improve their handling of complex linguistic phenomena such as inference, negation in context, and truth assessment.

Overall, this research underscores the necessity of rigorous diagnostic tools for developing and evaluating robust language models, and it serves as a basis for navigating future developments in NLP and AI.
