Analysis of LLMs on Negation Benchmarks
Negation, a fundamental linguistic construct, remains a major hurdle for neural language models, especially masked language models such as BERT. However, the current generation of large language models (LLMs), typified by auto-regressive architectures such as GPT-3 and its derivatives, has not yet been comprehensively evaluated on its ability to process negation. This paper provides a detailed analysis of a range of LLMs, examining their performance on a suite of negation-focused benchmarks. The research seeks to ascertain whether larger models have overcome the weaknesses in handling negation observed in their predecessors.
Key Findings
- Increased Insensitivity with Size: The analysis reveals an inverse correlation between model size and sensitivity to negation. Larger models struggle more with negation than smaller ones, as evidenced by poorer performance on negated cloze tasks such as MKR-NQ, where their predictions tend to ignore the negation and frequently yield incorrect completions (see the probing sketch after this list).
- Failure to Capture Lexical Semantics: On tasks designed to test the models' grasp of antonym and synonym relationships, such as the MWR dataset, LLMs exhibit near-random performance. This suggests a lack of understanding of basic lexical negation relations, a critical competency for nuanced language interpretation.
- Reasoning Discrepancies under Negation: LLMs exhibit significant limitations in reasoning under negation. Across several natural language inference (NLI) benchmarks where negation plays a pivotal role, models perform worse than random guessing, highlighting an inability to handle negation effectively.
- Instruction Fine-Tuning Shows Promise: Notably, models like InstructGPT, which incorporate instruction fine-tuning, display improved performance on negation benchmarks, suggesting that augmenting training paradigms can enhance model capabilities in this domain. This observation underscores the potential of instruction fine-tuning over merely scaling models up.
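To make the cloze-style probing concrete, the following is a minimal sketch, not the paper's exact protocol or prompts, of comparing a model's top completions for an affirmative prompt and its negated counterpart. GPT-2 is used purely as a lightweight stand-in for the larger models the paper evaluates, and the example sentences are illustrative rather than drawn from MKR-NQ.

```python
# Minimal sketch: compare a causal LM's top next-token completions for an
# affirmative prompt and its negated counterpart. GPT-2 is a stand-in model;
# the prompts are illustrative, not taken from MKR-NQ.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for GPT-Neo / OPT / GPT-3-class models
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def top_completions(prompt: str, k: int = 5):
    """Return the k most probable next tokens and their probabilities."""
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        next_token_logits = model(input_ids).logits[0, -1]
    probs = torch.softmax(next_token_logits, dim=-1)
    top = torch.topk(probs, k)
    return [(tokenizer.decode(int(i)).strip(), float(p))
            for i, p in zip(top.indices, top.values)]

# A negation-sensitive model should shift its ranking between the two prompts;
# a model that ignores the "not" returns nearly identical completions for both.
print("affirmative:", top_completions("A hammer is a kind of"))
print("negated:    ", top_completions("A hammer is not a kind of"))
```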
Methodology and Experimental Setup
The experimental framework encompasses a variety of benchmark datasets specifically curated for negation. These include cloze completion tasks (e.g., MKR-NQ) and NLI tasks in which negation is crucial to the label (e.g., NegNLI, MoNLI, and NaN-NLI). The paper assesses open-source models such as GPT-Neo and OPT alongside commercial models including GPT-3 and InstructGPT. Prompt-based evaluation is employed, with prompt variations explored to prime models to attend to negation. Performance metrics include the Weighted Hit Rate for cloze tasks and accuracy for classification tasks.
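As an illustration of the prompt-based classification setup, here is a minimal sketch of scoring label words for an NLI pair and computing accuracy. The prompt template, the True/False/Neither verbalizers, and the two example pairs are assumptions for illustration, not the paper's exact prompts or benchmark items, and GPT-2 again stands in for the evaluated models.

```python
# Minimal sketch of prompt-based NLI classification: score each label word as a
# continuation of a simple prompt and predict the most likely one. The template,
# verbalizers, and examples are assumptions, not the paper's exact setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for the models evaluated in the paper
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

LABELS = ["True", "False", "Neither"]  # entailment / contradiction / neutral verbalizers

def label_score(premise: str, hypothesis: str, label: str) -> float:
    """Log-probability of the label's first token as the next token after the prompt."""
    prompt = f"{premise} Question: {hypothesis} True, False, or Neither? Answer:"
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    label_id = tokenizer(" " + label).input_ids[0]  # first token of the verbalizer
    with torch.no_grad():
        next_token_logits = model(input_ids).logits[0, -1]
    return torch.log_softmax(next_token_logits, dim=-1)[label_id].item()

def predict(premise: str, hypothesis: str) -> str:
    return max(LABELS, key=lambda label: label_score(premise, hypothesis, label))

# Toy negation-focused items (invented for illustration); the benchmarks supply many such pairs.
examples = [
    ("The dog is not barking.", "The dog is barking.", "False"),
    ("She did not fail the exam.", "She passed the exam.", "True"),
]
correct = sum(predict(p, h) == gold for p, h, gold in examples)
print(f"accuracy: {correct / len(examples):.2f}")
```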
Implications and Future Directions
The findings of this research have several implications for both practical applications and theoretical understanding of LLMs:
- Practical Application: The limitations of current LLMs in processing negation could lead to errors in applications requiring precise language understanding, such as dialogue systems and automated reasoning tools. Addressing these issues could substantially boost the reliability of such systems.
- Theoretical Development: From a theoretical standpoint, these results highlight gaps in the current architectures' linguistic understanding, indicating the necessity for diversified training objectives or perhaps novel model architectures that inherently accommodate linguistic nuances like negation.
- Instruction Fine-Tuning: As instruction-tuning has shown substantial improvements, future work should further explore instruction fine-tuning strategies, potentially in combination with reinforcement learning techniques, to align LLM behavior more closely with complex reasoning tasks.
Conclusion
This paper emphasizes that despite advancements in LLM architecture and training, negation remains a persistent challenge. While model scaling alone does not address the issue, the paper illustrates the promising role of instruction fine-tuning in bridging current gaps. Going forward, the research community is encouraged to explore neglected language phenomena and to innovate on training processes to better equip models for robust language understanding.