
Surface Form Competition: Why the Highest Probability Answer Isn't Always Right (2104.08315v9)

Published 16 Apr 2021 in cs.CL

Abstract: LLMs have shown promising results in zero-shot settings (Brown et al., 2020; Radford et al., 2019). For example, they can perform multiple choice tasks simply by conditioning on a question and selecting the answer with the highest probability. However, ranking by string probability can be problematic due to surface form competition, wherein different surface forms compete for probability mass even if they represent the same underlying concept, e.g. "computer" and "PC." Since probability mass is finite, this lowers the probability of the correct answer, due to competition from other strings that are valid answers (but not one of the multiple choice options). We introduce Domain Conditional Pointwise Mutual Information, an alternative scoring function that directly compensates for surface form competition by simply reweighing each option according to a term that is proportional to its a priori likelihood within the context of the specific zero-shot task. It achieves consistent gains in zero-shot performance over both calibrated (Zhao et al., 2021) and uncalibrated scoring functions on all GPT-2 and GPT-3 models over a variety of multiple choice datasets.

Citations (215)

Summary

  • The paper identifies surface form competition: valid answers are downweighted because probability mass is split across synonymous surface forms of the same underlying concept.
  • It introduces Domain Conditional Pointwise Mutual Information (PMI_DC), a scoring function that normalizes each option's probability by a domain-conditional prior, better reflecting contextual relevance.
  • Empirical evaluations show that PMI_DC improves zero-shot multiple-choice accuracy across GPT-2 and GPT-3 model sizes without any additional training.

Exploring Surface Form Competition and Domain Conditional Pointwise Mutual Information

The paper by Holtzman et al. examines why zero-shot prediction in LLMs degrades when answers are ranked by raw string probability, identifying the competitive dynamics of surface forms as the cause and introducing Domain Conditional Pointwise Mutual Information (PMI_DC) as an improved scoring methodology.

Surface Form Competition

This research highlights a critical drawback in the conventional probability-based scoring used with generative models like GPT-2 and GPT-3 on multiple-choice tasks: surface form competition. Because an LLM must allocate probability mass across all possible surface forms of a given concept, the mass for any single concept is fragmented across its paraphrases. As a result, plausible answers are downweighted simply because of lexical variation such as synonymy or minor phrasing differences. The paper illustrates this with examples from datasets like CommonsenseQA, where an option like "Whirlpool bath" can be overshadowed by a synonym such as "Bathtub" purely because the latter string occurs more frequently in the pretraining data; the bound below makes the effect precise.
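Since the model's conditional distribution over strings sums to one, any probability assigned to valid paraphrases that are not listed as options is probability taken away from the listed option. The following is a minimal rendering of this argument, with illustrative synonym strings rather than the paper's exact examples:

```latex
% The LM's conditional distribution over all strings y sums to one, so mass
% spent on unlisted paraphrases upper-bounds the listed option's raw score:
\sum_{y \in V^{*}} P(y \mid x) = 1
\quad\Longrightarrow\quad
P(\text{``bathtub''} \mid x) \;\le\; 1 - P(\text{``tub''} \mid x) - P(\text{``hot tub''} \mid x)
```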

Introduction of Domain Conditional PMI

To counteract the skew caused by surface form competition, the authors propose Domain Conditional Pointwise Mutual Information (PMI_DC). Traditional scoring ranks options by raw conditional likelihood, so rare surface forms are penalized regardless of how well they fit the question. PMI_DC instead normalizes each option's probability given the question by its probability given only a short, task-specific domain premise. The resulting score measures how much the question itself raises the model's belief in an option, rather than how common that string is a priori, disambiguating semantic intent from lexical peculiarities; a sketch follows below.
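Concretely, the score for an option y is log P(y | x) − log P(y | x_domain), where x is the question context and x_domain is a short domain premise. Below is a minimal sketch of this scoring rule using GPT-2 via the HuggingFace transformers library; the prompt format, domain premise string, and answer options are illustrative assumptions, not the paper's exact experimental setup.

```python
# Minimal sketch of Domain Conditional PMI scoring with GPT-2 via the
# HuggingFace transformers library. The prompt format, domain premise,
# and options below are illustrative assumptions, not the paper's setup.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def continuation_log_prob(prefix: str, continuation: str) -> float:
    """Sum of token log-probabilities of `continuation` given `prefix`."""
    prefix_ids = tokenizer.encode(prefix)
    cont_ids = tokenizer.encode(continuation)
    input_ids = torch.tensor([prefix_ids + cont_ids])
    with torch.no_grad():
        logits = model(input_ids).logits  # (1, seq_len, vocab)
    log_probs = torch.log_softmax(logits, dim=-1)
    # The token at absolute position p is predicted by logits at p - 1.
    total = 0.0
    for i, tok in enumerate(cont_ids):
        total += log_probs[0, len(prefix_ids) + i - 1, tok].item()
    return total

def pmi_dc(question: str, option: str, domain_premise: str) -> float:
    # log P(y | x) - log P(y | x_domain): how much the question raises
    # the option's probability over its domain-conditional prior.
    return (continuation_log_prob(question, option)
            - continuation_log_prob(domain_premise, option))

question = "question: Where would a person typically soak to relax? answer:"
domain_premise = "answer:"  # short domain string standing in for x_domain
options = [" a whirlpool bath", " a bathtub", " a library"]

ranked = sorted(options, key=lambda y: pmi_dc(question, y, domain_premise),
                reverse=True)
print(ranked)
```

Because both terms come from the same frozen model, the correction needs no training; it only adds one extra scoring pass per option, conditioned on the domain premise instead of the full question.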

Empirical Evaluation

The paper evaluates on a range of multiple-choice datasets, showing consistent gains for PMI_DC across model sizes when compared against standard likelihood scoring, both calibrated and uncalibrated. Notably, PMI_DC delivers substantial zero-shot improvements on tasks prone to surface form competition without any fine-tuning or modification of the underlying model.

Implications and Future Directions

The findings carry implications for zero-shot and few-shot learning tasks, especially in domains where textual surface variation abounds. Domain Conditional PMI challenges current paradigms of response generation and scoring, pointing toward a more nuanced approach that judges semantic content consistently despite surface variation. In practical applications, this opens prospects for more robust and adaptable AI systems across diverse informational contexts.

Theoretical contributions include highlighting surface form competition as a context-independent hindrance within LLMs, alongside proposing a scoring framework that operates within the constraints of current computational capabilities.

Future work could explore further refinements of PMI, integrating adaptive domain sensitivity measures that dynamically adjust based on in-domain training or inference cues, potentially extending its application to broader generative tasks beyond multiple-choice queries. Additionally, this research may stimulate new model design considerations where surface form distributions are more explicitly normalized during pretraining, fostering more contextually aware language processing capabilities.
