- The paper introduces ConTestS, a framework that systematically evaluates the consistency of span probability estimates in language models using two-sided Wilcoxon signed-rank tests.
- It finds that while larger masked language models improve prediction consistency, larger autoregressive models exhibit increased inconsistencies.
- The study reveals that prediction entropies indicate estimation accuracy, offering actionable insights to refine decoding strategies in NLP applications.
Consistency Testing of Span Probabilities in LLMs
The paper introduces ConTestS (Consistency Testing over Spans), a framework that systematically evaluates the consistency of the probabilities large language models (LLMs) assign to word spans. The work goes beyond traditional calibration methods to scrutinize whether LLMs produce consistent probability values across different orders of conditioning and completion, a critical assessment given the widespread use of LLM scores as probability estimates in practical applications.
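To make the consistency criterion concrete: for a two-token span (w1, w2) in a context c, the chain rule gives two equivalent factorizations of the joint probability. The notation below is ours, a restatement of the chain rule rather than the paper's exact formulation; a self-consistent model should produce the same value either way, so the log-discrepancy d should be zero:

```latex
% Two chain-rule factorizations of a two-token span's joint probability
P(w_1, w_2 \mid c) = P(w_1 \mid c)\,P(w_2 \mid c, w_1)
                   = P(w_2 \mid c)\,P(w_1 \mid c, w_2)

% Log-discrepancy between the two estimates; zero for a consistent model
d = \log P(w_1 \mid c) + \log P(w_2 \mid c, w_1)
  - \log P(w_2 \mid c) - \log P(w_1 \mid c, w_2)
```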
Framework Methodology
The authors developed ConTestS to quantify the consistency of probability estimates in both masked language models (MLMs) and autoregressive models. They apply statistical tests of score consistency to real datasets released after the models' training cutoffs, as well as synthetic data, to rule out memorization effects. Key features of the framework include:
- Handling of Autoregressive Models: Accounts for the fact that autoregressive models were not explicitly designed for masked language modeling tasks.
- Statistical Analysis: Uses two-sided Wilcoxon signed-rank tests to check whether discrepancies (differences between span probability estimates obtained via different conditioning orders) are symmetrically distributed around zero, screening for significant inconsistencies (see the sketch after this list).
- Data Selection: Uses datasets that were not part of the models' training data to ensure unbiased evaluation.
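A minimal sketch of the significance test described above, assuming the per-span discrepancies (differences between the two chain-rule estimates of each span's log-probability) have already been computed; the values below are illustrative placeholders, not the paper's data:

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical per-span discrepancies: difference between two chain-rule
# estimates of the same span's log-probability (zero for a consistent model).
discrepancies = np.array([0.12, -0.05, 0.30, 0.02, -0.11, 0.08, 0.21, -0.03])

# Two-sided Wilcoxon signed-rank test: H0 = discrepancies are symmetrically
# distributed around zero, i.e. neither conditioning order is systematically
# assigned higher probability.
stat, p_value = wilcoxon(discrepancies, alternative="two-sided")
print(f"W = {stat:.1f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Significant inconsistency between the two estimation orders.")
```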
Main Findings
The investigation demonstrated that both MLMs and autoregressive models exhibit inconsistent predictions, with autoregressive models showing the more pronounced discrepancies. Interestingly, larger MLMs provide improved prediction consistency, whereas larger autoregressive models exhibit increased inconsistencies. This opposite scaling trend underscores the different underlying mechanics of the two model families.
Notably, the paper finds that prediction entropies are informative indicators of a model's estimation accuracy and can help refine decoding strategies. The implication is that higher entropy for a single-mask prediction paired with lower entropy for a double-mask prediction may yield better token predictions, as sketched below.
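A minimal sketch of the two entropies this heuristic compares, assuming access to the model's logits for one span position under two mask configurations; the variable names, vocabulary size, and random placeholder logits are our illustrative assumptions:

```python
import numpy as np

def entropy(logits: np.ndarray) -> float:
    """Shannon entropy (in nats) of the softmax distribution over the vocabulary."""
    probs = np.exp(logits - logits.max())  # numerically stable softmax
    probs /= probs.sum()
    return float(-(probs * np.log(probs + 1e-12)).sum())

rng = np.random.default_rng(0)

# Hypothetical logits for one target position under two mask configurations:
# only this position masked (single-mask) vs. both span positions masked
# (double-mask). A vocabulary size of 32,000 is assumed for illustration.
single_mask_logits = rng.standard_normal(32_000)
double_mask_logits = rng.standard_normal(32_000)

h_single = entropy(single_mask_logits)
h_double = entropy(double_mask_logits)

# Heuristic suggested by the paper's entropy analysis: positions with high
# single-mask entropy but low double-mask entropy tend to be predicted more
# accurately, which can inform the order in which spans are decoded.
print(f"H(single-mask) = {h_single:.2f} nats, H(double-mask) = {h_double:.2f} nats")
```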
Implications and Future Directions
The paper underscores the risk of relying on LLM outputs as probability distributions, highlighting systemic inconsistencies. Given that joint probabilities are crucial for applications such as robust ranking and sequence generation, these findings point to a critical need for improved model design or post-processing strategies to ensure probabilistic reliability.
The work sets a foundation for future research in AI and NLP, encouraging exploration into model training and architecture adjustments to tackle identified deficiencies. The systemic insights into entropy correlations and model scale effects can guide developers in integrating probabilistic reasoning with current deep learning frameworks, potentially benefiting broader generative modeling tasks.
Further exploration could involve extending the testing methodology to account for more intricate spans and diverse linguistic structures, ultimately enhancing the fidelity and interpretability of LLMs as probabilistic systems.