- The paper introduces ConTestS, a framework that systematically evaluates the consistency of span probability estimates in language models using two-sided Wilcoxon signed-rank tests.
- It finds that while larger masked language models improve prediction consistency, larger autoregressive models exhibit increased inconsistencies.
- The study reveals that prediction entropies indicate estimation accuracy, offering actionable insights to refine decoding strategies in NLP applications.
Consistency Testing of Span Probabilities in LLMs
The paper introduces ConTestS (Consistency Testing over Spans), a framework that systematically evaluates the consistency of the probabilities large language models (LLMs) assign to word spans. The work goes beyond traditional calibration methods to scrutinize whether LLMs produce consistent probability values across different orders of conditioning and completion, a critical assessment given the widespread use of LLM scores as probability estimates in practical applications.
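To make the consistency criterion concrete: for a two-token span (w1, w2) in a context c, the chain rule gives two equivalent factorizations of the joint probability. The notation below is ours, a restatement of the chain rule rather than the paper's exact formulation; a self-consistent model should produce the same value either way, so the log-discrepancy d should be zero:

```latex
% Two chain-rule factorizations of a two-token span's joint probability
P(w_1, w_2 \mid c) = P(w_1 \mid c)\,P(w_2 \mid c, w_1)
                   = P(w_2 \mid c)\,P(w_1 \mid c, w_2)

% Log-discrepancy between the two estimates; zero for a consistent model
d = \log P(w_1 \mid c) + \log P(w_2 \mid c, w_1)
  - \log P(w_2 \mid c) - \log P(w_1 \mid c, w_2)
```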
Framework Methodology
The authors developed ConTestS to quantify the consistency of probability estimates in both masked language models (MLMs) and autoregressive models. They apply statistical tests of score consistency to real datasets released after the models' training cutoffs, as well as synthetic data, to rule out memorization effects. Key features of the framework include:
- Handling of Autoregressive Models: Accounts for the fact that autoregressive models were not explicitly designed for masked language modeling tasks.
- Statistical Analysis: Uses two-sided Wilcoxon signed-rank tests to check whether discrepancies (differences between span probability estimates obtained via different conditioning orders) are symmetrically distributed around zero, screening for significant inconsistencies (see the sketch after this list).
- Data Selection: Uses datasets that were not part of the models' training data to ensure unbiased evaluation.
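A minimal sketch of the significance test described above, assuming the per-span discrepancies (differences between the two chain-rule estimates of each span's log-probability) have already been computed; the values below are illustrative placeholders, not the paper's data:

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical per-span discrepancies: difference between two chain-rule
# estimates of the same span's log-probability (zero for a consistent model).
discrepancies = np.array([0.12, -0.05, 0.30, 0.02, -0.11, 0.08, 0.21, -0.03])

# Two-sided Wilcoxon signed-rank test: H0 = discrepancies are symmetrically
# distributed around zero, i.e. neither conditioning order is systematically
# assigned higher probability.
stat, p_value = wilcoxon(discrepancies, alternative="two-sided")
print(f"W = {stat:.1f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Significant inconsistency between the two estimation orders.")
```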
Main Findings
The investigation demonstrated that both MLMs and autoregressive models exhibit inconsistent predictions, with autoregressive models showing the more pronounced discrepancies. Interestingly, larger MLMs provide improved prediction consistency, whereas larger autoregressive models exhibit increased inconsistencies. This opposite scaling trend underscores the different underlying mechanics of the two model families.
Notably, the paper finds that prediction entropies are informative indicators of a model's estimation accuracy and can help refine decoding strategies. The implication is that higher entropy for a single-mask prediction paired with lower entropy for a double-mask prediction may yield better token predictions, as sketched below.
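A minimal sketch of the two entropies this heuristic compares, assuming access to the model's logits for one span position under two mask configurations; the variable names, vocabulary size, and random placeholder logits are our illustrative assumptions:

```python
import numpy as np

def entropy(logits: np.ndarray) -> float:
    """Shannon entropy (in nats) of the softmax distribution over the vocabulary."""
    probs = np.exp(logits - logits.max())  # numerically stable softmax
    probs /= probs.sum()
    return float(-(probs * np.log(probs + 1e-12)).sum())

rng = np.random.default_rng(0)

# Hypothetical logits for one target position under two mask configurations:
# only this position masked (single-mask) vs. both span positions masked
# (double-mask). A vocabulary size of 32,000 is assumed for illustration.
single_mask_logits = rng.standard_normal(32_000)
double_mask_logits = rng.standard_normal(32_000)

h_single = entropy(single_mask_logits)
h_double = entropy(double_mask_logits)

# Heuristic suggested by the paper's entropy analysis: positions with high
# single-mask entropy but low double-mask entropy tend to be predicted more
# accurately, which can inform the order in which spans are decoded.
print(f"H(single-mask) = {h_single:.2f} nats, H(double-mask) = {h_double:.2f} nats")
```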
Implications and Future Directions
The paper underscores the risk of relying on LLM outputs as probability distributions, highlighting systemic inconsistencies. Given that joint probabilities are crucial for applications such as robust ranking and sequence generation, these findings point to a critical need for improved model design or post-processing strategies to ensure probabilistic reliability.
The work sets a foundation for future research in AI and NLP, encouraging exploration into model training and architecture adjustments to tackle identified deficiencies. The systemic insights into entropy correlations and model scale effects can guide developers in integrating probabilistic reasoning with current deep learning frameworks, potentially benefiting broader generative modeling tasks.
Further exploration could involve extending the testing methodology to account for more intricate spans and diverse linguistic structures, ultimately enhancing the fidelity and interpretability of LLMs as probabilistic systems.