Exploring Predictive Uncertainty and Calibration in NLP: A Study on the Impact of Method & Data Scarcity (2210.15452v1)

Published 20 Oct 2022 in cs.CL, cs.AI, and cs.LG

Abstract: We investigate the problem of determining the predictive confidence (or, conversely, uncertainty) of a neural classifier through the lens of low-resource languages. By training models on sub-sampled datasets in three different languages, we assess the quality of estimates from a wide array of approaches and their dependence on the amount of available data. We find that while approaches based on pre-trained models and ensembles achieve the best results overall, the quality of uncertainty estimates can surprisingly suffer with more data. We also perform a qualitative analysis of uncertainties on sequences, discovering that a model's total uncertainty seems to be influenced to a large degree by its data uncertainty, not model uncertainty. All model implementations are open-sourced in a software package.

This paper investigates predictive uncertainty estimation and calibration in NLP, examining how different quantification methods perform under varying levels of data availability, with a focus on low-resource language scenarios (Ulmer et al., 2022). The research aims to understand the reliability of confidence scores produced by neural classifiers, a factor crucial for deploying trustworthy NLP systems, especially in domains requiring high safety or handling ambiguity. The investigation centers on three research questions: (RQ1) identifying the most effective approaches for uncertainty quality and calibration, (RQ2) understanding the impact of training data scarcity on model performance and uncertainty estimation, and (RQ3) characterizing the qualitative differences in how various methods estimate uncertainty.

Methodology and Experimental Setup

The paper employs a rigorous experimental design to compare uncertainty quantification techniques across different model architectures and data regimes.

Models Evaluated

Eight distinct models were selected, spanning two major architectural families:

  • LSTM-based:
    • Standard LSTM (Baseline)
    • Variational LSTM (MC Dropout)
    • Bayesian LSTM (Bayes-by-backprop)
    • ST-τ LSTM (Finite-state automaton transition probabilities)
    • LSTM Ensemble (Deep Ensemble)
  • Transformer-based (BERT variants):
    • Variational Transformer (MC Dropout on BERT)
    • SNGP Transformer (Spectral-normalized Neural Gaussian Process on BERT embeddings)
    • DDU Transformer (Deep Deterministic Uncertainty using a Gaussian Mixture Model on BERT embeddings)

LSTM models were trained from scratch, while Transformer models utilized pre-trained checkpoints (BERT-base-uncased, bert-base-danish-uncased, FinBERT) and were fine-tuned on the target tasks.
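
To make the sampling-based variants above concrete, here is a minimal PyTorch-style sketch (not the authors' released package) of how MC Dropout and a deep ensemble turn repeated forward passes into a single predictive distribution; it assumes `model(inputs)` returns class logits.

```python
import torch

def mc_dropout_predict(model, inputs, n_samples=10):
    """MC Dropout: average softmax outputs over stochastic forward passes
    with dropout kept active at inference time."""
    model.train()  # keeps dropout active; in practice only the dropout modules may be switched
    with torch.no_grad():
        probs = torch.stack(
            [torch.softmax(model(inputs), dim=-1) for _ in range(n_samples)]
        )  # shape: (n_samples, batch, num_classes)
    return probs.mean(dim=0), probs

def ensemble_predict(models, inputs):
    """Deep ensemble: average softmax outputs of independently trained members."""
    with torch.no_grad():
        probs = torch.stack([torch.softmax(m(inputs), dim=-1) for m in models])
    return probs.mean(dim=0), probs
```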

Datasets and Tasks

Experiments were conducted across three distinct languages and tasks to ensure generality and simulate diverse resource conditions:

  • English: Intent Classification (Clinc Plus dataset, including out-of-scope intents for OOD evaluation).
  • Danish: Named Entity Recognition (Dan+ News dataset, using Danish Twitter data for OOD evaluation).
  • Finnish: Part-of-Speech Tagging (Finnish UD Treebank, using diverse genres like medical records, legal texts, and poetry for OOD evaluation).

Data Scarcity Simulation

To address RQ2, training datasets were sub-sampled systematically. A sampling scheme was designed to create datasets of varying sizes while preserving the original distribution of sequence lengths and class labels as much as possible. This allowed for controlled analysis of the effect of training data volume.
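
The paper's exact sampling scheme is not reproduced here; the sketch below shows one plausible implementation that stratifies by class label and binned sequence length so both distributions are roughly preserved in each sub-sample. The `dataset` format, field names, and bin width are illustrative assumptions, and for the sequence-labelling tasks `"label"` would have to be some sequence-level proxy.

```python
import random
from collections import defaultdict

def subsample(dataset, target_size, length_bin=10, seed=42):
    """Sub-sample a list of examples (dicts with 'tokens' and 'label') while
    roughly preserving the joint distribution of labels and binned lengths."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for example in dataset:
        key = (example["label"], len(example["tokens"]) // length_bin)
        strata[key].append(example)

    fraction = target_size / len(dataset)
    sample = []
    for examples in strata.values():
        k = max(1, round(fraction * len(examples)))  # proportional allocation per stratum
        sample.extend(rng.sample(examples, min(k, len(examples))))
    rng.shuffle(sample)
    return sample[:target_size]
```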

Uncertainty and Calibration Metrics

A comprehensive suite of metrics was used for evaluation:

  • Task Performance: Accuracy and Macro F1-score on both in-distribution (ID) and out-of-distribution (OOD) test sets.
  • Calibration:
    • Expected Calibration Error (ECE)
    • Adaptive Calibration Error (ACE)
    • Coverage Percentage (at 95% confidence level)
    • Average Prediction Set Width
  • Uncertainty Quality:
    • ID/OOD Discrimination: Area Under the ROC Curve (AUROC) and Area Under the Precision-Recall Curve (AUPR) based on uncertainty scores classifying ID vs. OOD samples.
    • Correlation with Error: Kendall's Tau (τ) correlation coefficient between uncertainty scores and model loss (negative log-likelihood), measured at both the token level (Token τ) and sequence level (Sequence τ).
  • Uncertainty Metrics: Various scores were computed depending on the model; a sketch of how these scores are computed follows this list:
    • Single prediction: Maximum softmax probability, softmax-gap, predictive entropy, Dempster-Shafer metric.
    • Multiple predictions (Ensembles, MC Dropout, Bayesian): Predictive variance, mutual information (approximating epistemic uncertainty).
    • Model-specific: Log-probability under the GMM (for DDU).
    • Sequence Uncertainty: Mean of token-level uncertainties.
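
A minimal NumPy sketch of how the per-token scores above can be computed from model outputs; the function names are illustrative, the Dempster-Shafer score is written in the common K / (K + Σ exp(logits)) form, and a sequence-level score would simply average the token-level values.

```python
import numpy as np

def single_prediction_scores(probs, logits=None):
    """Scores from one softmax distribution `probs` of shape (num_classes,);
    larger values indicate higher uncertainty."""
    top2 = np.sort(probs)[::-1][:2]
    scores = {
        "one_minus_max_prob": 1.0 - top2[0],                  # maximum softmax probability
        "one_minus_softmax_gap": 1.0 - (top2[0] - top2[1]),   # gap between top-2 classes
        "predictive_entropy": float(-np.sum(probs * np.log(probs + 1e-12))),
    }
    if logits is not None:  # Dempster-Shafer metric operates on the raw logits
        num_classes = len(logits)
        scores["dempster_shafer"] = num_classes / (num_classes + float(np.sum(np.exp(logits))))
    return scores

def multi_prediction_scores(sampled_probs):
    """Scores from multiple predictions (ensemble members or MC Dropout samples);
    `sampled_probs` has shape (n_samples, num_classes)."""
    mean_p = sampled_probs.mean(axis=0)
    total_entropy = -np.sum(mean_p * np.log(mean_p + 1e-12))
    expected_entropy = -np.mean(np.sum(sampled_probs * np.log(sampled_probs + 1e-12), axis=1))
    return {
        "predictive_variance": float(sampled_probs.var(axis=0).mean()),
        "mutual_information": float(total_entropy - expected_entropy),  # epistemic proxy
    }
```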

Statistical significance for model comparisons was assessed using the Almost Stochastic Order (ASO) test.
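
Of the evaluation metrics above, ECE, the AUROC-based ID/OOD discrimination, and the uncertainty-error Kendall's τ can be sketched as follows (array names and shapes are illustrative; this is not the paper's evaluation code).

```python
import numpy as np
from scipy.stats import kendalltau
from sklearn.metrics import roc_auc_score

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: |accuracy - mean confidence| per equal-width confidence bin,
    weighted by the fraction of predictions falling into that bin."""
    confidences = np.asarray(confidences)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return ece

def ood_discrimination_auroc(uncertainties, is_ood):
    """AUROC for separating ID (0) from OOD (1) samples by uncertainty score."""
    return roc_auc_score(is_ood, uncertainties)

def uncertainty_error_tau(uncertainties, losses):
    """Kendall's tau between uncertainty and per-token / per-sequence NLL loss;
    higher tau means uncertainty tracks actual errors more closely."""
    tau, _ = kendalltau(uncertainties, losses)
    return tau
```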

Key Findings and Analysis

The experimental results provide nuanced insights into the performance and behavior of different uncertainty quantification methods under data scarcity.

RQ1: Best Approaches for Uncertainty Quality and Calibration

  • In-Distribution (ID) Performance: Transformer-based models fine-tuned on sufficient data (particularly DDU BERT) and LSTM Ensembles generally demonstrated superior task performance (Accuracy, F1) and better calibration metrics (lower ECE/ACE, better coverage/width trade-off) compared to other methods.
  • Out-of-Distribution (OOD) Performance: The picture is more complex for OOD data. While Transformer-based models often excelled at discriminating between ID and OOD samples (high AUROC/AUPR), their uncertainty scores showed a weaker correlation with actual model errors on OOD data (lower Kendall's Tau) compared to some LSTM-based approaches, especially LSTM Ensembles. This suggests that high OOD detection capability does not necessarily translate to reliable uncertainty estimation regarding prediction correctness on those OOD samples.
  • LSTM Ensembles: Deep Ensembles of LSTMs emerged as a consistently strong and robust method, frequently matching or surpassing single fine-tuned Transformers in task performance and demonstrating more reliable uncertainty-error correlation, particularly on OOD data.
  • Uncertainty Metrics: No single uncertainty metric (e.g., entropy, max probability, variance) was universally superior. Metrics directly derived from the output distribution (max probability, softmax gap, Dempster-Shafer) often exhibited stronger correlations with token-level errors.

RQ2: Impact of Data Scarcity

  • General Trend: As expected, increasing the amount of training data generally led to improvements in task performance and the quality of uncertainty estimates on ID data, especially for pre-trained models undergoing fine-tuning.
  • Paradoxical OOD Effect: A significant and counter-intuitive finding emerged for pre-trained Transformer models: increasing the amount of fine-tuning data often resulted in a decrease in the quality of uncertainty estimates on OOD data. Specifically, the correlation between the model's uncertainty score and its prediction error on OOD samples (measured by Kendall's Tau) weakened as more ID fine-tuning data was used. This degradation was less pronounced or absent for LSTM models trained from scratch.
  • Implication: This suggests that extensive fine-tuning on ID data might cause pre-trained models to become overly specialized, potentially losing representations useful for assessing uncertainty on OOD inputs or making their confidence less indicative of correctness when faced with distribution shifts. Models trained from scratch might maintain more generalizable uncertainty representations relative to their performance level.

RQ3: Qualitative Differences in Uncertainty Estimation

  • Sequence-Level Uncertainty: Qualitative inspection showed some agreement across models, such as lower uncertainty assigned to punctuation and higher uncertainty to sub-word tokens or less frequent words.
  • Uncertainty Decomposition: An attempt to decompose total uncertainty (predictive entropy) into aleatoric (data uncertainty) and epistemic (model uncertainty) components, using mutual information as a proxy for the latter, indicated that total uncertainty was often dominated by the aleatoric component. This implies that inherent data ambiguity or noise frequently plays a larger role in the overall predictive uncertainty than the model's parameter uncertainty, particularly for methods like MC Dropout and Bayesian LSTMs evaluated here. Epistemic uncertainty, while smaller, showed more variability across different models and methods as anticipated.
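
The decomposition referred to here is the standard entropy decomposition for Bayesian or ensemble predictive distributions, written out below for reference (the notation is ours and may differ from the paper's):

```latex
\underbrace{\mathcal{H}\Big[\,\mathbb{E}_{p(\theta\mid\mathcal{D})}\,p(y\mid x,\theta)\,\Big]}_{\text{total uncertainty (predictive entropy)}}
\;=\;
\underbrace{\mathbb{E}_{p(\theta\mid\mathcal{D})}\Big[\mathcal{H}\big[p(y\mid x,\theta)\big]\Big]}_{\text{aleatoric (data) uncertainty}}
\;+\;
\underbrace{I\big(y;\theta\mid x,\mathcal{D}\big)}_{\text{epistemic (model) uncertainty}}
```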

Implications for Practitioners

The paper's findings have direct implications for applying uncertainty quantification in real-world NLP systems:

  • Method Selection: The choice of uncertainty method should depend heavily on the operational context. For applications primarily concerned with ID performance and calibration, fine-tuned Transformers (like DDU BERT) or Ensembles are strong candidates, provided sufficient training data. However, if reliable uncertainty estimation under potential distribution shifts (OOD) is critical, LSTM Ensembles might offer greater robustness, even if their peak ID performance is slightly lower. Relying solely on OOD detection metrics (AUROC/AUPR) from fine-tuned models can be misleading about their error prediction capability on OOD data.
  • Data Availability: In low-resource scenarios, LSTM Ensembles can be particularly effective, potentially outperforming fine-tuned Transformers trained on limited data. The paradoxical effect observed with fine-tuning suggests caution: simply adding more ID fine-tuning data might improve ID performance but could compromise the reliability of uncertainty signals for OOD inputs. Evaluating OOD uncertainty quality (e.g., using Kendall's Tau with OOD errors) is crucial, especially when fine-tuning pre-trained models extensively.
  • Uncertainty Interpretation: The dominance of aleatoric uncertainty in total predictive uncertainty suggests fundamental limits to how much model-focused techniques (like MC Dropout or Bayesian methods estimating epistemic uncertainty) can improve overall uncertainty quality in certain NLP tasks. The inherent ambiguity or variability in the language data itself might be the primary driver of uncertainty.

Conclusion

This research provides a valuable empirical analysis of predictive uncertainty and calibration methods in NLP across diverse models, languages, and data regimes (Ulmer et al., 2022). It highlights the strengths of fine-tuned Transformers and ensembles on in-distribution tasks but reveals a critical vulnerability: the quality of OOD uncertainty estimates degrades as pre-trained models are fine-tuned on more data. LSTM ensembles emerge as a robust alternative, particularly when OOD reliability or data scarcity is a concern. The findings underscore the necessity of evaluating uncertainty not just through calibration and OOD detection metrics but also through direct correlation with model errors, especially under distribution shifts, to build genuinely reliable NLP systems.

Authors
  1. Dennis Ulmer
  2. Jes Frellsen
  3. Christian Hardmeier