
Do Not Design, Learn: A Trainable Scoring Function for Uncertainty Estimation in Generative LLMs (2406.11278v2)

Published 17 Jun 2024 in cs.CL

Abstract: Uncertainty estimation (UE) of generative LLMs is crucial for evaluating the reliability of generated sequences. A significant subset of UE methods utilize token probabilities to assess uncertainty, aggregating multiple token probabilities into a single UE score using a scoring function. Existing scoring functions for probability-based UE, such as length-normalized scoring and semantic contribution-based weighting, are designed to solve certain aspects of the problem but exhibit limitations, including the inability to handle biased probabilities and complex semantic dependencies between tokens. To address these issues, in this work, we propose Learnable Response Scoring (LARS) function, a novel scoring function that leverages supervised data to capture complex dependencies between tokens and probabilities, thereby producing more reliable and calibrated response scores in computing the uncertainty of LLM generations. Our comprehensive experiments across question-answering and arithmetical reasoning tasks with various datasets demonstrate that LARS significantly outperforms existing scoring functions, achieving improvements of up to 16% in AUROC score.

Learnable Scoring Function for Uncertainty Estimation in Generative LLMs

The paper "Do Not Design, Learn: A Trainable Scoring Function for Uncertainty Estimation in Generative LLMs" by Yaldiz et al. presents a novel approach to Uncertainty Estimation (UE) in generative LLMs through a learned scoring function named Learnable Response Scoring Function (LARS). This contributes to improving the reliability of probability-based UE methods over traditional heuristic designs.

Introduction

Generative LLMs have transformed various domains through their advanced language comprehension and generation capabilities. However, estimating the uncertainty of their responses remains critical to mitigating potentially misleading outputs. Common UE methods fall into two primary categories: probability-based methods, which aggregate token probabilities, and non-probability-based methods, which rely on other signals. In this work, the authors focus on enhancing probability-based UE by addressing the shortcomings of existing scoring functions such as Length-Normalized Scoring (LNS), MARS, and TokenSAR.
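
To ground the discussion, the following is a minimal sketch of length-normalized scoring, the simplest of these functions; the helper name and example values are illustrative, not taken from the paper.

```python
import math

def length_normalized_score(token_logprobs):
    """Length-normalized scoring (LNS): the geometric mean of the token
    probabilities, computed as exp of the mean token log-probability.
    Higher values indicate higher model confidence in the sequence."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))

# Example: a three-token answer with token probabilities 0.9, 0.8, 0.1.
logps = [math.log(p) for p in (0.9, 0.8, 0.1)]
confidence = length_normalized_score(logps)  # ~0.416
uncertainty = 1.0 - confidence               # usable as a UE score
```

A single low-probability token drags the score down regardless of whether it carries any meaning, which is the semantic blindness that MARS and TokenSAR attempt to patch with token weighting.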

Key Contributions

  1. Critique and Analysis of Current Scoring Functions: The paper provides a comprehensive critique of existing scoring functions, highlighting three main issues:
    • Manual Design Limitations: Heuristic scoring functions cannot fully model the dependencies between tokens and their probabilities; all of them instantiate the weighted-aggregation template sketched after this list.
    • Biased Probabilities: Current methods fail to account for biases inherent in token probabilities, especially towards specific entities.
    • Challenges with Low-Resource Languages: Techniques designed around English degrade in morphologically distinct languages, as evidenced by their weaker performance on Turkish.
  2. Introduction of LARS: To overcome these limitations, LARS is proposed as a data-driven scoring function. It uses supervised data to learn how token probabilities relate to the textual input and the generated sequence, rather than encoding that relationship by hand.
  3. Empirical Validation: Rigorous experimentation across various datasets demonstrates that LARS significantly outperforms existing methods in terms of AUROC, thereby providing more precise UE.
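
The functions critiqued in item 1 all fit a common weighted-aggregation template, a weighted geometric mean of token probabilities. The sketch below shows that template under simplifying assumptions; the actual weight derivations in MARS and TokenSAR are more involved than shown here.

```python
import math

def weighted_score(token_logprobs, weights):
    """Shared template behind LNS, MARS, and TokenSAR: a weighted
    geometric mean of token probabilities. LNS uses uniform weights;
    MARS and TokenSAR instead weight tokens by their semantic
    contribution or relevance."""
    total = sum(weights)
    return math.exp(
        sum((w / total) * lp for lp, w in zip(token_logprobs, weights))
    )

# With uniform weights this reduces exactly to LNS.
logps = [math.log(p) for p in (0.9, 0.8, 0.1)]
assert abs(weighted_score(logps, [1, 1, 1]) - 0.416) < 0.001
```

LARS abandons this template: rather than choosing weights by hand, it feeds the tokens and their probabilities to a trained model and learns the aggregation end to end.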

Methodology

The LARS model is built on top of a pre-trained RoBERTa model, with an additional linear layer fine-tuned on calibration datasets derived from multiple model generations. The input to LARS is a sequence comprising the question tokens, the generated answer tokens, and the corresponding token probabilities; the probabilities are represented through a few-hot vector encoding based on quantile partitioning, as sketched below.
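
As an illustration, here is one plausible reading of the quantile-based few-hot encoding; the thermometer-style construction, the bin count, and all names below are assumptions made for this sketch rather than the paper's exact specification.

```python
import numpy as np

def few_hot_encode(probs, bin_edges):
    """Thermometer-style few-hot encoding over quantile bins: a
    probability falling in bin j gets a 0/1 vector whose first j+1
    entries are set. One plausible reading of the paper's encoding,
    not its exact construction."""
    num_bins = len(bin_edges) + 1
    encoded = np.zeros((len(probs), num_bins), dtype=np.float32)
    for row, p in enumerate(probs):
        j = int(np.searchsorted(bin_edges, p, side="right"))
        encoded[row, : j + 1] = 1.0
    return encoded

# Quantile edges estimated from token probabilities in a calibration set.
calibration_probs = np.random.rand(10_000)          # stand-in data
edges = np.quantile(calibration_probs, [0.25, 0.5, 0.75])
vectors = few_hot_encode([0.05, 0.6, 0.97], edges)  # shape (3, 4)
```

Under this reading, the resulting 0/1 vectors accompany the question and answer tokens as input to the RoBERTa-based scorer.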

Experiments reveal that LARS maintains superior calibration and performance, especially in low-resource scenarios, indicating its robustness across LLMs with diverse training footprints.

Experimental Setup

The experiments span UE tests over three datasets: TriviaQA, NaturalQA, and WebQA, with LARS models trained individually for each dataset and model combination. The results indicate that LARS provides a substantial improvement in UE metrics over traditional methods, even when applied to data outside its training distribution, and demonstrates robustness across different LLM paradigms.
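
For context, AUROC measures how well an uncertainty score separates correct from incorrect generations; below is a minimal sketch of the metric on made-up scores and labels, assuming scikit-learn is available.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical scores from a scoring function (LNS, MARS, TokenSAR, or
# LARS) and binary correctness labels for five generated answers.
confidences = np.array([0.92, 0.71, 0.77, 0.15, 0.63])
correct = np.array([1, 0, 1, 0, 1])

# AUROC: the probability that a randomly chosen correct answer receives
# a higher confidence than a randomly chosen incorrect one.
print(f"AUROC = {roc_auc_score(correct, confidences):.3f}")  # 0.833
```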

Results

The primary findings from the experiments are:

  • AUROC Scores: LARS surpasses state-of-the-art scoring functions like LNS, MARS, and TokenSAR by significant margins across multiple datasets.
  • Scalability: LARS scales with the number of questions in the calibration dataset, maintaining or improving performance as the dataset grows.
  • Entity Bias Correction: By learning from data rather than trusting raw probabilities, LARS recalibrates biased probability distributions, improving accuracy particularly for entity-specific responses.

Implications

Theoretically, this research demonstrates the viability of learning-based scoring functions over heuristic designs, since a learned function can adapt to complex token dependencies and varied language constructs. Practically, it enables more reliable and scalable UE in high-stakes applications of LLMs across diverse linguistic and contextual settings.

Future Directions

Future research can further investigate:

  • Further Scaling: Extending the scale of calibration data to test the limits of LARS.
  • Noisy Label Resilience: Exploring LARS under diverse noisy labeling environments to enhance robustness.
  • Integrative Models: Combining LARS with complementary non-probability-based UE approaches to further broaden its applicability and accuracy.

Conclusion

LARS presents a compelling, data-driven alternative to manually crafted scoring functions for UE in generative LLMs. By directly learning from data, LARS offers superior performance, scalable calibration, and robustness across linguistic domains, setting a new standard for UE in generative LLMs.

Authors (8)
  1. Duygu Nur Yaldiz
  2. Yavuz Faruk Bakman
  3. Baturalp Buyukates
  4. Chenyang Tao
  5. Anil Ramakrishna
  6. Dimitrios Dimitriadis
  7. Salman Avestimehr
  8. Jieyu Zhao