- The paper introduces a novel quantile-based deferral rule that enhances inference efficiency in LM cascades by pinpointing token-level uncertainties.
- It employs a learned post-hoc model integrating quantile features and intermediate embeddings to outperform traditional sum and average measures.
- The study provides actionable insights for balancing computational cost with performance, paving the way for adaptive generative modeling.
Understanding Deferral Rules in Generative LLM Cascades
Introduction to LLM Cascades
Large language models (LLMs), particularly those based on Transformer architectures, have demonstrated significant advances across a spectrum of NLP tasks. The computational cost of these models has risen correspondingly, prompting research into cost-efficient inference strategies. Among these strategies, cascades stand out for their simplicity and effectiveness. A cascade uses a smaller, less resource-intensive model to handle simpler instances and relies on a larger, more capable model for more complex ones. This approach promises a favorable cost-quality tradeoff through adaptive inference, where the decision to escalate an instance from the smaller to the larger model is governed by a deferral rule.
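To fix ideas, here is a minimal sketch of that pattern. The names (`small_model`, `large_model`, `deferral_score`) are hypothetical placeholders rather than any particular library's API; the only assumption is that the small model exposes per-token log-probabilities for its output.

```python
def cascade_generate(prompt, small_model, large_model, deferral_score, threshold):
    """Two-model cascade: accept the small model's output unless the
    deferral score (higher = less confident) exceeds the threshold."""
    draft, token_logprobs = small_model.generate(prompt)   # per-token log-probs assumed available
    if deferral_score(token_logprobs) > threshold:
        final, _ = large_model.generate(prompt)             # defer: pay the large-model cost
        return final
    return draft                                            # accept the cheap prediction
```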
Challenges in Deferral Rules for Generative LMs
Deferral rules, designed to gauge whether the larger model needs to be engaged, have historically pivoted on the model's confidence in its prediction. Such confidence is straightforward to obtain in classification tasks, where it typically corresponds to the softmax probability of the predicted class. Extending this concept to generative LMs, which produce sequences rather than a fixed set of classes, introduces complexity. The softmax probability of a generated sequence, while a direct analog, falls short because of its sensitivity to sequence length: raw sequence likelihood indiscriminately defers longer sequences, while length-normalized corrections tend to defer shorter ones disproportionately. Addressing this challenge requires a more nuanced approach that accounts for the uncertainty of each token in the sequence as well as the variance in sequence lengths.
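The following sketch makes that length sensitivity concrete by comparing two common confidence aggregates: the negative sequence log-probability (a sum over tokens) and its length-normalized average. The numbers are illustrative only and are not drawn from the paper's experiments.

```python
import numpy as np

def sum_logprob_score(token_logprobs):
    # Negative sequence log-probability: grows with output length, so a long
    # but uniformly confident generation can still look "uncertain".
    return -np.sum(token_logprobs)

def avg_logprob_score(token_logprobs):
    # Length-normalized variant: removes the length dependence, but a single
    # uncertain token dominates short outputs and is diluted in long ones.
    return -np.mean(token_logprobs)

confident_long = np.full(100, -0.05)          # 100 tokens, all confident
short_with_spike = np.array([-0.05, -3.0])    # 2 tokens, one very uncertain

print(sum_logprob_score(confident_long), sum_logprob_score(short_with_spike))  # ~5.0 vs 3.05
print(avg_logprob_score(confident_long), avg_logprob_score(short_with_spike))  # 0.05 vs 1.525
```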
Innovations in Deferral Rules: Quantiles as a Solution
Quantiles of the vector of token-level uncertainties emerge as a powerful alternative to aggregate measures like sum or average. They offer a granular view of uncertainty across a sequence, allowing for a more discriminative identification of instances warranting deferral. Initial empirical assessments validate the superiority of quantile-based deferral rules over their sum and average counterparts across various NLP benchmarks. Nevertheless, the variability in optimal quantile values across tasks underscores the need for a dynamic, context-aware deferral strategy.
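As a concrete illustration, a quantile-based rule can be as simple as the sketch below, which scores a generation by a chosen quantile of its per-token negative log-probabilities; the quantile level q is a tunable knob, and as noted above its best value varies by task. Deferral then reduces to comparing this score against a threshold chosen to hit a target deferral rate.

```python
import numpy as np

def quantile_deferral_score(token_logprobs, q=0.8):
    """Score a generation by the q-th quantile of per-token negative
    log-probabilities. High quantiles focus on the most uncertain tokens
    irrespective of sequence length."""
    return float(np.quantile(-np.asarray(token_logprobs), q))
```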
Towards a Post-hoc Learned Deferral Rule
Building on the promise of quantiles, we propose a learned post-hoc deferral rule that utilizes these quantiles as features. This model is trained to discern when deferment to a larger model is beneficial, effectively synthesizing insights drawn from various quantile measures. Preliminary results are promising, showcasing a consistent performance uplift relative to non-learned baselines.
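A minimal sketch of such a learned rule is shown below, using a logistic regression over a fixed grid of quantile features. The quantile grid, the construction of training labels, and the use of scikit-learn are illustrative assumptions rather than the paper's exact recipe.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

QUANTILES = [0.1, 0.3, 0.5, 0.7, 0.9]   # assumed feature grid; a design choice

def quantile_features(token_logprobs):
    # Summarize the variable-length uncertainty vector as fixed-length quantiles.
    return np.quantile(-np.asarray(token_logprobs), QUANTILES)

def fit_deferral_rule(train_logprobs, defer_labels):
    # defer_labels[i] = 1 if the large model's output beat the small model's
    # on example i under some quality metric (i.e. deferral would have helped).
    X = np.stack([quantile_features(lp) for lp in train_logprobs])
    return LogisticRegression().fit(X, defer_labels)

def should_defer(rule, token_logprobs, threshold=0.5):
    prob = rule.predict_proba(quantile_features(token_logprobs).reshape(1, -1))[0, 1]
    return prob > threshold   # threshold sweeps out the cost-quality tradeoff
```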
Moreover, the integration of intermediate embeddings from the larger model into the deferral rule presents an intriguing avenue. While incorporating such embeddings incurs additional computational overhead, the resultant performance gains, particularly in tasks with succinct answers, highlight a valuable tradeoff between inference cost and deferral accuracy.
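One simple way to wire such embeddings into the learned rule is sketched below, under the assumption that a pooled hidden state from an early layer of the larger model is available (obtained at the cost of a partial forward pass); the specific layer and mean-pooling choice are illustrative.

```python
import numpy as np

def combined_features(token_logprobs, large_model_hidden_states):
    """Concatenate quantile features with a pooled intermediate embedding.
    `large_model_hidden_states` is assumed to be a (seq_len, hidden_dim) array
    taken from an early layer of the larger model; obtaining it is the extra
    inference cost discussed above."""
    quantiles = np.quantile(-np.asarray(token_logprobs), [0.1, 0.3, 0.5, 0.7, 0.9])
    pooled = np.asarray(large_model_hidden_states).mean(axis=0)  # mean-pool over tokens
    return np.concatenate([quantiles, pooled])
```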
Implications and Future Directions
The exploration of deferral rules in LM cascades marks a key step towards computationally efficient language modeling. This research illustrates the limitations of straightforward confidence measures in the context of generative tasks and showcases the potential of more sophisticated, learned approaches. The findings encourage further investigation into how best to harness model internals, such as intermediate embeddings, to refine the deferral decision.
Future work may explore alternative architectures for the deferral rule model and extend the evaluation to a wider range of LMs, including those with differing pre-training regimes or architectural nuances. Additionally, the impact of fine-tuning strategies on model calibration and, by extension, on deferral rule efficacy merits further examination. This line of inquiry not only deepens our understanding of adaptive inference in language modeling but also guides the development of LMs that balance the dual imperatives of performance and computational efficiency.