
Bi-directional Model Cascading with Proxy Confidence (2504.19391v1)

Published 27 Apr 2025 in cs.LG

Abstract: Model Cascading, recently applied successfully to LLMs, is a simple but powerful technique that improves the efficiency of inference by selectively applying models of varying sizes. Models are used in sequence from smallest to largest, only deferring samples to large, costly models when smaller models are not sufficiently confident. Existing approaches to deferral use only limited small model confidence estimates because of the inaccessibility of the large model, although large model confidence is known to be important. We therefore propose a bi-directional approach to deferral that considers the confidence of small and large models in the cascade simultaneously through the use of a proxy for the large model. This requires a richer representation of model confidence to enable comparative calibration: we use an analysis of hidden states to improve post-invocation confidence of the small model, which in itself improves cascading results over prior approaches. We then combine this with a tiny proxy model to estimate pre-invocation confidence of the large model. We examine the proposed cascading system over challenging, multiple-choice datasets, finding improvements over standard cascading baselines reflected in reductions in deferrals to more costly models.

Summary

Bi-directional Model Cascading with Proxy Confidence: An Overview

The paper "Bi-directional Model Cascading with Proxy Confidence" introduces an innovative approach to model cascading aimed at enhancing the efficiency and performance of NLP systems using LLMs. The paper discusses a methodology for integrating small and large models in a cascade configuration, using a novel proxy confidence mechanism that evaluates the confidence of both models simultaneously.

Methodology and Approach

Cascading models is a strategy for optimizing computational efficiency by deciding which model in a sequence should handle a given inference task. Traditionally, a cascade defers samples to larger models whenever the smaller model is insufficiently confident in its prediction. The paper identifies limitations in relying solely on the small model's confidence estimates: in particular, this approach fails to identify samples on which the larger model performs worse despite its greater capacity.
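The traditional, uni-directional deferral rule described above can be sketched as follows. This is a minimal illustration, not the paper's method; `small_model` and `large_model` are hypothetical callables returning a probability distribution over answer choices.

```python
def cascade_predict(x, small_model, large_model, threshold=0.8):
    # Standard (uni-directional) cascade: accept the small model's answer
    # when its top-class probability clears the threshold; otherwise defer.
    # small_model / large_model are hypothetical callables returning a
    # probability distribution (list of floats) over answer choices.
    probs = small_model(x)
    confidence = max(probs)
    if confidence >= threshold:
        return probs.index(confidence)          # small model's prediction
    large_probs = large_model(x)
    return large_probs.index(max(large_probs))  # costly large-model prediction
```

Note that the deferral decision here uses only the small model's confidence, which is exactly the limitation the paper targets.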

To address this, the authors propose a bi-directional approach in which confidence measures from both the small and large models are considered, aided by a proxy: a tiny model trained to predict the large model's outputs. The system, inspired by recent advances in LLM calibration, combines layer-wise probability distributions derived from the smaller model's internal states with the proxy's predicted confidence for the larger model, significantly improving calibration.
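One plausible way to extract the layer-wise probability distributions mentioned above is a logit-lens-style projection: apply the model's output head to each intermediate layer's hidden state and softmax over the answer-choice logits. The paper's exact extraction may differ; `output_head` and `answer_ids` are illustrative assumptions.

```python
import math

def layerwise_pseudo_probs(hidden_states, output_head, answer_ids):
    # For each intermediate layer, project its hidden state through the
    # model's output head and softmax over the answer-choice logits.
    # output_head is a hypothetical callable (hidden state -> logit list);
    # the paper's exact extraction mechanism may differ.
    per_layer = []
    for h in hidden_states:
        logits = output_head(h)
        answer_logits = [logits[i] for i in answer_ids]
        m = max(answer_logits)                      # for numerical stability
        exps = [math.exp(z - m) for z in answer_logits]
        total = sum(exps)
        per_layer.append([e / total for e in exps])
    return per_layer
```

The resulting per-layer distributions give a richer picture of the small model's certainty than the final-layer softmax alone.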

Technical Contributions

The paper's core ideas revolve around an enhanced representation of model confidence. Specifically:

  • Backwards Confidence Estimation: It uses an analysis of hidden states from the smaller model to improve post-invocation confidence, leveraging pseudo-probabilities extracted from multiple intermediate layers rather than just the final output layer.
  • Forwards Confidence Estimation: A pre-invocation estimation using a proxy or auxiliary model predicts the confidence of the larger model without running expensive model evaluations. This is particularly useful in cascade systems where querying larger models is resource-intensive.
  • Deferral Decision Model: A meta-model evaluates when to defer tasks to the larger model, integrating these richer confidence representations to optimize performance.
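The three components above could be combined as in the following sketch: backwards-confidence features from the small model's layers and the proxy's forwards-confidence estimate feed a small meta-model that outputs a deferral decision. The linear form, feature layout, and weights are assumptions for illustration, not the paper's architecture.

```python
import math

def deferral_features(layer_probs, proxy_confidence):
    # Backwards confidence: per-layer top-answer pseudo-probabilities from
    # the small model. Forwards confidence: the tiny proxy's estimate of
    # the large model's confidence. Feature layout is illustrative.
    return list(layer_probs) + [proxy_confidence]

def should_defer(features, weights, bias, threshold=0.5):
    # Hypothetical linear meta-model: defer when the estimated benefit of
    # invoking the large model exceeds the threshold.
    score = sum(w * f for w, f in zip(weights, features)) + bias
    prob_defer = 1.0 / (1.0 + math.exp(-score))
    return prob_defer > threshold
```

With weights of this shape, deferral is favored when the small model's layer-wise confidences are low and the proxy predicts the large model will be confident, capturing the bi-directional intuition.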

Experimental Evaluation

The authors conducted experiments on multiple-choice question-answering benchmarks, including the BoolQ, ARC-Easy, ARC-Challenge, MMLU, and CSQA datasets. The proposed bi-directional model consistently outperformed baseline cascades that rely solely on the small model's confidence. In particular, results show reductions of up to 42.5% in deferrals to the more costly large models while maintaining or improving overall system accuracy.

These outcomes suggest the effectiveness of incorporating dual-model confidences, particularly at low deferral rates, where most gains in computational efficiency can be realized.
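The accuracy-versus-deferral-rate trade-off discussed above can be measured with a simple evaluation helper like the one below. This is an illustrative metric, not necessarily the paper's exact protocol: given per-sample correctness for both models and a deferral score per sample, it routes a fixed fraction of the highest-scoring samples to the large model.

```python
def cascade_accuracy_at_rate(small_correct, large_correct, defer_scores, rate):
    # Evaluate a cascade at a fixed deferral budget: route the `rate`
    # fraction of samples with the highest deferral scores to the large
    # model and measure overall accuracy. Inputs are parallel lists over
    # the evaluation set (correctness as 0/1).
    n = len(defer_scores)
    k = int(round(rate * n))
    deferred = set(sorted(range(n), key=lambda i: -defer_scores[i])[:k])
    correct = sum(
        large_correct[i] if i in deferred else small_correct[i]
        for i in range(n)
    )
    return correct / n
```

Sweeping `rate` from 0 to 1 traces the efficiency curve on which the paper's low-deferral-rate gains would appear.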

Implications and Future Directions

On a practical level, this approach can significantly decrease the environmental and economic costs associated with deploying LLMs by reducing unnecessary computation. The theoretical implications extend into improved calibration techniques, potentially advancing more trustworthy AI applications where model reliability and confidence are critical.

The paper suggests future research directions into using cascades of heterogeneous models and extending techniques from classification tasks to generative tasks. Improving proxy model accuracy stands as another potential area of exploration, which could further lower compute requirements while maintaining precision.

In summary, the paper presents a methodical enhancement to the classic model cascade approach, offering a comprehensive strategy to further the efficiency and accuracy of NLP systems through intelligent confidence evaluation and deferral decisions.


Authors (2)
