Adaptive Inference-Time Compute: LLMs Can Predict if They Can Do Better, Even Mid-Generation (2410.02725v1)

Published 3 Oct 2024 in cs.CL, cs.AI, and cs.LG

Abstract: Inference-time computation is a powerful paradigm to enhance the performance of LLMs, with Best-of-N sampling being a widely used technique. However, this method is computationally expensive, requiring both (1) an external reward model and (2) the generation of multiple samples. In this work, we introduce a new generative self-evaluation scheme designed to adaptively reduce the number of generated samples while maintaining or even improving performance. We use a generative reward model formulation, allowing the LLM to predict mid-generation the probability that restarting the generation will yield a better response. These predictions are obtained without an external reward model and can be used to decide whether or not to generate more samples, prune unpromising samples early on, or to pick the best sample. This capability is very inexpensive as it involves generating a single predefined token. Trained using a dataset constructed with real unfiltered LMSYS user prompts, Llama 3.1 8B's win rate against GPT-4 on AlpacaEval increases from 21% to 34% with 16 samples and math performance on GSM8K improves from 84% to 91%. By sampling only when the LLM determines that it is beneficial to do so and adaptively adjusting temperature annealing, we demonstrate that 74% of the improvement from using 16 samples can be achieved with only 1.2 samples on average. We further demonstrate that 50-75% of samples can be pruned early in generation with minimal degradation in performance. Overall, our methods enable more efficient and scalable compute utilization during inference for LLMs.

Authors (3)
  1. Rohin Manvi (3 papers)
  2. Anikait Singh (17 papers)
  3. Stefano Ermon (279 papers)
Citations (2)

Summary

Adaptive Inference-Time Compute in LLMs

The paper "Adaptive Inference-Time Compute: LLMs Can Predict if They Can Do Better, Even Mid-Generation" introduces innovative strategies to enhance the computational efficiency of LLMs during inference. Focusing on Best-of-N sampling, the work identifies computational cost reductions and performance improvements through novel self-evaluation techniques. These techniques enable models to dynamically adjust their computational resources based on task complexity and internal evaluations.

Core Contributions

The paper proposes a generative self-evaluation framework for LLMs, allowing models to predict the probability that generating additional samples would yield better responses. This prediction circumvents the need for external reward models, traditionally used in Best-of-N sampling, by leveraging the model's inherent capabilities to perform self-evaluation tasks using token prediction.

A key innovation is capability-aware self-evaluation, which allows a model to estimate the likelihood that its performance can be improved by generating additional outputs. This approach is further refined with mid-generation self-evaluation, enabling unpromising samples to be pruned early in their generation and thereby conserving computational resources.
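As a rough formalization (the notation below is ours, not taken verbatim from the paper), the self-evaluation can be viewed as the model placing probability mass on a single predefined token, given the prompt, a possibly partial response, and a fixed evaluation query:

$$
p_{\text{restart}} = P_\theta\big(\texttt{Yes} \mid x,\ y_{1:t},\ q_{\text{eval}}\big)
$$

where $x$ is the user prompt, $y_{1:t}$ is the response generated so far, and $q_{\text{eval}}$ is the appended self-evaluation query. A high $p_{\text{restart}}$ suggests a fresh sample is likely to be better (so resample, or prune the current sample mid-generation); a low value suggests the current response should be kept.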

Methodology and Technical Insights

The process involves appending a predefined self-evaluation prompt to a partially or fully generated response and reading the likelihood of a single predefined token, which encodes the probability that resampling would yield a better result. This capability is instilled by fine-tuning the model on a dataset of on-policy preferences and ties constructed from real, unfiltered LMSYS user prompts, so that the token likelihood indicates whether a response should be retained or discarded.
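A minimal sketch of how such a probability could be read from next-token logits is shown below. The prompt wording, the choice of "Yes"/"No" tokens, and the Hugging Face Transformers setup are illustrative assumptions, not the paper's exact recipe.

```python
# Sketch: score a (partial) response by the probability the model assigns to
# a restart producing a better response. Prompt wording, token choice, and
# model/tokenizer setup are assumptions for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-3.1-8B-Instruct"  # model family evaluated in the paper
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)

EVAL_QUERY = "\nWould restarting generation likely produce a better response? Answer Yes or No:"
YES_ID = tokenizer.encode(" Yes", add_special_tokens=False)[0]
NO_ID = tokenizer.encode(" No", add_special_tokens=False)[0]

def restart_probability(prompt: str, response_so_far: str) -> float:
    """Probability mass on 'Yes' vs. 'No' after appending the self-evaluation query."""
    text = prompt + response_so_far + EVAL_QUERY
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]  # logits at the final position
    yes_no = torch.softmax(next_token_logits[[YES_ID, NO_ID]], dim=-1)
    return yes_no[0].item()  # high value: a fresh sample is likely to be better
```

Because the score comes from a single forward pass over one appended token, it adds negligible cost relative to generating another full sample.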

Two main techniques are proposed (a combined sketch follows the list):

  1. Adaptive Sampling and Annealing: LLMs utilize adaptive sampling, resampling only when beneficial, informed by self-evaluation probabilities. To mitigate latency, an exponentially increasing parallel sampling approach is suggested, accompanied by a temperature annealing schedule that balances exploitation and exploration.
  2. Early Pruning: This technique allows a reduction in computational costs by ceasing the generation of unpromising samples, identified through mid-generation self-evaluations. As such, only samples with high potential are fully generated, significantly reducing the average computational cost per query.
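The sketch below combines both ideas under stated assumptions: `generate_samples` is a hypothetical helper that generates n samples in parallel and abandons any partial sample flagged by `prune_fn`; the thresholds and the upward temperature schedule are illustrative choices; and `restart_probability` is the scoring function sketched above.

```python
# Sketch of adaptive Best-of-N with exponentially growing batches, temperature
# annealing, and early pruning. Thresholds, the annealing schedule, and the
# helper `generate_samples` are assumptions, not the paper's exact settings.
def adaptive_best_of_n(prompt, max_samples=16, restart_threshold=0.3,
                       prune_threshold=0.8, base_temperature=0.7):
    samples, batch_size, round_idx = [], 1, 0
    best = None
    while len(samples) < max_samples:
        temperature = base_temperature * (1.0 + 0.3 * round_idx)  # explore more in later rounds
        batch = generate_samples(
            prompt, n=batch_size, temperature=temperature,
            # Early pruning: stop decoding partial samples the model judges unpromising.
            prune_fn=lambda partial: restart_probability(prompt, partial) > prune_threshold,
        )
        samples.extend(batch)
        if samples:
            # The sample with the lowest restart probability is the one the model is most satisfied with.
            best = min(samples, key=lambda s: restart_probability(prompt, s))
            if restart_probability(prompt, best) < restart_threshold:
                break  # the model predicts a restart is unlikely to do better; stop sampling
        batch_size *= 2  # 1, 2, 4, 8, ... parallel samples per round keeps latency bounded
        round_idx += 1
    return best
```

Because most prompts terminate after the first cheap, low-temperature sample, the average number of samples stays close to one, which is consistent with the paper's observation that roughly 74% of the Best-of-16 gain is recovered with about 1.2 samples on average.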

Experimental Evaluation

The paper evaluates these methods on the AlpacaEval and GSM8K datasets, highlighting significant efficiency gains. Notable improvements are seen with the Llama 3.1 8B model, which, using self-evaluation with 16 samples, increased its win rate against GPT-4 from 21% to 34%, while its accuracy on GSM8K improved from 84% to 91%. Importantly, adaptive techniques showed that 74% of these improvements could be achieved using an average of only 1.2 samples.

Implications and Future Directions

The broader implication of this work is that adaptive compute methods can make LLMs more efficient by aligning computational cost with task demands. Practically, these techniques make the deployment of LLMs more scalable across varied applications, reducing resource use without sacrificing performance.

Looking ahead, further exploration is warranted into mitigating the latency introduced by adaptive sampling and investigating broader applications of mid-generation evaluations. Additionally, extending these frameworks to other inference techniques, such as beam search, could improve efficiency across diverse tasks and further extend the models' practical usability in real-world scenarios.

In conclusion, the paper presents a significant advance in optimizing LLM inference, introducing adaptive compute strategies that promise both performance gains and computational savings, thereby holding valuable implications for future AI developments.
