
When Life Gives You Samples: The Benefits of Scaling up Inference Compute for Multilingual LLMs (2506.20544v1)

Published 25 Jun 2025 in cs.CL and cs.AI

Abstract: Recent advancements in LLMs have shifted focus toward scaling inference-time compute, improving performance without retraining the model. A common approach is to sample multiple outputs in parallel, and select one of these as the final output. However, work to date has focused on English and a handful of domains such as math and code. In contrast, we are most interested in techniques that generalize across open-ended tasks, formally verifiable tasks, and across languages. In this work, we study how to robustly scale inference-time compute for open-ended generative tasks in a multilingual, multi-task setting. Our findings show that both sampling strategy based on temperature variation and selection strategy must be adapted to account for diverse domains and varied language settings. We evaluate existing selection methods, revealing that strategies effective in English often fail to generalize across languages. We propose novel sampling and selection strategies specifically adapted for multilingual and multi-task inference scenarios, and show they yield notable gains across languages and tasks. In particular, our combined sampling and selection methods lead to an average +6.8 jump in win-rates for our 8B models on m-ArenaHard-v2.0 prompts, against proprietary models such as Gemini. At larger scale, Command-A (111B model) equipped with our methods, shows +9.0 improvement in win-rates on the same benchmark with just five samples against single-sample decoding, a substantial increase at minimal cost. Our results underscore the need for language- and task-aware approaches to inference-time compute, aiming to democratize performance improvements in underrepresented languages.

Summary

  • The paper demonstrates that increasing inference compute through hedged sampling and advanced selection strategies significantly boosts multilingual LLM performance.
  • It introduces novel selection methods such as CHOPS and X-MBR that outperform traditional Best-of-N approaches, improving win-rates in diverse tasks.
  • Experimental results show that inference-time compute scaling democratizes AI performance across underrepresented languages without necessitating costly retraining.

"When Life Gives You Samples": A Summary of Multilingual LLM Inference Scaling

Introduction

The paper "When Life Gives You Samples: The Benefits of Scaling up Inference Compute for Multilingual LLMs" explores the strategies for enhancing performance of LLMs by scaling inference-time compute rather than the usual method of model retraining or increasing model size. The research focuses on multilingual, multi-task scenarios where inference compute is strategically increased through sampling and selection methods, aiming to improve performance across diverse languages and open-ended tasks.

Inference-Time Compute Strategies

Sampling Strategies

The research distinguishes itself by investigating diverse temperature sampling strategies to build a robust pool of candidate outputs. Temperature sampling controls the diversity and quality of model predictions by rescaling the probability distribution over tokens. Figure 1 reveals significant variance in sample quality across languages and tasks at different temperatures: non-English languages tend to show higher variance in sample quality as temperature increases, so the sampling strategy must be tuned carefully to maintain quality.

Figure 1: Quality under single temperature sampling for various tasks and languages, illustrating increased variance with higher temperatures.
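To make the temperature knob concrete, below is a minimal sketch of temperature sampling over a single step's logits. The function name and the NumPy-based setup are illustrative, not taken from the paper.

```python
import numpy as np

def temperature_sample(logits: np.ndarray, temperature: float,
                       rng: np.random.Generator) -> int:
    """Sample a token id after rescaling logits by 1/temperature.

    temperature -> 0 approaches greedy decoding; temperature > 1 flattens
    the distribution, increasing sample diversity (and variance).
    """
    scaled = logits / max(temperature, 1e-8)
    scaled -= scaled.max()  # shift for numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return int(rng.choice(len(probs), p=probs))
```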

To mitigate the risks of high variance, the paper introduces hedged sampling, combining stochastic generation with deterministic greedy outputs. This approach provides a safety net, improving average win-rates for multilingual tasks.
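A minimal sketch of the hedging idea, assuming a hypothetical `generate(prompt, temperature)` helper that stands in for whatever decoding API is in use: one deterministic greedy decode anchors the candidate pool, and the rest of the budget goes to stochastic samples.

```python
def hedged_sample_pool(prompt: str, n: int, temperature: float, generate) -> list[str]:
    """Build a pool of n candidates: one greedy 'anchor' plus (n - 1)
    stochastic samples.

    The greedy output acts as a safety net when high-temperature samples
    degrade, which the paper observes is more common in non-English settings.
    """
    pool = [generate(prompt, temperature=0.0)]           # deterministic anchor
    pool += [generate(prompt, temperature=temperature)   # diverse candidates
             for _ in range(n - 1)]
    return pool
```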

Selection Strategies

The selection of high-quality samples from the collected pool is critical. The paper compares traditional methods such as Maximum Likelihood and Best-of-N (BoN) with Minimum Bayes Risk (MBR) selection, and proposes two novel methods:

  • Checklisted One-Pass Selection (CHOPS): Utilizes LLMs to generate a checklist that guides the evaluation of all candidates simultaneously, offering a cost-effective and scalable solution.
  • Cross-lingual MBR (X-MBR): Incorporates cross-lingual samples into the evidence set, leveraging the multilingual capabilities of LLMs to make selection more robust. This method demonstrates significant improvements over single-language evidence baselines (Figure 2); a minimal selection sketch follows this list.

    Figure 2: Comparison of baselines vs RM and LLM Judge on N=5 generations, showing improvement in win-rate on mArenaHard.
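A minimal sketch of MBR-style selection with the cross-lingual extension, under stated assumptions: the `similarity` utility below is a simple token-overlap placeholder, while the paper's actual utility metric may differ. The core X-MBR move is widening the evidence set with samples drawn in other languages.

```python
def x_mbr_select(candidates: list[str], cross_lingual_evidence: list[str]) -> str:
    """Pick the candidate with the highest expected utility (lowest Bayes
    risk) against an evidence set mixing same-language and cross-lingual
    samples."""

    def similarity(a: str, b: str) -> float:
        # Placeholder utility: token-level Jaccard overlap. A real system
        # would use an embedding- or metric-based similarity here.
        ta, tb = set(a.split()), set(b.split())
        return len(ta & tb) / max(len(ta | tb), 1)

    evidence = candidates + cross_lingual_evidence  # X-MBR: widen the evidence set

    def expected_utility(c: str) -> float:
        return sum(similarity(c, e) for e in evidence) / len(evidence)

    return max(candidates, key=expected_utility)
```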

Experimental Evaluations

The experiments cover a range of benchmarks including open-ended generation, mathematical reasoning, and machine translation. The results confirm substantial performance gains from the proposed sampling and selection methodologies, both in intrinsic comparisons against greedy outputs and in extrinsic comparisons against more powerful models such as Gemini 2.0 Flash (Figure 3).

Figure 3: Overview of the multilingual multi-task experimental scope with annotated new methods.

Performance Improvements and Implications

  1. Hedged Sampling and Min-p Integration: The results highlight the effectiveness of combining token-level truncation (min-p) with hedged sampling, further stabilizing sample quality (a minimal min-p sketch follows this list).
  2. Cross-Lingual Evidence: By adding cross-lingual samples, X-MBR provides a consistent edge, particularly in non-English language settings, demonstrating the strength of multilingual capabilities in LLMs.
  3. Practical Implications: The findings advocate for inference-time strategies that significantly democratize AI performance across underrepresented languages, without the need for costly model training.
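For reference, a minimal sketch of min-p truncation: tokens whose probability falls below a fraction `min_p` of the top token's probability are dropped before sampling. How min-p composes with the hedged pool above is an assumption about the pipeline, not a detail taken from the paper.

```python
import numpy as np

def min_p_sample(logits: np.ndarray, min_p: float, temperature: float,
                 rng: np.random.Generator) -> int:
    """Min-p truncation: drop tokens whose probability falls below
    min_p * p(top token), then renormalize and sample."""
    scaled = logits / max(temperature, 1e-8)
    scaled -= scaled.max()
    probs = np.exp(scaled) / np.exp(scaled).sum()
    keep = probs >= min_p * probs.max()   # dynamic truncation threshold
    probs = np.where(keep, probs, 0.0)    # zero out low-probability tail
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))
```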

Conclusion

This research presents valuable strategies for scaling LLM performance at inference time across multilingual tasks. By introducing novel sampling and selection methods such as CHOPS and X-MBR, the study shows how efficient use of inference compute yields robust generalization across languages and tasks. The implications extend beyond current generative model capabilities, suggesting pathways for future exploration of self-improvement frameworks and broader multilingual applications in AI systems.

The paper encourages continued exploration of inference scaling strategies to further leverage the latent capabilities of multilingual models in diverse and compute-sensitive settings (Figure 4).

Figure 4: Self-improvement with parallel scaling: Command A win rates showing dominance of proposed methods over baselines.
