Blend-ASC: Adaptive Sampling for LLM Self-Consistency

Updated 22 November 2025

Blend-ASC is a dynamic-sample-allocation algorithm that optimizes test-time self-consistency inference by adaptively assigning samples per question.
It employs a blended ranking mechanism combining Beta-posterior estimates from adaptive SC and PPR-1v1 to effectively distribute a fixed prompt budget.
Empirical evaluations on benchmarks like GSM8K and MATH show Blend-ASC achieves up to 6.9× sample efficiency compared to traditional methods.

Blend-ASC is a dynamic-sample-allocation algorithm for test-time self-consistency (SC) inference with LLMs, introduced to address the sample efficiency and budget allocation limitations inherent to classical self-consistency methods. SC itself refers to the inference protocol in which multiple chain-of-thought (CoT) responses are sampled for each question, with the most frequent answer taken as the output. Blend-ASC enables practitioners to fit the inference to any arbitrary sample budget by adaptively allocating samples across questions, thereby minimizing overall error within the specified computational constraints. Significantly, Blend-ASC is hyperparameter-free and empirically achieves state-of-the-art sample efficiency, requiring an average of 6.8× fewer samples than vanilla SC for matched accuracy on major LLM reasoning benchmarks such as GSM8K and MATH (Feng et al., 15 Nov 2025).

1. Motivation and Problem Formulation

Traditional self-consistency operates by a fixed allocation: each question $q$ receives $n$ independent CoT samples from the LLM, with the modal answer returned. The empirical mode $r_{\text{SC}} = \arg\max_{r} \hat{\mu}_n(r)$ (where $\hat{\mu}_n$ denotes empirical frequency) serves as the prediction. This fixed-per-question budget is known to be wasteful: "easy" questions—those with a large margin $m = (\sqrt{p_1} - \sqrt{p_2})^2$ between the top two answer probabilities—attain near-zero error with few samples, while "hard" small-margin questions dominate aggregate sampling requirements. Realistic deployments, however, impose a strict total inference sample budget $T$ (i.e., total number of LLM prompts), which classical SC cannot flexibly distribute. Blend-ASC directly addresses the objective of minimizing aggregate dataset-level error under a constraint $\sum_{i} x_i = T$ , adaptively allocating samples $x_i$ per question $q_i$ (Feng et al., 15 Nov 2025).

2. Theoretical Foundations and Scaling Laws

The self-consistency protocol is mathematically framed as empirical mode estimation. For question $q$ whose answer distribution $\mu(r \mid q)$ is mode-aligned (i.e., true answer $r^*$ is the mode), the error rate after $x$ samples satisfies:

$\text{Err}(x, q) \leq \exp(-x \cdot m),$

where $p_1$ and $p_2$ are the top two probabilities, and $m = (\sqrt{p_1} - \sqrt{p_2})^2$ is the margin. This exponential decay of error with respect to samples is empirically verified across model families and datasets: the margin $m$ accurately predicts the sample complexity for confidence in majority-vote (Feng et al., 15 Nov 2025). When evaluating over a dataset with varying $m$ , the aggregate error follows a power-law scaling:

$\text{Err}(x, \mathcal{D}) \propto x^{-1/2}$

for common distributions of margin values. Fixed-allocation SC thus exhibits sublinear convergence at the dataset level. The analysis further demonstrates that adaptive strategies, if granted access to (or estimation of) per-question margin, can surpass the $x^{-1/2}$ scaling law and attain linear ($1/x$) convergence rates, a theoretical optimum for sample use (Feng et al., 15 Nov 2025).

3. Algorithmic Structure of Blend-ASC

Blend-ASC integrates two strands of confidence estimation—adaptive SC (ASC) and sequential mode estimation (PPR-1v1)—to arrive at a sample allocation strategy that blends their strengths across the low- and high-sample regimes, while operating under exact adherence to user-specified prompt budgets. At each incremental sample allocation $t \in \{1, ..., T\}$ , Blend-ASC performs:

For each question $q_i$ , maintain sample counts $x_i$ and counts $\{n_1, n_2\}_i$ for the two most frequent answers.
Compute two statistics:
- $R_1(q_i) = \int_0^{1/2} \text{Beta}(u; n_1+1, n_2+1)\, du$ (adaptive SC Beta-posterior probability that mode is incorrect)
- $R_2(q_i) = (K-1) \cdot \text{Beta}(u=1/2; n_1+1, n_2+1)$ (PPR-1v1 mode confidence; $K$ is number of classes seen)
For each $q_i$ , compute blended priority $\alpha_t^i = (1-t/T)\cdot \text{rank}_1^i + (t/T)\cdot \text{rank}_2^i$ , where each $\text{rank}_*^i$ is the rank of $R_*^i$ across $i$ .
Select question $q^*$ with minimal priority $\alpha_t^i$ , draw an additional sample for $q^*$ .
Enforce a hard upper limit: if any $x_i$ exceeds $16$ times the average per-question samples, halt allocation for that $q_i$ .

No tuning parameters are required: Blend-ASC is fully hyperparameter-free. Sampling proceeds until the global budget $T$ is exactly exhausted, at which point each $r_{\text{SC}}^i$ over $q_i$ is determined by the majority vote of its drawn samples (Feng et al., 15 Nov 2025).

4. Key Formulas and Confidence Estimators

Blend-ASC relies on principled probabilistic estimates:

Beta-posterior probability of incorrect mode:

$R_1 = \int_0^{1/2} \mathrm{Beta}(u; n_1+1, n_2+1) du,$

which estimates the confidence that the empirical mode is spurious, used for prioritizing "hard" questions early.

PPR-1v1 stopping proxy:

$R_2 = (K-1)\cdot \mathrm{Beta}(u=1/2; n_1+1, n_2+1).$

This martingale confidence bound ensures sample-optimality in the high-sample asymptote, acting as a virtual p-value for stopping.

A transition/blending mechanism interpolates between these as sampling proceeds, ensuring robust allocation both when little information is known (short runs) and in the regime where enough data secure high-confidence stopping.

5. Sample Efficiency and Empirical Performance

Extensive experiments across reasoning benchmarks (GSM8K, MATH, GPQA-Diamond) and LLMs (LLaMA-3B, Qwen-Math) show that Blend-ASC consistently achieves lower error at the same prompt count relative to SC and all existing adaptive/fixed-allocation baselines, irrespective of temperature (0.6–1.0). Average prompt savings to match fixed- $n$ SC are as follows (Feng et al., 15 Nov 2025):

Setting	Fixed-Alloc	Adaptive SC	Blend-ASC
SC@64	4.6×	5.9×	6.8×
SC@128	4.6×	5.0×	6.9×

On six model-dataset combinations, Blend-ASC error curves are below all comparators, with the blended allocation closely matching near-theoretical optimality for a broad spectrum of task hardness (Feng et al., 15 Nov 2025). The protocol operates without requiring threshold selection, window size tuning, or temperature reconfiguration—parameters often problematic in production.

6. Implementation, Integration, and Limitations

Blend-ASC is designed for straightforward integration: it requires only tracking answer counts per question and updating Beta-based statistics; sample allocation can be efficiently managed via a heap or priority queue. The practitioner specifies the overall sample budget $T$ ; no per-question or global hyperparameters are set. Empirical stopping can utilize blended priorities to terminate early if a predetermined confidence is attained per question.

While the 16×-oversampling cutoff for any given question is a heuristic rather than a theoretically minimal bound, and PPR-1v1 can be pessimistic at very low sample counts, these practical aspects are minor in standard use cases. Blend-ASC is applicable without modification to any SC-compatible domain.

Blend-ASC generalizes and improves on several prior self-consistency allocation strategies. Classical SC and fixed-alloc schemes fail to exploit inter-question heterogeneity in answer margin. Adaptive SC [Aggarwal et al.] prioritizes uncertain questions, improving low-sample regimes but does not achieve optimal scaling asymptotically. The PPR-1v1 martingale approach achieves theoretical optimality at higher sample counts, but can underperform early, and requires parameterization. Blend-ASC uniquely covers both short- and long-sample domains with a unified, nonparametric mechanism and is distinguished by its provable error scaling and empirical efficiency. All theoretical rates and design choices are validated empirically across multiple LLMs and real-world benchmarks (Feng et al., 15 Nov 2025).

PDF Markdown Chat (Pro)

References (1)

Optimal Self-Consistency for Efficient Reasoning with Large Language Models (2025)

Follow Topic

Get notified by email when new papers are published related to Blend-ASC.