Inverse-Entropy Weighted Voting for LLMs
- Inverse-entropy weighted voting is a method that quantifies token-level uncertainty using average Shannon entropy to weight and aggregate LLM reasoning chains.
- The aggregation rule assigns chains weights inversely proportional to their entropy, enabling statistically sound answer selection over standard majority voting.
- Empirical evaluations demonstrate consistent accuracy gains across diverse models and benchmarks, with minimal computational overhead in both parallel and sequential settings.
Inverse-entropy weighted voting (IEW) is a training-free aggregation method for reasoning chains produced by LLMs that leverages token-level uncertainty to improve the selection of final answers. By assigning greater influence to chains with low internal entropy, IEW provides a statistically grounded mechanism for integrating multiple chain-of-thought (CoT) outputs, outperforming standard majority voting ("self-consistency") in both parallel and sequential test-time inference schemes. IEW operates without the need for additional model queries or tuning, and offers consistent empirical gains across diverse open source models and reasoning benchmarks.
1. Formal Definition of Chain Entropy
IEW quantifies uncertainty in each LLM reasoning chain using the average Shannon entropy over the chain’s token-level probability distributions. Specifically, for a set of $N$ generated chains $\{c_1, \dots, c_N\}$, each chain $c_i$ generates a sequence of tokens, with per-token probability vectors $p_{i,t} \in \mathbb{R}^{K}$, where $K$ is the vocabulary size considered (typically the top-$K$ tokens) and $T_i$ is the chain length. The average entropy for chain $c_i$ is

$$H_i = -\frac{1}{T_i} \sum_{t=1}^{T_i} \sum_{v=1}^{K} p_{i,t}(v) \log p_{i,t}(v),$$

where $p_{i,t}(v)$ is the normalized probability of token $v$ at position $t$.
The entropy concretely measures the “spread” of the model’s belief at each step: low entropy implies peaked, confident distributions, while high entropy indicates uncertainty or ambiguity during the next-token prediction process.
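A minimal sketch of this computation is shown below; it assumes each chain is available as a list of per-position top-$K$ probability vectors, and the function name `chain_entropy` is illustrative rather than taken from the source.

```python
import math
from typing import List

def chain_entropy(token_probs: List[List[float]]) -> float:
    """Average Shannon entropy of one chain.

    token_probs[t] holds the normalized probabilities of the top-K candidate
    tokens at position t of the chain.
    """
    if not token_probs:
        return float("inf")  # an empty chain carries no confidence signal
    total = 0.0
    for probs in token_probs:
        # Per-position entropy: -sum(p * log p), skipping zero-probability entries.
        total += -sum(p * math.log(p) for p in probs if p > 0.0)
    return total / len(token_probs)
```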
2. Inverse-Entropy Weighted Aggregation Rule
IEW assigns each chain a weight inversely proportional to its entropy:

$$w_i = \frac{1}{H_i + \varepsilon},$$

where $\varepsilon$ is a small constant to avoid division by zero.

To derive a probability distribution over the chains' contributions, weights are normalized:

$$\tilde{w}_i = \frac{w_i}{\sum_{j=1}^{N} w_j}.$$

Aggregated answer selection is performed by summing normalized weights for chains producing each answer $a$:

$$S(a) = \sum_{i \,:\, a_i = a} \tilde{w}_i,$$

and returning

$$\hat{a} = \arg\max_{a \in \mathcal{A}} S(a),$$

where $a_i$ is the answer produced by chain $c_i$, and $\mathcal{A}$ is the set of all possible answers.
These steps yield a confidence-weighted vote in which more self-consistent (low-entropy) reasoning trajectories have a larger impact on the ensemble prediction.
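A compact sketch of this aggregation rule follows, assuming one entropy value per chain (for example from a helper like `chain_entropy` above); the function name `iew_vote` is illustrative.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def iew_vote(entropies: List[float], answers: List[str],
             eps: float = 1e-6) -> Tuple[str, Dict[str, float]]:
    """Inverse-entropy weighted voting over a set of reasoning chains."""
    # w_i = 1 / (H_i + eps): low-entropy chains receive large weights.
    weights = [1.0 / (h + eps) for h in entropies]
    total = sum(weights)
    # Normalize so the weights form a probability distribution over chains.
    norm_weights = [w / total for w in weights]
    # S(a): sum of normalized weights of the chains that produced answer a.
    scores: Dict[str, float] = defaultdict(float)
    for answer, w in zip(answers, norm_weights):
        scores[answer] += w
    # Return the answer with the highest weighted score.
    best = max(scores, key=scores.get)
    return best, dict(scores)
```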
3. Comparison to Majority Voting: Algorithmic Protocol
IEW can be contrasted with unweighted majority voting (self-consistency) using the following protocols:
| Step | Inverse-Entropy Weighted Voting (IEW) | Standard Majority Voting (Self-Consistency) |
|---|---|---|
| 1 | For each chain $c_i$, extract token probabilities, compute entropy $H_i$, and assign weight $w_i = 1/(H_i + \varepsilon)$. | For each chain, extract answer $a_i$. |
| 2 | Normalize weights: $\tilde{w}_i = w_i / \sum_j w_j$. | Count the number of votes for each answer $a$. |
| 3 | For each distinct answer $a$, sum $\tilde{w}_i$ for chains $i$ with $a_i = a$. | Select the answer with the most votes: $\hat{a} = \arg\max_a \sum_i \mathbf{1}[a_i = a]$. |
| 4 | Return $\hat{a}$, the answer maximizing the weighted sum $S(a)$. | Return the majority answer. |
IEW thus considers not only the frequency but also the confidence embedded in the token prediction process.
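For comparison, the unweighted baseline reduces to a frequency count; the sketch below (building on the `iew_vote` sketch above) contrasts the two on a toy set of chains with illustrative entropy values.

```python
from collections import Counter
from typing import List

def majority_vote(answers: List[str]) -> str:
    """Unweighted self-consistency: return the most frequent answer."""
    return Counter(answers).most_common(1)[0][0]

# Toy comparison: two fairly confident chains answer "A", one uncertain chain answers "B".
entropies = [0.4, 1.5, 0.5]
answers = ["A", "B", "A"]
print(majority_vote(answers))            # -> "A"
print(iew_vote(entropies, answers)[0])   # -> "A", here agreeing with the majority
```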
4. Theoretical Motivation
IEW's weighting rationale is rooted in several principles:
- Confidence Quantification: A chain with low entropy is more likely to encode a coherent, certain chain of reasoning, as the model concentrates most of its probability mass on a single next-token candidate at each step.
- Noise Attenuation: High-entropy chains reflect uncertainty or indecision, so their down-weighting acts as a filter against error-prone or misguided reasoning sequences.
- Information-Theoretic Justification: As Shannon entropy is a fundamental quantitative measure of uncertainty, its inverse serves as a natural proxy for confidence in probabilistic outputs.
This mechanism does not require additional model calls or training, and leverages intrinsic properties of the LLM’s output distributions.
5. Empirical Performance and Benchmarks
IEW exhibits consistent improvements over self-consistency across both parallel and sequential reasoning paradigms. In parallel self-consistency (6 chains), IEW achieved gains of 0.5–3.4% over majority voting across diverse open-source LLMs and benchmarks. In the sequential refinement regime, IEW was best in 29 out of 30 configurations (97%) evaluated. Maximum observed gains reach 6.7% in some settings.
Example Benchmark Results
| Configuration | Majority | Entropy-Weighted |
|---|---|---|
| GPT-OSS-20B, AIME | 50.0% | 53.3% |
| GPT-OSS-120B, AIME | 53.3% | 56.7% |
| Qwen3-235B, GPQA | 67.7% | 68.2% |
| Kimi-K2, GPQA-Diamond (sequential) | 73.7% | 74.8% |
These results indicate that entropy-based voting delivers consistent, if sometimes modest, improvements in test-time accuracy under matched compute budgets.
6. Computational Properties and Implementation Guidance
IEW incurs negligible additional computational cost compared to standard majority voting. The primary overhead is the calculation of $H_i$ for each chain, which is $O(T_i \cdot K)$, with $K$ denoting the number of top tokens considered for entropy estimation (a modest $K$ suffices for stable results). Since chain lengths are typically bounded and entropy is computed post-hoc from already available outputs, the extra wall-time cost is typically on the order of milliseconds, even in sequential settings where the overall latency is dominated by chain generation itself.
IEW requires only basic arithmetic operations and does not necessitate any further model evaluation or gradient-based optimization.
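To make the post-hoc nature of the cost concrete, the sketch below computes a chain's entropy directly from stored top-$K$ log-probabilities, renormalizing the truncated mass at each position; the per-chain cost is $O(T_i \cdot K)$, and the data layout is an assumption rather than a specific API schema.

```python
import math
from typing import List

def entropy_from_top_logprobs(top_logprobs_per_token: List[List[float]]) -> float:
    """Average entropy of one chain from its stored top-K log-probabilities.

    top_logprobs_per_token[t] holds the log-probabilities of the top-K
    candidate tokens at position t, as returned alongside the generation.
    """
    if not top_logprobs_per_token:
        return float("inf")
    total = 0.0
    for logprobs in top_logprobs_per_token:
        probs = [math.exp(lp) for lp in logprobs]
        z = sum(probs)
        probs = [p / z for p in probs]  # renormalize the truncated top-K mass
        total += -sum(p * math.log(p) for p in probs if p > 0.0)
    return total / len(top_logprobs_per_token)
```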
7. Application Example and Practical Recommendations
A worked example demonstrates the voting mechanism. Consider three chains with entropies $H_1 = 0.30$, $H_2 = 0.625$, and $H_3 = 0.20$, and corresponding answers A, B, and A.

Weights are assigned (with $\varepsilon$ negligibly small):

$$w_1 = \frac{1}{0.30} \approx 3.333, \quad w_2 = \frac{1}{0.625} = 1.600, \quad w_3 = \frac{1}{0.20} = 5.000.$$

Normalizing, the weights become:

$$\tilde{w}_1 \approx 0.336, \quad \tilde{w}_2 \approx 0.161, \quad \tilde{w}_3 \approx 0.503.$$

The total for answer A is $0.336 + 0.503 = 0.839$; for B, $0.161$. The aggregated answer is A, consistent with the majority vote, but IEW would allow a low-entropy minority answer to overturn the result if warranted by confidence considerations.
Recommended best practices include requesting a top_logprobs value during generation large enough for stable entropy estimates, employing a small positive $\varepsilon$ to avoid division errors, and setting the chain count to six under matched compute constraints, as this has proven empirically effective. If a chain fails to return log-probabilities, it should be excluded from the weighted vote; if fewer than two valid chains remain, defaulting to majority voting is advisable. A guarded voting routine along these lines is sketched below.
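The sketch reuses the `iew_vote` and `majority_vote` helpers defined earlier; the chain record format (an "answer" field plus an optional precomputed "entropy") is an assumption for illustration.

```python
from typing import Dict, List

def robust_vote(chains: List[Dict]) -> str:
    """Vote over chains, falling back to majority voting when log-probabilities are missing.

    Each chain record is assumed to hold an "answer" and, when log-probabilities
    were returned, a precomputed "entropy" value.
    """
    valid = [c for c in chains if c.get("entropy") is not None]
    if len(valid) < 2:
        # Too few chains with usable log-probabilities: fall back to majority voting.
        return majority_vote([c["answer"] for c in chains])
    answer, _scores = iew_vote(
        [c["entropy"] for c in valid],
        [c["answer"] for c in valid],
    )
    return answer
```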
---
Inverse-entropy weighted voting augments CoT ensemble prediction by weighting answers according to internal model uncertainty, producing consistent performance improvements with negligible computational overhead and without additional model queries. Its theoretical foundation and empirical success suggest it is a robust aggregation mechanism for LLM reasoning outputs at test time, especially within sequential refinement paradigms.