Large-scale language workflows increasingly rely on “compound inference systems” that call an LLM many times and aggregate the outputs (e.g., CoT@32 in Gemini). Practitioners often assume that more calls monotonically improve accuracy, but this paper shows the assumption is false, characterises when the degradation happens, and provides tooling to choose the right number of calls without exhaustive grid-search.
1 Setting
- System studied: a one-layer Voting Inference Network (VIN):
```
for query x:
    z1 … zK ← LLM(x; θ1 … θK)   # θk ∼ Θ (prompt variants, temperature, or even different models)
    ŷ ← mode(z1 … zK)           # majority vote
```
- Key variable: the number of calls K (ensemble size).
- Difficulty model: for each query x, an individual LLM call is correct with probability r(x).
- “Easy” item: r(x) > 0.5; “hard” item: r(x) < 0.5.
- Experiments (and most theory) use a two-level mixture: Pr[easy] = α, r_easy = p₁, r_hard = p₂ (p₁ > p₂). A simulation sketch of this setup follows below.
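To make the setup concrete, here is a minimal Monte Carlo sketch (my own illustration, not code from the paper); the mixture parameters α = 0.6, p₁ = 0.7, p₂ = 0.45 are assumed purely for demonstration:

```python
# Minimal simulation of the VIN under the two-level difficulty mixture.
# The parameters are illustrative assumptions, not values from the paper.
import numpy as np

rng = np.random.default_rng(0)

def vin_accuracy(K, alpha=0.6, p1=0.7, p2=0.45, n_queries=50_000):
    """Fraction of queries a K-call majority vote answers correctly."""
    # Each query is easy (per-call success p1) with prob. alpha, else hard (p2).
    r = np.where(rng.random(n_queries) < alpha, p1, p2)
    # K i.i.d. calls per query; the vote is correct iff >K/2 calls are correct.
    correct_calls = rng.binomial(K, r)
    return (correct_calls > K / 2).mean()

for K in [1, 3, 5, 9, 17, 33, 65, 129]:
    print(f"K={K:3d}  accuracy={vin_accuracy(K):.3f}")  # rises, peaks, then falls
```

With these assumed parameters, accuracy climbs from 0.60 at K=1 to roughly 0.71 around K≈21 before sliding back toward α = 0.6: the inverse-U of the next section.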
2 Empirical Observation
Across GPT-3.5 experiments on MMLU, accuracy rises for small K, peaks, then drops as K grows (Fig. 1 in paper). The same “inverse-U” appears in synthetic data and other real tasks (business ethics, chemistry).
3 Theory: Why More Calls Can Hurt
Using the regularized incomplete beta function I_x(a, b), the authors derive a closed-form accuracy for odd K (odd K rules out tied votes):
F(K) = α I_{p₁}((K+1)/2,(K+1)/2) + (1−α) I_{p₂}((K+1)/2,(K+1)/2) (1)
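Equation (1) can be evaluated directly; a short sketch using SciPy's betainc (which computes I_x(a, b)), with the same illustrative parameters as above:

```python
# Exact evaluation of equation (1); parameters are illustrative assumptions.
from scipy.special import betainc

def F(K, alpha=0.6, p1=0.7, p2=0.45):
    """Exact majority-vote accuracy for odd K under the two-level mixture."""
    m = (K + 1) / 2
    # I_p(m, m) = P(Binomial(K, p) >= m), i.e. P(majority of K calls correct).
    return alpha * betainc(m, m, p1) + (1 - alpha) * betainc(m, m, p2)

for K in [1, 3, 5, 21, 101, 1001]:
    print(f"K={K:4d}  F(K)={F(K):.4f}")  # inverse-U: peak near K=21 here
```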
From this they prove:
Theorem 1: Let
t = p₂(1−p₂)(½−p₂) / [ p₁(1−p₁)(p₁−½) ] + 1.
If p₂ < ½ < p₁ (mixture of easy & hard):
- If p₁ + p₂ < 1 and α ≥ 1 − 1/t ⇒ F(K) increases monotonically.
- If p₁ + p₂ > 1 and α ≤ 1 − 1/t ⇒ F(K) decreases monotonically.
- If p₁ + p₂ > 1 but α > 1 − 1/t ⇒ F(K) goes up then down (“inverse-U”).
- If p₁ + p₂ < 1 but α < 1 − 1/t ⇒ F(K) goes down then up (“U”).
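The case analysis is easy to mechanise; the helper below is my own executable restatement of the conditions above (vote_regime is an assumed name, not code from the paper):

```python
# Executable restatement of Theorem 1's case analysis; assumes p2 < 0.5 < p1.
def vote_regime(alpha, p1, p2):
    """Qualitative shape of F(K) as the number of calls K grows."""
    t = p2 * (1 - p2) * (0.5 - p2) / (p1 * (1 - p1) * (p1 - 0.5)) + 1
    thresh = 1 - 1 / t          # initial slope of F is positive iff alpha > thresh
    if p1 + p2 > 1:             # hard-item errors decay slower: F eventually falls
        return "inverse-U" if alpha > thresh else "monotone decreasing"
    else:                       # easy-item gains decay slower: F eventually rises
        return "monotone increasing" if alpha >= thresh else "U-shaped"

print(vote_regime(0.6, 0.7, 0.45))  # -> inverse-U (matches the sketches above)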
Interpretation: accuracy on easy items rises with K while accuracy on hard items falls with K. The mixture weight α decides which trend dominates at small K, and the sign of p₁ + p₂ − 1 decides which dominates in the limit.
4 Optimal Ensemble Size
The unique K* that maximises accuracy (for the inverse-U scenario) is:
K* ≈ 2 · log[ (α/(1−α)) · (2p₁−1)/(1−2p₂) ] / log[ p₂(1−p₂) / (p₁(1−p₁)) ] (2)
(rounded to the nearest odd integer). In practice K* can be as small as 3-5 even when the upstream model is weak.
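A sketch of formula (2) in code (the function name k_star and the odd-rounding detail are mine; it is valid only in the inverse-U regime, where both log arguments are positive):

```python
# Approximate optimal ensemble size from equation (2); illustrative only.
import math

def k_star(alpha, p1, p2):
    """Approximate accuracy-maximising number of calls, rounded to odd."""
    num = math.log((alpha / (1 - alpha)) * (2 * p1 - 1) / (1 - 2 * p2))
    den = math.log(p2 * (1 - p2) / (p1 * (1 - p1)))
    k = 2 * num / den
    return max(1, 2 * round((k - 1) / 2) + 1)  # nearest odd integer, at least 1

print(k_star(0.6, 0.7, 0.45))  # -> 21, matching the exact curve's peak above
```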
5 Practical Scaling Law
Because incomplete beta functions are cumbersome to fit directly, the authors propose an approximator:
G(K) = α·g_{p₁}(K; c₁) + (1−α)·g_{p₂}(K; c₂)
with

g_p(K; c) = 1 − exp(−cᵀ[K, √K, 1])  if p > 0.5
g_p(K; c) = exp(−cᵀ[K, √K, 1])      if p < 0.5   (3)
Simple exponential fits capture the empirically rapid convergence/divergence.
Parameter-estimation algorithm (Alg. 1; a code sketch follows below):
1 For each training item, do 1–5 calls, compute majority vote, label item as easy/hard.
2 Fit (3) to those points via least-squares to get c.
3 Average g_p(·) over training items → Ĝ(K).
Only a handful of calls are needed to predict accuracy for K up to 1000.
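A possible end-to-end sketch of Alg. 1, assuming the per-group accuracies at small K have already been measured in step 1 (all numbers below are made-up placeholders, not results from the paper):

```python
# Sketch of Alg. 1: fit equation (3) per difficulty group, then extrapolate.
import numpy as np
from scipy.optimize import curve_fit

def g_easy(K, c0, c1, c2):
    return 1 - np.exp(-(c0 * K + c1 * np.sqrt(K) + c2))  # eq. (3), p > 0.5 branch

def g_hard(K, c0, c1, c2):
    return np.exp(-(c0 * K + c1 * np.sqrt(K) + c2))      # eq. (3), p < 0.5 branch

K_obs = np.array([1, 3, 5, 7, 9])                        # a handful of cheap calls
acc_easy = np.array([0.70, 0.78, 0.84, 0.87, 0.90])      # assumed easy-item accuracies
acc_hard = np.array([0.45, 0.43, 0.41, 0.39, 0.38])      # assumed hard-item accuracies

c_easy, _ = curve_fit(g_easy, K_obs, acc_easy, p0=[0.1, 0.3, 0.8])
c_hard, _ = curve_fit(g_hard, K_obs, acc_hard, p0=[0.02, 0.05, 0.7])

alpha_hat = 0.6                                          # easy share from step 1
K_grid = np.arange(1, 1001, 2)
G = alpha_hat * g_easy(K_grid, *c_easy) + (1 - alpha_hat) * g_hard(K_grid, *c_hard)
print("predicted best K:", K_grid[np.argmax(G)])         # extrapolates far past K=9
```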
6 Experiments
Synthetic: all eight (α, p₁, p₂) settings are predicted within 1e-6–1e-4 MSE; the analytic K* matches brute-force search exactly (Table 1 in paper).
Real LLM: GPT-3.5-turbo on three MMLU subsets:

| Dataset | Easy share | Empirical best-K | Predicted best-K | Peak gain vs K=1 |
|---|---|---|---|---|
| College Math | 42 % | 5 | 5 | +8.2 pp |
| Business Ethics | 36 % | 3 | 4 | +6.1 pp |
| College Chemistry | 62 % | 1000 (monotone) | monotone | +13.4 pp |
The fitted scaling law tracks the full accuracy curve (blue vs orange dots in Fig. 4) and identifies the same optimum without measuring every K.
7 Difficulty Prediction Study
GPT-4 is prompted zero-shot: “Given the question and the VIN prediction, is this query difficult for the VIN?” Accuracy: 66–72 % (GPT-3.5 performs near the dummy baseline). Such hardness detectors would enable adaptive-K policies, which remain an open research problem; a rough sketch follows.
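As one illustration of what such a policy could look like (the callables predict_is_hard and llm_call are hypothetical stand-ins, e.g. for a GPT-4 difficulty prompt and a model call; none of this is an API from the paper):

```python
# Hedged sketch of an adaptive-K policy built on a difficulty predictor.
from collections import Counter

def adaptive_vote(query, llm_call, predict_is_hard, k_easy=21, k_hard=1):
    """Spend many calls only where majority voting is expected to help."""
    # On hard items (per-call success < 0.5) extra votes amplify errors, so K=1.
    k = k_hard if predict_is_hard(query) else k_easy
    answers = [llm_call(query) for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]  # majority vote
```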
8 Key Takeaways for Practitioners
1 More calls can decrease accuracy when a substantial fraction of inputs are “hard” (single-call success rate < 0.5).
2 Before committing to large CoT@32-style ensembles, sample a few K values and fit the scaling law; this often points to a much smaller optimal K, saving latency and API cost.
3 Dynamic policies (estimate difficulty → choose K) have untapped potential; GPT-4 already provides usable difficulty signals.
4 The analysis currently covers majority-vote ensembles; other aggregation schemes (reranking, self-consistency with LLM judges, AlphaCode-style test-case filtering) require separate scaling laws—an open direction highlighted by the authors.
9 Limitations & Future Work
- Theory uses a two-difficulty mixture and binary answer space; extension to continuous difficulty and large answer sets is sketched but not fully proved.
- Assumes i.i.d. calls; correlated generation strategies may change behaviour.
- Difficulty prediction is still noisy; better predictors or proxy metrics (e.g., log-prob gap) could improve adaptive systems.
- Scaling laws for multi-layer or heterogeneous compound systems (LLM + search, code execution, etc.) remain unexplored.
The paper provides both a cautionary tale—“bigger ensembles are not always better”—and a practical recipe to size LLM ensembles systematically instead of by trial-and-error.