Large-scale language workflows increasingly rely on “compound inference systems” that call an LLM many times and aggregate the outputs (e.g., CoT@32 in Gemini). Practitioners often assume that more calls monotonically improve accuracy, but this paper shows the assumption is false, characterises when the degradation happens, and provides tooling to choose the right number of calls without exhaustive grid-search.
1 Setting
- System studied: a one-layer Voting Inference Network (VIN):
```
for query x:
    z1 … zK ← LLM(x; θ1 … θK)   # θk ∼ Θ (prompt variants, temperature, or even different models)
    ŷ ← mode(z1 … zK)           # majority vote
```
- Key variable: the number of calls K (ensemble size).
- Difficulty model: for each query x, an individual LLM call is correct with probability r(x).
- “Easy” item: r(x) > 0.5; “hard” item: r(x) < 0.5.
- Experiments (and most theory) use a two-level mixture: Pr[easy] = α, r_easy = p₁, r_hard = p₂ (p₁ > p₂). A simulation sketch of this setup follows below.
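To make the setup concrete, here is a minimal Monte Carlo sketch (my own illustration, not code from the paper); the mixture parameters α = 0.6, p₁ = 0.7, p₂ = 0.45 are assumed purely for demonstration:

```python
# Minimal simulation of the VIN under the two-level difficulty mixture.
# The parameters are illustrative assumptions, not values from the paper.
import numpy as np

rng = np.random.default_rng(0)

def vin_accuracy(K, alpha=0.6, p1=0.7, p2=0.45, n_queries=50_000):
    """Fraction of queries a K-call majority vote answers correctly."""
    # Each query is easy (per-call success p1) with prob. alpha, else hard (p2).
    r = np.where(rng.random(n_queries) < alpha, p1, p2)
    # K i.i.d. calls per query; the vote is correct iff >K/2 calls are correct.
    correct_calls = rng.binomial(K, r)
    return (correct_calls > K / 2).mean()

for K in [1, 3, 5, 9, 17, 33, 65, 129]:
    print(f"K={K:3d}  accuracy={vin_accuracy(K):.3f}")  # rises, peaks, then falls
```

With these assumed parameters, accuracy climbs from 0.60 at K=1 to roughly 0.71 around K≈21 before sliding back toward α = 0.6: the inverse-U of the next section.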
2 Empirical Observation
Across GPT-3.5 experiments on MMLU, accuracy rises for small K, peaks, then drops as K grows (Fig. 1 in paper). The same “inverse-U” appears in synthetic data and other real tasks (business ethics, chemistry).
3 Theory: Why More Calls Can Hurt
Using the regularized incomplete beta function I_x(a, b), the authors derive a closed-form accuracy for odd K (odd K rules out tied votes):
F(K) = α I_{p₁}((K+1)/2,(K+1)/2) + (1−α) I_{p₂}((K+1)/2,(K+1)/2) (1)
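Equation (1) can be evaluated directly; a short sketch using SciPy's betainc (which computes I_x(a, b)), with the same illustrative parameters as above:

```python
# Exact evaluation of equation (1); parameters are illustrative assumptions.
from scipy.special import betainc

def F(K, alpha=0.6, p1=0.7, p2=0.45):
    """Exact majority-vote accuracy for odd K under the two-level mixture."""
    m = (K + 1) / 2
    # I_p(m, m) = P(Binomial(K, p) >= m), i.e. P(majority of K calls correct).
    return alpha * betainc(m, m, p1) + (1 - alpha) * betainc(m, m, p2)

for K in [1, 3, 5, 21, 101, 1001]:
    print(f"K={K:4d}  F(K)={F(K):.4f}")  # inverse-U: peak near K=21 here
```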
From this they prove:
Theorem 1: Let
t = p₂(1−p₂)(½−p₂) / [ p₁(1−p₁)(p₁−½) ] + 1.
If p₂ < ½ < p₁ (mixture of easy & hard):
- If p₁ + p₂ < 1 and α ≥ 1 − 1/t ⇒ F(K) increases monotonically.
- If p₁ + p₂ > 1 and α ≤ 1 − 1/t ⇒ F(K) decreases monotonically.
- If p₁ + p₂ > 1 but α > 1 − 1/t ⇒ F(K) goes up then down (“inverse-U”).
- If p₁ + p₂ < 1 but α < 1 − 1/t ⇒ F(K) goes down then up (“U”).
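The case analysis is easy to mechanise; the helper below is my own executable restatement of the conditions above (vote_regime is an assumed name, not code from the paper):

```python
# Executable restatement of Theorem 1's case analysis; assumes p2 < 0.5 < p1.
def vote_regime(alpha, p1, p2):
    """Qualitative shape of F(K) as the number of calls K grows."""
    t = p2 * (1 - p2) * (0.5 - p2) / (p1 * (1 - p1) * (p1 - 0.5)) + 1
    thresh = 1 - 1 / t          # initial slope of F is positive iff alpha > thresh
    if p1 + p2 > 1:             # hard-item errors decay slower: F eventually falls
        return "inverse-U" if alpha > thresh else "monotone decreasing"
    else:                       # easy-item gains decay slower: F eventually rises
        return "monotone increasing" if alpha >= thresh else "U-shaped"

print(vote_regime(0.6, 0.7, 0.45))  # -> inverse-U (matches the sketches above)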
Interpretation: accuracy on easy items rises with K while accuracy on hard items falls with K. The mixture weight α decides which trend dominates at small K, and the sign of p₁ + p₂ − 1 decides which dominates in the limit.
4 Optimal Ensemble Size
The unique K* that maximises accuracy (for the inverse-U scenario) is:
K* ≈ 2 · log[ (α/(1−α)) · (2p₁−1)/(1−2p₂) ] / log[ p₂(1−p₂) / (p₁(1−p₁)) ] (2)
(rounded to the nearest odd integer). In practice K* can be as small as 3-5 even when the upstream model is weak.
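A sketch of formula (2) in code (the function name k_star and the odd-rounding detail are mine; it is valid only in the inverse-U regime, where both log arguments are positive):

```python
# Approximate optimal ensemble size from equation (2); illustrative only.
import math

def k_star(alpha, p1, p2):
    """Approximate accuracy-maximising number of calls, rounded to odd."""
    num = math.log((alpha / (1 - alpha)) * (2 * p1 - 1) / (1 - 2 * p2))
    den = math.log(p2 * (1 - p2) / (p1 * (1 - p1)))
    k = 2 * num / den
    return max(1, 2 * round((k - 1) / 2) + 1)  # nearest odd integer, at least 1

print(k_star(0.6, 0.7, 0.45))  # -> 21, matching the exact curve's peak above
```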
5 Practical Scaling Law
Because incomplete beta functions are cumbersome to fit directly, the authors propose an approximator:
G(K) = α·g_{p₁}(K; c₁) + (1−α)·g_{p₂}(K; c₂)
with

g_p(K; c) = 1 − exp(−cᵀ[K, √K, 1])  if p > 0.5
g_p(K; c) = exp(−cᵀ[K, √K, 1])      if p < 0.5   (3)
Simple exponential fits capture the empirically rapid convergence/divergence.
Parameter-estimation algorithm (Alg. 1; a code sketch follows below):
1 For each training item, do 1–5 calls, compute majority vote, label item as easy/hard.
2 Fit (3) to those points via least-squares to get c.
3 Average g_p(·) over training items → Ĝ(K).
Only a handful of calls are needed to predict accuracy for K up to 1000.
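A possible end-to-end sketch of Alg. 1, assuming the per-group accuracies at small K have already been measured in step 1 (all numbers below are made-up placeholders, not results from the paper):

```python
# Sketch of Alg. 1: fit equation (3) per difficulty group, then extrapolate.
import numpy as np
from scipy.optimize import curve_fit

def g_easy(K, c0, c1, c2):
    return 1 - np.exp(-(c0 * K + c1 * np.sqrt(K) + c2))  # eq. (3), p > 0.5 branch

def g_hard(K, c0, c1, c2):
    return np.exp(-(c0 * K + c1 * np.sqrt(K) + c2))      # eq. (3), p < 0.5 branch

K_obs = np.array([1, 3, 5, 7, 9])                        # a handful of cheap calls
acc_easy = np.array([0.70, 0.78, 0.84, 0.87, 0.90])      # assumed easy-item accuracies
acc_hard = np.array([0.45, 0.43, 0.41, 0.39, 0.38])      # assumed hard-item accuracies

c_easy, _ = curve_fit(g_easy, K_obs, acc_easy, p0=[0.1, 0.3, 0.8])
c_hard, _ = curve_fit(g_hard, K_obs, acc_hard, p0=[0.02, 0.05, 0.7])

alpha_hat = 0.6                                          # easy share from step 1
K_grid = np.arange(1, 1001, 2)
G = alpha_hat * g_easy(K_grid, *c_easy) + (1 - alpha_hat) * g_hard(K_grid, *c_hard)
print("predicted best K:", K_grid[np.argmax(G)])         # extrapolates far past K=9
```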
6 Experiments
Synthetic: all eight (α, p₁, p₂) settings are predicted within 1e-6–1e-4 MSE; the analytic K* matches brute-force search exactly (Table 1 in paper).
Real LLM: GPT-3.5-turbo on three MMLU subsets:

| Dataset | Easy share | Empirical best-K | Predicted best-K | Peak gain vs K=1 |
|---|---|---|---|---|
| College Math | 42 % | 5 | 5 | +8.2 pp |
| Business Ethics | 36 % | 3 | 4 | +6.1 pp |
| College Chemistry | 62 % | 1000 (monotone) | monotone | +13.4 pp |
The fitted scaling law tracks the full accuracy curve (blue vs orange dots in Fig. 4) and identifies the same optimum without measuring every K.
7 Difficulty Prediction Study
GPT-4 is prompted zero-shot: “Given the question and the VIN prediction, is this query difficult for the VIN?” Accuracy: 66–72 % (GPT-3.5 performs near the dummy baseline). Such hardness detectors would enable adaptive-K policies, which remain an open research problem; a rough sketch follows.
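As one illustration of what such a policy could look like (the callables predict_is_hard and llm_call are hypothetical stand-ins, e.g. for a GPT-4 difficulty prompt and a model call; none of this is an API from the paper):

```python
# Hedged sketch of an adaptive-K policy built on a difficulty predictor.
from collections import Counter

def adaptive_vote(query, llm_call, predict_is_hard, k_easy=21, k_hard=1):
    """Spend many calls only where majority voting is expected to help."""
    # On hard items (per-call success < 0.5) extra votes amplify errors, so K=1.
    k = k_hard if predict_is_hard(query) else k_easy
    answers = [llm_call(query) for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]  # majority vote
```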
8 Key Takeaways for Practitioners
1 More calls can decrease accuracy when a substantial fraction of inputs are “hard” (single-call success rate < 0.5).
2 Before committing to large CoT@32-style ensembles, sample a few K values and fit the scaling law; this often points to a much smaller optimal K, saving latency and API cost.
3 Dynamic policies (estimate difficulty → choose K) have untapped potential; GPT-4 already provides usable difficulty signals.
4 The analysis currently covers majority-vote ensembles; other aggregation schemes (reranking, self-consistency with LLM judges, AlphaCode-style test-case filtering) require separate scaling laws—an open direction highlighted by the authors.
9 Limitations & Future Work
- Theory uses a two-difficulty mixture and binary answer space; extension to continuous difficulty and large answer sets is sketched but not fully proved.
- Assumes i.i.d. calls; correlated generation strategies may change behaviour.
- Difficulty prediction is still noisy; better predictors or proxy metrics (e.g., log-prob gap) could improve adaptive systems.
- Scaling laws for multi-layer or heterogeneous compound systems (LLM + search, code execution, etc.) remain unexplored.
The paper provides both a cautionary tale—“bigger ensembles are not always better”—and a practical recipe to size LLM ensembles systematically instead of by trial-and-error.