Factorized Active Querying (FAQ)
- FAQ is a methodology for efficient finite-population inference in LLM benchmarking, using adaptive querying to maximize information gain.
- It employs a Bayesian factor model and a hybrid variance-reduction/active-learning policy to construct tight, valid confidence intervals.
- Empirical evaluations show up to 5× improvement in effective sample size over uniform sampling, reducing evaluation costs significantly.
Factorized Active Querying (FAQ) is a methodology for efficient finite-population inference in benchmarking LLMs, especially in settings where exhaustive evaluation across large question banks and many LLMs is impractical. FAQ constructs tight confidence intervals (CIs) for model accuracy by leveraging historical performance through a Bayesian factor model, and adaptively samples evaluation queries with a hybrid variance-reduction/active-learning policy. Valid frequentist coverage is guaranteed by Proactive Active Inference (PAI), a finite-population extension of active inference that supports direct and adaptive query selection while controlling CI validity. FAQ delivers up to 5× increases in effective sample size (ESS) over conventional uniform sampling, thereby achieving the same CI width with significantly fewer queries and negligible computational overhead (Wu et al., 28 Jan 2026).
1. Finite-Population Inference in LLM Benchmarking
Let the question bank contain $N_q$ items. For a new LLM, denote by $z_j \in \{0,1\}$ its correctness on question $j$, with finite-population accuracy defined as

$$\theta \;=\; \frac{1}{N_q}\sum_{j=1}^{N_q} z_j.$$

Historical model outcomes are encoded in a matrix $H$, where many entries may be missing. Given a query budget $n_b \ll N_q$, the goal is to adaptively sample questions, observe their correctness $z_j$, and construct a $(1-\alpha)$-CI for $\theta$ over the fixed bank.
Uniform random sampling yields unbiased accuracy estimates and Wald confidence intervals but fails to account for heterogeneity in question difficulty and informative signals from historical data. FAQ exploits these by modeling latent structure and selectively sampling queries to maximize information gain, thereby reducing estimator variance and increasing ESS. Empirically, FAQ attains the same CI width as uniform sampling with up to 5× fewer queries.
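For reference, the uniform-sampling Wald baseline can be sketched in a few lines. All quantities below (bank size, true accuracy, budget) are synthetic stand-ins, not the paper's data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical question bank: z[j] = 1 if the new LLM answers question j correctly.
N_q = 10_000
z = (rng.random(N_q) < 0.7).astype(float)   # true finite-population accuracy ~0.7
theta = z.mean()

# Uniform baseline: draw n_b questions uniformly at random, report a Wald CI.
n_b = 500
idx = rng.integers(0, N_q, size=n_b)
theta_hat = z[idx].mean()
se = np.sqrt(theta_hat * (1 - theta_hat) / n_b)   # Wald standard error
ci = (theta_hat - 1.96 * se, theta_hat + 1.96 * se)
print(theta, theta_hat, ci)
```

This is the estimator FAQ is benchmarked against: unbiased, but blind to question difficulty and to $H$.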
2. Bayesian Factor Model for Historical Data
FAQ models historical outcomes via a low-rank Bayesian factor model. Each historical model $i$ is represented by a $k$-dimensional latent proficiency $u_i \in \mathbb{R}^k$; each question $j$ by a latent requirement $v_j \in \mathbb{R}^k$. The conditional probability of correctness is

$$\Pr(z_{ij} = 1 \mid u_i, v_j) \;=\; \sigma(u_i^\top v_j), \qquad \sigma(x) = \frac{1}{1 + e^{-x}}.$$
Parameter estimation minimizes the masked logistic loss with $\ell_2$ regularization:

$$\min_{U,V}\; -\sum_{i,j} m_{ij}\Big[z_{ij}\log\sigma(u_i^\top v_j) + (1 - z_{ij})\log\big(1 - \sigma(u_i^\top v_j)\big)\Big] \;+\; \lambda\big(\|U\|_F^2 + \|V\|_F^2\big),$$

where $m_{ij} = 1$ iff $z_{ij}$ is observed; the rank $k$ and penalty $\lambda$ are selected by cross-validation.
For a new model, the prior is $u \sim N(\hat u^{(0)}, \hat\Sigma^{(0)})$, with empirical moments computed from the historical $\hat u_i$. After each query $z_{I_t}$, the approximate posterior $N(\hat u^{(t)}, \hat\Sigma^{(t)})$ is updated using a Laplace approximation.
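A minimal sketch of this posterior update, assuming a Gaussian prior and the logistic observation model above. The helper `laplace_update` and all synthetic quantities are illustrative, not the paper's implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def laplace_update(u0, Sigma0, V_obs, z_obs, n_newton=25):
    """Laplace approximation N(u_hat, Sigma_hat) to the posterior over the new
    model's latent proficiency u, given a Gaussian prior N(u0, Sigma0), the
    factors V_obs of the queried questions, and their observed correctness z_obs."""
    P0 = np.linalg.inv(Sigma0)                 # prior precision
    u = u0.copy()
    for _ in range(n_newton):                  # Newton ascent on the log posterior
        p = sigmoid(V_obs @ u)
        grad = V_obs.T @ (z_obs - p) - P0 @ (u - u0)
        H = (V_obs * (p * (1 - p))[:, None]).T @ V_obs + P0   # negative Hessian
        u = u + np.linalg.solve(H, grad)
    return u, np.linalg.inv(H)                 # posterior mean and covariance

# Toy check with synthetic question factors.
rng = np.random.default_rng(1)
k, m = 4, 40
V = rng.normal(size=(m, k))
u_true = rng.normal(size=k)
z = (rng.random(m) < sigmoid(V @ u_true)).astype(float)
u_hat, Sigma_hat = laplace_update(np.zeros(k), np.eye(k), V, z)
```

Each observed answer tightens the Gaussian approximation, which is what makes the one-step variance-reduction scores in the next section cheap to compute.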
3. Hybrid Variance-Reduction / Active-Learning Sampling Policy
At each round $t$, FAQ computes a query distribution $q_t$ over the $N_q$ questions, balancing:
- Oracle variance-reduction: For a Bernoulli superpopulation, the minimal-variance sampling is $q_j^\ast \propto \sqrt{p_j(1-p_j)}$. In practice, use the plug-in score $s_o(j) = \sqrt{\hat p_j^{(t-1)}\big(1 - \hat p_j^{(t-1)}\big)}$ with $\hat p_j^{(t-1)} = \sigma(\hat u^{(t-1)\top} v_j)$.
- Active-learning: Score $s_a(j)$ by the estimated reduction in the posterior variance of $\theta$, where the one-step downward variance update is computed with a Laplace/Delta approximation.
Normalizing each score to a distribution, mixing with weight $\beta_t$, and tempering with exponent $\gamma$ leads to probabilities

$$q_t^{\mathrm{mix}}(j) \;\propto\; \big[\beta_t\,\tilde s_o(j) + (1-\beta_t)\,\tilde s_a(j)\big]^{\gamma}.$$

The final query distribution adds a uniform floor:

$$q_t(j) \;=\; (1-\epsilon)\,q_t^{\mathrm{mix}}(j) + \frac{\epsilon}{N_q}.$$
The adaptive weights $\beta_t$, $\gamma$, and $\epsilon$ control exploration and exploitation. This composite policy concentrates queries where they are most informative for reducing uncertainty in the accuracy estimate, and for accelerating adaptation of the latent factor $\hat u$ for the new LLM.
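The mix-and-temper step can be sketched as follows. Here `beta`, `gamma`, and `eps` stand in for the mixing weight, tempering exponent, and uniform floor, and the function is an illustrative reading of the policy rather than the paper's exact formula:

```python
import numpy as np

def query_distribution(p_hat, var_red, beta=0.5, gamma=0.8, eps=0.05):
    """Hybrid query distribution over the bank (illustrative sketch).
    p_hat   : predicted correctness probabilities for the new model, shape (N_q,)
    var_red : estimated one-step posterior-variance reduction per question, (N_q,)"""
    s_o = np.sqrt(p_hat * (1 - p_hat))        # oracle variance-reduction score
    s_a = np.maximum(var_red, 0.0)            # active-learning score
    s_o = s_o / s_o.sum()                     # normalize each score to a distribution
    s_a = s_a / max(s_a.sum(), 1e-12)
    mix = (beta * s_o + (1 - beta) * s_a) ** gamma   # mix, then temper
    mix = mix / mix.sum()
    N_q = len(p_hat)
    return (1 - eps) * mix + eps / N_q        # uniform floor keeps every q_t(j) > 0

# Hypothetical usage on a synthetic bank of 200 questions.
rng = np.random.default_rng(0)
p_hat = rng.uniform(0.05, 0.95, 200)
var_red = rng.uniform(0.0, 1.0, 200)
q = query_distribution(p_hat, var_red)
```

The uniform floor is what later lets the PAI estimator divide by $q_t(I_t)$ without blowing up.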
Pseudocode
```
Algorithm: Factorized Active Querying (FAQ)
Input: historical H, budget n_b, level α

1. Fit factor model on H → prior N(û⁽⁰⁾, Σ̂⁽⁰⁾)
2. for t = 1 … n_b do
     Compute scores s_o, s_a from û⁽ᵗ⁻¹⁾, Σ̂⁽ᵗ⁻¹⁾
     Form q_t via (3.3): mix/temper oracle & active scores
     Sample I_t ∼ q_t; query z_{I_t}
     Update û⁽ᵗ⁾, Σ̂⁽ᵗ⁾ via (2.3)–(2.4)
     Compute φ_t = N_q⁻¹ ∑_j p̂_j⁽ᵗ⁻¹⁾ + N_q⁻¹ (z_{I_t} − p̂_{I_t}⁽ᵗ⁻¹⁾)/q_t(I_t)
   end for
3. Estimate θ̂ = n_b⁻¹ ∑_t φ_t
4. Estimate variance σ̂² via the martingale-variance formula
5. Return θ̂ ± z_{1−α/2} · σ̂ / √n_b
```
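The loop above can be exercised end-to-end on synthetic data. For brevity, this sketch freezes the predictions $\hat p_j$ rather than refreshing them from the posterior each round, so it illustrates the sampling, pseudo-outcome, and interval steps only:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: a bank with true correctness z, and fixed noisy
# predictions p_hat (the real method re-estimates p_hat after every query).
N_q, n_b = 5_000, 400
p_true = rng.beta(2, 2, size=N_q)
z = (rng.random(N_q) < p_true).astype(float)
theta = z.mean()
p_hat = np.clip(p_true + rng.normal(0, 0.1, size=N_q), 0.01, 0.99)

# Oracle-style query distribution with a uniform floor for validity.
eps = 0.1
q = np.sqrt(p_hat * (1 - p_hat))
q = (1 - eps) * q / q.sum() + eps / N_q

# PAI loop: each pseudo-outcome phi_t satisfies E[phi_t | past] = theta.
phi = np.empty(n_b)
base = p_hat.mean()
for t in range(n_b):
    I = rng.choice(N_q, p=q)
    phi[t] = base + (z[I] - p_hat[I]) / (N_q * q[I])

theta_hat = phi.mean()
half = 1.96 * phi.std(ddof=1) / np.sqrt(n_b)
print(f"theta={theta:.3f}  CI=[{theta_hat - half:.3f}, {theta_hat + half:.3f}]")
```

Because the control variate `base` absorbs most of the signal, the per-round correction term is small wherever `p_hat` is accurate, which is the source of the variance reduction.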
4. Proactive Active Inference and Confidence Interval Validity
FAQ employs Proactive Active Inference (PAI) to ensure unbiasedness and CI coverage under finite-population, adaptive sampling. At round $t$, the estimator

$$\phi_t \;=\; \frac{1}{N_q}\sum_{j=1}^{N_q} \hat p_j^{(t-1)} \;+\; \frac{z_{I_t} - \hat p_{I_t}^{(t-1)}}{N_q\, q_t(I_t)}$$

is such that $\phi_t - \theta$ is a martingale difference. The accumulated average $\hat\theta = n_b^{-1}\sum_{t=1}^{n_b} \phi_t$ is unbiased for $\theta$ for any adaptive policy $\{q_t\}$.
The variance estimator is

$$\hat\sigma^2 \;=\; \frac{1}{n_b}\sum_{t=1}^{n_b} \big(\phi_t - \hat\theta\big)^2.$$
A martingale CLT justifies reporting the Wald-type interval $\hat\theta \pm z_{1-\alpha/2}\,\hat\sigma/\sqrt{n_b}$, yielding asymptotic $1-\alpha$ coverage without independence or correct-model assumptions.
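The coverage claim can be checked by Monte Carlo on a synthetic bank: repeating the estimator many times, the 95% Wald interval should cover the true accuracy in roughly 95% of runs. This sketch uses a fixed (non-adaptive) $q$ and synthetic predictions for simplicity:

```python
import numpy as np

rng = np.random.default_rng(7)

# One fixed synthetic bank; predictions p_hat double as the generating model.
N_q, n_b, reps = 2_000, 300, 500
p_hat = rng.uniform(0.05, 0.95, size=N_q)
z = (rng.random(N_q) < p_hat).astype(float)
theta = z.mean()

q = np.sqrt(p_hat * (1 - p_hat))          # oracle-style scores...
q = 0.9 * q / q.sum() + 0.1 / N_q         # ...with a 10% uniform floor

cover = 0
for _ in range(reps):
    I = rng.choice(N_q, size=n_b, p=q)
    phi = p_hat.mean() + (z[I] - p_hat[I]) / (N_q * q[I])
    th = phi.mean()
    se = phi.std(ddof=1) / np.sqrt(n_b)
    cover += (th - 1.96 * se <= theta <= th + 1.96 * se)
print(cover / reps)                        # empirical coverage, expected near 0.95
```

The same check with an adaptive $q_t$ is what the martingale CLT licenses; the simulation above only illustrates the mechanics.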
5. Theoretical Properties and Effective Sample Size
- The oracle minimal-variance policy (Theorem 3.1) in a Bernoulli superpopulation is $q_j^\ast \propto \sqrt{p_j(1-p_j)}$.
- Under boundedness and statistical regularity, a martingale CLT (Theorem 4.1) applies: $\sqrt{n_b}\,(\hat\theta - \theta)/\hat\sigma \Rightarrow N(0,1)$.
- Effective sample size for a method M is defined as the number of uniform-sampling queries required to match M's CI width, i.e. $\mathrm{ESS}_M = n_b\,\hat\sigma^2_{\mathrm{unif}}/\hat\sigma^2_M$. FAQ achieves ESS multipliers of up to roughly 5× versus uniform sampling and more than 2× over the strongest active-inference baselines.
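Since the Wald half-width scales as $\sigma/\sqrt{n}$, matching widths gives a simple closed form for ESS. The numbers below are illustrative, not the paper's measurements:

```python
# ESS of method M relative to uniform sampling: the number of uniform queries
# needed to reach the same CI width. Matching sigma_unif/sqrt(ESS) to
# sigma_M/sqrt(n_b) gives ESS_M = n_b * (sigma_unif / sigma_M)**2.
def effective_sample_size(n_b, sigma_unif, sigma_M):
    return n_b * (sigma_unif / sigma_M) ** 2

# If a method's per-round std is half of uniform's at 500 queries,
# its ESS is 2000, i.e. a 4x multiplier.
print(effective_sample_size(500, 0.5, 0.25))  # -> 2000.0
```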
6. Benchmarking, Evaluation, and Empirical Results
FAQ was evaluated on historical outcomes spanning a large pool of LLMs and questions, partitioned into two suites: MMLU-Pro and a combined suite of BBH, GPQA, IFEval, MATH, and MuSR. A historical-test split designates part of the models as historical and reserves the remainder for testing. Experiments simulated various levels of missingness in $H$, ranging from fully observed to sparse.
Baselines included uniform sampling (Wald intervals) and sequential active inference (AIPW) [Zrnic & Candès, 2024], with oracle-tuned labeling policies. At the reported query budget, the following empirical results were observed:
| Method | ESS× (MMLU-Pro) | Coverage (MMLU-Pro) | ESS× (BBH+…) | Coverage (BBH+…) |
|---|---|---|---|---|
| Uniform | 1.00 | 0.95 | 1.00 | 0.95 |
| Best AIPW | 1.8 | 0.94 | 2.2 | 0.94 |
| FAQ | 4.5 | 0.95 | 4.8 | 0.95 |
FAQ’s CI widths are about $1/5$ those of uniform sampling at small budgets, with coverage consistently close to the nominal $0.95$. Coverage remains stable across model vintages and true accuracies without systematic bias. A plausible implication is that FAQ offers substantial practical value in high-throughput LLM evaluation with constrained annotation budgets while maintaining rigorous statistical guarantees.
7. Conclusion and Significance
FAQ integrates a Bayesian logistic factor model, a hybrid Neyman/active-learning adaptive sampling policy, and a martingale-based inference engine via PAI, yielding efficient, robust, and statistically valid benchmarking of LLMs on large question banks. In comprehensive empirical evaluation, FAQ achieves up to 5× cost savings over uniform sampling at equivalent CI width, with minimal computational overhead and flexible adaptation to varying levels of historical data availability (Wu et al., 28 Jan 2026). Its methodology and codebase provide a reproducible and extensible framework for principled evaluation in the era of rapidly proliferating LLM variants.