Factorized Active Querying (FAQ)
- FAQ is a methodology for efficient finite-population inference in LLM benchmarking, using adaptive querying to maximize information gain.
- It employs a Bayesian factor model and a hybrid variance-reduction/active-learning policy to construct tight, valid confidence intervals.
- Empirical evaluations show up to 5× improvement in effective sample size over uniform sampling, reducing evaluation costs significantly.
Factorized Active Querying (FAQ) is a methodology for efficient finite-population inference in benchmarking LLMs, especially in settings where exhaustive evaluation across large question banks and many LLMs is impractical. FAQ constructs tight confidence intervals (CIs) for model accuracy by leveraging historical performance through a Bayesian factor model, and adaptively samples evaluation queries with a hybrid variance-reduction/active-learning policy. Valid frequentist coverage is guaranteed by Proactive Active Inference (PAI), a finite-population extension of active inference that supports direct and adaptive query selection while controlling CI validity. FAQ delivers up to 5× increases in effective sample size (ESS) over conventional uniform sampling, thereby achieving the same CI width with significantly fewer queries and negligible computational overhead (Wu et al., 28 Jan 2026).
1. Finite-Population Inference in LLM Benchmarking
Let the question bank contain $N_q$ items. For a new LLM, denote by $z_j \in \{0,1\}$ its correctness on question $j$, with finite-population accuracy defined as

$$\theta \;=\; \frac{1}{N_q}\sum_{j=1}^{N_q} z_j.$$

Historical model outcomes are encoded in a matrix $H$, where many entries may be missing. Given a query budget $n_b \ll N_q$, the goal is to adaptively sample questions, observe their correctness $z_j$, and construct a $(1-\alpha)$-CI for $\theta$ over the fixed bank.
Uniform random sampling yields unbiased accuracy estimates and Wald confidence intervals but fails to account for heterogeneity in question difficulty and informative signals from historical data. FAQ exploits these by modeling latent structure and selectively sampling queries to maximize information gain, thereby reducing estimator variance and increasing ESS. Empirically, FAQ attains the same CI width as uniform sampling with up to 5× fewer queries.
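For reference, the uniform-sampling Wald baseline can be sketched in a few lines. All quantities below (bank size, true accuracy, budget) are synthetic stand-ins, not the paper's data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical question bank: z[j] = 1 if the new LLM answers question j correctly.
N_q = 10_000
z = (rng.random(N_q) < 0.7).astype(float)   # true finite-population accuracy ~0.7
theta = z.mean()

# Uniform baseline: draw n_b questions uniformly at random, report a Wald CI.
n_b = 500
idx = rng.integers(0, N_q, size=n_b)
theta_hat = z[idx].mean()
se = np.sqrt(theta_hat * (1 - theta_hat) / n_b)   # Wald standard error
ci = (theta_hat - 1.96 * se, theta_hat + 1.96 * se)
print(theta, theta_hat, ci)
```

This is the estimator FAQ is benchmarked against: unbiased, but blind to question difficulty and to $H$.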
2. Bayesian Factor Model for Historical Data
FAQ models historical outcomes via a low-rank Bayesian factor model. Each historical model $i$ is represented by a $k$-dimensional latent proficiency $u_i \in \mathbb{R}^k$; each question $j$ by a latent requirement $v_j \in \mathbb{R}^k$. The conditional probability of correctness is

$$\Pr(z_{ij} = 1 \mid u_i, v_j) \;=\; \sigma(u_i^\top v_j), \qquad \sigma(x) = \frac{1}{1 + e^{-x}}.$$
Parameter estimation minimizes the masked logistic loss with $\ell_2$ regularization:

$$\min_{U,V}\; -\sum_{i,j} m_{ij}\Big[z_{ij}\log\sigma(u_i^\top v_j) + (1 - z_{ij})\log\big(1 - \sigma(u_i^\top v_j)\big)\Big] \;+\; \lambda\big(\|U\|_F^2 + \|V\|_F^2\big),$$

where $m_{ij} = 1$ iff $z_{ij}$ is observed; the rank $k$ and penalty $\lambda$ are selected by cross-validation.
For a new model, the prior is $u \sim N(\hat u^{(0)}, \hat\Sigma^{(0)})$, with empirical moments computed from the historical $\hat u_i$. After each query $z_{I_t}$, the approximate posterior $N(\hat u^{(t)}, \hat\Sigma^{(t)})$ is updated using a Laplace approximation.
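A minimal sketch of this posterior update, assuming a Gaussian prior and the logistic observation model above. The helper `laplace_update` and all synthetic quantities are illustrative, not the paper's implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def laplace_update(u0, Sigma0, V_obs, z_obs, n_newton=25):
    """Laplace approximation N(u_hat, Sigma_hat) to the posterior over the new
    model's latent proficiency u, given a Gaussian prior N(u0, Sigma0), the
    factors V_obs of the queried questions, and their observed correctness z_obs."""
    P0 = np.linalg.inv(Sigma0)                 # prior precision
    u = u0.copy()
    for _ in range(n_newton):                  # Newton ascent on the log posterior
        p = sigmoid(V_obs @ u)
        grad = V_obs.T @ (z_obs - p) - P0 @ (u - u0)
        H = (V_obs * (p * (1 - p))[:, None]).T @ V_obs + P0   # negative Hessian
        u = u + np.linalg.solve(H, grad)
    return u, np.linalg.inv(H)                 # posterior mean and covariance

# Toy check with synthetic question factors.
rng = np.random.default_rng(1)
k, m = 4, 40
V = rng.normal(size=(m, k))
u_true = rng.normal(size=k)
z = (rng.random(m) < sigmoid(V @ u_true)).astype(float)
u_hat, Sigma_hat = laplace_update(np.zeros(k), np.eye(k), V, z)
```

Each observed answer tightens the Gaussian approximation, which is what makes the one-step variance-reduction scores in the next section cheap to compute.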
3. Hybrid Variance-Reduction / Active-Learning Sampling Policy
At each round $t$, FAQ computes a query distribution $q_t$ over the $N_q$ questions, balancing:
- Oracle variance-reduction: For a Bernoulli superpopulation, the minimal-variance sampling is $q_j^\ast \propto \sqrt{p_j(1-p_j)}$. In practice, use the plug-in score $s_o(j) = \sqrt{\hat p_j^{(t-1)}\big(1 - \hat p_j^{(t-1)}\big)}$ with $\hat p_j^{(t-1)} = \sigma(\hat u^{(t-1)\top} v_j)$.
- Active-learning: Score $s_a(j)$ by the estimated reduction in the posterior variance of $\theta$, where the one-step downward variance update is computed with a Laplace/Delta approximation.
Normalizing each score to a distribution, mixing with weight $\beta_t$, and tempering with exponent $\gamma$ leads to probabilities

$$q_t^{\mathrm{mix}}(j) \;\propto\; \big[\beta_t\,\tilde s_o(j) + (1-\beta_t)\,\tilde s_a(j)\big]^{\gamma}.$$

The final query distribution adds a uniform floor:

$$q_t(j) \;=\; (1-\epsilon)\,q_t^{\mathrm{mix}}(j) + \frac{\epsilon}{N_q}.$$
The adaptive weights $\beta_t$, $\gamma$, and $\epsilon$ control exploration and exploitation. This composite policy concentrates queries where they are most informative for reducing uncertainty in the accuracy estimate, and for accelerating adaptation of the latent factor $\hat u$ for the new LLM.
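The mix-and-temper step can be sketched as follows. Here `beta`, `gamma`, and `eps` stand in for the mixing weight, tempering exponent, and uniform floor, and the function is an illustrative reading of the policy rather than the paper's exact formula:

```python
import numpy as np

def query_distribution(p_hat, var_red, beta=0.5, gamma=0.8, eps=0.05):
    """Hybrid query distribution over the bank (illustrative sketch).
    p_hat   : predicted correctness probabilities for the new model, shape (N_q,)
    var_red : estimated one-step posterior-variance reduction per question, (N_q,)"""
    s_o = np.sqrt(p_hat * (1 - p_hat))        # oracle variance-reduction score
    s_a = np.maximum(var_red, 0.0)            # active-learning score
    s_o = s_o / s_o.sum()                     # normalize each score to a distribution
    s_a = s_a / max(s_a.sum(), 1e-12)
    mix = (beta * s_o + (1 - beta) * s_a) ** gamma   # mix, then temper
    mix = mix / mix.sum()
    N_q = len(p_hat)
    return (1 - eps) * mix + eps / N_q        # uniform floor keeps every q_t(j) > 0

# Hypothetical usage on a synthetic bank of 200 questions.
rng = np.random.default_rng(0)
p_hat = rng.uniform(0.05, 0.95, 200)
var_red = rng.uniform(0.0, 1.0, 200)
q = query_distribution(p_hat, var_red)
```

The uniform floor is what later lets the PAI estimator divide by $q_t(I_t)$ without blowing up.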
Pseudocode
```
Algorithm: Factorized Active Querying (FAQ)
Input: historical H, budget n_b, level α

1. Fit factor model on H → prior N(û⁽⁰⁾, Σ̂⁽⁰⁾)
2. for t = 1 … n_b do
     Compute scores s_o, s_a from û⁽ᵗ⁻¹⁾, Σ̂⁽ᵗ⁻¹⁾
     Form q_t via (3.3): mix/temper oracle & active scores
     Sample I_t ∼ q_t; query z_{I_t}
     Update û⁽ᵗ⁾, Σ̂⁽ᵗ⁾ via (2.3)–(2.4)
     Compute φ_t = N_q⁻¹ ∑_j p̂_j⁽ᵗ⁻¹⁾ + N_q⁻¹ (z_{I_t} − p̂_{I_t}⁽ᵗ⁻¹⁾)/q_t(I_t)
   end for
3. Estimate θ̂ = n_b⁻¹ ∑_t φ_t
4. Estimate variance σ̂² via the martingale-variance formula
5. Return θ̂ ± z_{1−α/2} · σ̂ / √n_b
```
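The loop above can be exercised end-to-end on synthetic data. For brevity, this sketch freezes the predictions $\hat p_j$ rather than refreshing them from the posterior each round, so it illustrates the sampling, pseudo-outcome, and interval steps only:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: a bank with true correctness z, and fixed noisy
# predictions p_hat (the real method re-estimates p_hat after every query).
N_q, n_b = 5_000, 400
p_true = rng.beta(2, 2, size=N_q)
z = (rng.random(N_q) < p_true).astype(float)
theta = z.mean()
p_hat = np.clip(p_true + rng.normal(0, 0.1, size=N_q), 0.01, 0.99)

# Oracle-style query distribution with a uniform floor for validity.
eps = 0.1
q = np.sqrt(p_hat * (1 - p_hat))
q = (1 - eps) * q / q.sum() + eps / N_q

# PAI loop: each pseudo-outcome phi_t satisfies E[phi_t | past] = theta.
phi = np.empty(n_b)
base = p_hat.mean()
for t in range(n_b):
    I = rng.choice(N_q, p=q)
    phi[t] = base + (z[I] - p_hat[I]) / (N_q * q[I])

theta_hat = phi.mean()
half = 1.96 * phi.std(ddof=1) / np.sqrt(n_b)
print(f"theta={theta:.3f}  CI=[{theta_hat - half:.3f}, {theta_hat + half:.3f}]")
```

Because the control variate `base` absorbs most of the signal, the per-round correction term is small wherever `p_hat` is accurate, which is the source of the variance reduction.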
4. Proactive Active Inference and Confidence Interval Validity
FAQ employs Proactive Active Inference (PAI) to ensure unbiasedness and CI coverage under finite-population, adaptive sampling. At round $t$, the estimator

$$\phi_t \;=\; \frac{1}{N_q}\sum_{j=1}^{N_q} \hat p_j^{(t-1)} \;+\; \frac{z_{I_t} - \hat p_{I_t}^{(t-1)}}{N_q\, q_t(I_t)}$$

is such that $\phi_t - \theta$ is a martingale difference. The accumulated average $\hat\theta = n_b^{-1}\sum_{t=1}^{n_b} \phi_t$ is unbiased for $\theta$ for any adaptive policy $\{q_t\}$.
The variance estimator is

$$\hat\sigma^2 \;=\; \frac{1}{n_b}\sum_{t=1}^{n_b} \big(\phi_t - \hat\theta\big)^2.$$
A martingale CLT justifies reporting the Wald-type interval $\hat\theta \pm z_{1-\alpha/2}\,\hat\sigma/\sqrt{n_b}$, yielding asymptotic $1-\alpha$ coverage without independence or correct-model assumptions.
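The coverage claim can be checked by Monte Carlo on a synthetic bank: repeating the estimator many times, the 95% Wald interval should cover the true accuracy in roughly 95% of runs. This sketch uses a fixed (non-adaptive) $q$ and synthetic predictions for simplicity:

```python
import numpy as np

rng = np.random.default_rng(7)

# One fixed synthetic bank; predictions p_hat double as the generating model.
N_q, n_b, reps = 2_000, 300, 500
p_hat = rng.uniform(0.05, 0.95, size=N_q)
z = (rng.random(N_q) < p_hat).astype(float)
theta = z.mean()

q = np.sqrt(p_hat * (1 - p_hat))          # oracle-style scores...
q = 0.9 * q / q.sum() + 0.1 / N_q         # ...with a 10% uniform floor

cover = 0
for _ in range(reps):
    I = rng.choice(N_q, size=n_b, p=q)
    phi = p_hat.mean() + (z[I] - p_hat[I]) / (N_q * q[I])
    th = phi.mean()
    se = phi.std(ddof=1) / np.sqrt(n_b)
    cover += (th - 1.96 * se <= theta <= th + 1.96 * se)
print(cover / reps)                        # empirical coverage, expected near 0.95
```

The same check with an adaptive $q_t$ is what the martingale CLT licenses; the simulation above only illustrates the mechanics.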
5. Theoretical Properties and Effective Sample Size
- The oracle minimal-variance policy (Theorem 3.1) in a Bernoulli superpopulation is $q_j^\ast \propto \sqrt{p_j(1-p_j)}$.
- Under boundedness and statistical regularity, a martingale CLT (Theorem 4.1) applies: $\sqrt{n_b}\,(\hat\theta - \theta)/\hat\sigma \Rightarrow N(0,1)$.
- Effective sample size for a method M is defined as the number of uniform-sampling queries required to match M's CI width, i.e. $\mathrm{ESS}_M = n_b\,\hat\sigma^2_{\mathrm{unif}}/\hat\sigma^2_M$. FAQ achieves ESS multipliers of up to roughly 5× versus uniform sampling and more than 2× over the strongest active-inference baselines.
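Since the Wald half-width scales as $\sigma/\sqrt{n}$, matching widths gives a simple closed form for ESS. The numbers below are illustrative, not the paper's measurements:

```python
# ESS of method M relative to uniform sampling: the number of uniform queries
# needed to reach the same CI width. Matching sigma_unif/sqrt(ESS) to
# sigma_M/sqrt(n_b) gives ESS_M = n_b * (sigma_unif / sigma_M)**2.
def effective_sample_size(n_b, sigma_unif, sigma_M):
    return n_b * (sigma_unif / sigma_M) ** 2

# If a method's per-round std is half of uniform's at 500 queries,
# its ESS is 2000, i.e. a 4x multiplier.
print(effective_sample_size(500, 0.5, 0.25))  # -> 2000.0
```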
6. Benchmarking, Evaluation, and Empirical Results
FAQ was evaluated on historical outcomes spanning a large pool of LLMs and questions, partitioned into two suites: MMLU-Pro and a combined suite of BBH, GPQA, IFEval, MATH, and MuSR. A historical-test split designates part of the models as historical and reserves the remainder for testing. Experiments simulated various levels of missingness in $H$, ranging from fully observed to sparse.
Baselines included uniform sampling (Wald intervals) and sequential active inference (AIPW) [Zrnic & Candès, 2024], with oracle-tuned labeling policies. At the reported query budget, the following empirical results were observed:
| Method | ESS× (MMLU-Pro) | Coverage (MMLU-Pro) | ESS× (BBH+…) | Coverage (BBH+…) |
|---|---|---|---|---|
| Uniform | 1.00 | 0.95 | 1.00 | 0.95 |
| Best AIPW | 1.8 | 0.94 | 2.2 | 0.94 |
| FAQ | 4.5 | 0.95 | 4.8 | 0.95 |
FAQ’s CI widths are about $1/5$ those of uniform sampling at small budgets, with coverage consistently close to the nominal $0.95$. Coverage remains stable across model vintages and true accuracies without systematic bias. A plausible implication is that FAQ offers substantial practical value in high-throughput LLM evaluation with constrained annotation budgets while maintaining rigorous statistical guarantees.
7. Conclusion and Significance
FAQ integrates a Bayesian logistic factor model, a hybrid Neyman/active-learning adaptive sampling policy, and a martingale-based inference engine via PAI, yielding efficient, robust, and statistically valid benchmarking of LLMs on large question banks. In comprehensive empirical evaluation, FAQ achieves up to 5× cost savings over uniform sampling at equivalent CI width, with minimal computational overhead and flexible adaptation to varying levels of historical data availability (Wu et al., 28 Jan 2026). Its methodology and codebase provide a reproducible and extensible framework for principled evaluation in the era of rapidly proliferating LLM variants.