
Factorized Active Querying (FAQ)

Updated 4 February 2026
  • FAQ is a methodology for efficient finite-population inference in LLM benchmarking, using adaptive querying to maximize information gain.
  • It employs a Bayesian factor model and a hybrid variance-reduction/active-learning policy to construct tight, valid confidence intervals.
  • Empirical evaluations show up to 5× improvement in effective sample size over uniform sampling, reducing evaluation costs significantly.

Factorized Active Querying (FAQ) is a methodology for efficient finite-population inference in benchmarking LLMs, especially in settings where exhaustive evaluation across large question banks and many LLMs is impractical. FAQ constructs tight confidence intervals (CIs) for model accuracy by leveraging historical performance through a Bayesian factor model, and adaptively samples evaluation queries with a hybrid variance-reduction/active-learning policy. Valid frequentist coverage is guaranteed by Proactive Active Inference (PAI), a finite-population extension of active inference that supports direct and adaptive query selection while controlling CI validity. FAQ delivers up to $5\times$ increases in effective sample size (ESS) over conventional uniform sampling, thereby achieving the same CI width with significantly fewer queries and negligible computational overhead (Wu et al., 28 Jan 2026).

1. Finite-Population Inference in LLM Benchmarking

Let the question bank contain $N_q$ items. For a new LLM, denote by $z_j\in\{0,1\}$ its correctness on question $j$, with finite-population accuracy defined as

$$\theta = \frac{1}{N_q} \sum_{j=1}^{N_q} z_j.$$

Historical model outcomes are encoded in $H\in\{0,1,\mathrm{NA}\}^{N_\mathrm{old}\times N_q}$, where many entries may be missing. Given a query budget $n_b$, the goal is to adaptively sample $n_b$ questions, observe their $z_j$, and construct a $(1-\alpha)$-CI for $\theta$ over the fixed bank.

Uniform random sampling yields unbiased accuracy estimates and Wald confidence intervals but fails to account for heterogeneity in question difficulty and informative signals from historical data. FAQ exploits these by modeling latent structure and selectively sampling queries to maximize information gain, thereby reducing estimator variance and increasing ESS. Empirically, FAQ attains the same CI width as uniform sampling with up to $5\times$ fewer queries.
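The uniform-sampling baseline is easy to make concrete. The sketch below (all sizes and the accuracy level are illustrative, not from the paper) draws a synthetic answer vector for a question bank and forms the standard Wald interval from a uniform subsample:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical question bank: z[j] = 1 if the new LLM answers question j correctly.
N_q = 10_000
z = (rng.random(N_q) < 0.62).astype(float)   # synthetic bank with accuracy near 0.62
theta = z.mean()                             # finite-population accuracy

# Uniform-sampling baseline: draw n_b questions, form the Wald 95% CI.
n_b = 500
idx = rng.integers(0, N_q, size=n_b)
theta_hat = z[idx].mean()
se = np.sqrt(theta_hat * (1 - theta_hat) / n_b)
ci = (theta_hat - 1.96 * se, theta_hat + 1.96 * se)
print(f"theta={theta:.3f}, estimate={theta_hat:.3f}, 95% CI=({ci[0]:.3f}, {ci[1]:.3f})")
```

FAQ's claim is that the same interval width can be reached with far fewer queries than this baseline requires.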

2. Bayesian Factor Model for Historical Data

FAQ models historical outcomes $H$ via a low-rank Bayesian factor model. Each historical model $i$ is represented by a $k$-dimensional latent proficiency $u_i\in\mathbb{R}^k$; each question $j$ by a latent requirement $v_j\in\mathbb{R}^k$. The conditional probability of correctness is

$$\Pr(H_{ij}=1 \mid u_i, v_j) = \sigma(u_i^\top v_j), \qquad \sigma(x)=\frac{1}{1+e^{-x}}.$$

Parameter estimation minimizes the masked logistic loss with $\ell_2$ regularization:
$$L(\{u_i\},\{v_j\}) = -\sum_{i,j} O_{ij}\left[H_{ij}\ln\sigma(u_i^\top v_j) + (1-H_{ij})\ln\big(1-\sigma(u_i^\top v_j)\big)\right] + \frac{\lambda}{2}\Big(\sum_i \|u_i\|^2 + \sum_j \|v_j\|^2\Big),$$
where $O_{ij}=1$ iff $H_{ij}$ is observed; $(k,\lambda)$ are selected by cross-validation.
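A minimal sketch of this fit, using plain gradient descent on a toy $H$ with a random observation mask (the paper does not specify the optimizer; all sizes and hyperparameters here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy historical matrix H with missing entries (sizes are illustrative only).
N_old, N_q, k = 40, 200, 3
U_true = rng.normal(size=(N_old, k))
V_true = rng.normal(size=(N_q, k))
P = 1.0 / (1.0 + np.exp(-(U_true @ V_true.T)))
H = (rng.random((N_old, N_q)) < P).astype(float)
O = (rng.random((N_old, N_q)) < 0.5).astype(float)   # O_ij = 1 iff H_ij observed

def masked_loss(U, V, lam=0.1):
    """Masked logistic log-loss plus l2 penalty, as in the displayed objective."""
    S = 1.0 / (1.0 + np.exp(-(U @ V.T)))
    ll = H * np.log(S + 1e-12) + (1.0 - H) * np.log(1.0 - S + 1e-12)
    return -(O * ll).sum() + 0.5 * lam * ((U ** 2).sum() + (V ** 2).sum())

lam, lr = 0.1, 0.002
U = 0.1 * rng.normal(size=(N_old, k))
V = 0.1 * rng.normal(size=(N_q, k))
loss_before = masked_loss(U, V, lam)
for _ in range(300):
    S = 1.0 / (1.0 + np.exp(-(U @ V.T)))
    G = O * (S - H)                       # gradient of the masked log-loss wrt logits
    U, V = U - lr * (G @ V + lam * U), V - lr * (G.T @ U + lam * V)
loss_after = masked_loss(U, V, lam)
print(f"loss: {loss_before:.1f} -> {loss_after:.1f}")
```

In practice one would also cross-validate $(k,\lambda)$ on held-out observed entries, as the text describes.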

For a new model, the prior is $u\sim\mathcal{N}(\hat u^{(0)},\hat\Sigma^{(0)})$, with moments taken empirically from the historical $u_i$. After each query $I_t$, the approximate posterior $u\mid\mathcal{F}_{t-1}\approx\mathcal{N}(\hat u^{(t-1)},\hat\Sigma^{(t-1)})$ is updated by a Laplace approximation:
$$\begin{aligned}
\hat p^{(t-1)}_{I_t} &= \sigma\big(\hat u^{(t-1)\top} v_{I_t}\big), \qquad w = \hat p^{(t-1)}_{I_t}\big(1-\hat p^{(t-1)}_{I_t}\big), \\
\hat\Sigma^{(t)} &= \hat\Sigma^{(t-1)} - \frac{w\,\hat\Sigma^{(t-1)} v_{I_t} v_{I_t}^\top \hat\Sigma^{(t-1)}}{1 + w\, v_{I_t}^\top \hat\Sigma^{(t-1)} v_{I_t}}, \\
\hat u^{(t)} &= \hat u^{(t-1)} + \hat\Sigma^{(t)} \big(z_{I_t} - \hat p^{(t-1)}_{I_t}\big) v_{I_t}.
\end{aligned}$$
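The update is a rank-one operation and costs $O(k^2)$ per query. A direct transcription (variable names are ours):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def laplace_update(u_hat, Sigma_hat, v, z):
    """One FAQ posterior update after observing correctness z on a question
    with embedding v, following the rank-one Laplace formulas above."""
    p = sigmoid(u_hat @ v)
    w = p * (1.0 - p)
    Sv = Sigma_hat @ v
    Sigma_new = Sigma_hat - (w * np.outer(Sv, Sv)) / (1.0 + w * (v @ Sv))
    u_new = u_hat + Sigma_new @ ((z - p) * v)
    return u_new, Sigma_new

# Illustrative use with made-up numbers: a correct answer (z=1) on a question
# aligned with the first latent axis pulls the mean that way and shrinks variance.
k = 3
u, Sigma = np.zeros(k), np.eye(k)
v = np.array([1.0, -0.5, 0.2])
u, Sigma = laplace_update(u, Sigma, v, z=1.0)
print(u, np.diag(Sigma))
```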

3. Hybrid Variance-Reduction / Active-Learning Sampling Policy

At each round $t$, FAQ computes a query distribution $q_t$ over $\{1,\dots,N_q\}$, balancing:

  • Oracle variance reduction: for a Bernoulli superpopulation, the minimal-variance sampling distribution is $q^*(j)\propto\sqrt{p_j(1-p_j)}$. In practice FAQ uses the plug-in score $s^{(t)}_{\mathrm{o}}(j)=\sqrt{\hat p^{(t-1)}_j(1-\hat p^{(t-1)}_j)}$.
  • Active learning: a score based on the estimated reduction in $\mathrm{Var}(\theta\mid\mathcal{F}_{t-1})$; $s^{(t)}_{\mathrm{a}}(j)$ denotes the one-step variance decrease, computed with a Laplace/delta approximation.

Normalizing, mixing, and tempering these scores yields the probabilities
$$h^{(t)}_{\mathrm{o}}(j) = \frac{s^{(t)}_{\mathrm{o}}(j)}{\sum_i s^{(t)}_{\mathrm{o}}(i)}, \qquad h^{(t)}_{\mathrm{a}}(j) = \frac{s^{(t)}_{\mathrm{a}}(j)}{\sum_i s^{(t)}_{\mathrm{a}}(i)},$$

$$h^{(t)}_{\mathrm{cat}}(j) = \Big((1-\alpha_t)\, h^{(t)}_{\mathrm{o}}(j) + \alpha_t\, h^{(t)}_{\mathrm{a}}(j)\Big)^{\beta_t}.$$

The final query distribution is

$$q_t(j) = \frac{\tau}{N_q} + (1-\tau)\,\mathrm{Norm}\big(h^{(t)}_{\mathrm{cat}}\big)(j).$$

The adaptive weights $\alpha_t$ and $\beta_t$ and the uniform floor $\tau$ control exploration and exploitation. This composite policy concentrates queries where they are most informative, both for reducing uncertainty in the accuracy estimate and for accelerating adaptation of the latent factor of the new LLM.
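The mixing step is a few lines of array arithmetic. In this sketch the active-learning score is a random stand-in (computing the true one-step variance reduction requires the posterior machinery of Section 2), and the hyperparameter values are illustrative:

```python
import numpy as np

def query_distribution(s_o, s_a, alpha_t, beta_t, tau):
    """Normalize, mix, temper, and floor the two score vectors into q_t."""
    h_o = s_o / s_o.sum()
    h_a = s_a / s_a.sum()
    h_cat = ((1.0 - alpha_t) * h_o + alpha_t * h_a) ** beta_t
    return tau / len(s_o) + (1.0 - tau) * h_cat / h_cat.sum()

rng = np.random.default_rng(2)
N_q = 1000
p_hat = rng.uniform(0.05, 0.95, size=N_q)
s_o = np.sqrt(p_hat * (1.0 - p_hat))     # plug-in Neyman (oracle) score
s_a = rng.uniform(0.0, 1.0, size=N_q)    # stand-in for the variance-reduction score
q = query_distribution(s_o, s_a, alpha_t=0.3, beta_t=0.8, tau=0.1)
print(q.sum(), q.min())
```

The $\tau/N_q$ floor guarantees every question retains positive sampling probability, which is what keeps the inverse-propensity correction in Section 4 well defined.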

Pseudocode

Algorithm: Factorized Active Querying (FAQ)
Input: historical H, budget n_b, level α
1. Fit factor model on H → prior N(û⁽⁰⁾, Σ̂⁽⁰⁾)
2. for t=1…n_b do
     • Compute scores s_o, s_a from û⁽ᵗ⁻¹⁾, Σ̂⁽ᵗ⁻¹⁾
     • Form q_t via (3.3): mix/temper oracle & active
     • Sample I_t∼q_t; query z_{I_t}
     • Update û⁽ᵗ⁾,Σ̂⁽ᵗ⁾ via (2.3)–(2.4)
     • Compute φ_t = N_q⁻¹∑_j p̂_j⁽ᵗ⁻¹⁾ + N_q⁻¹ (z_{I_t}-p̂_{I_t}⁽ᵗ⁻¹⁾)/q_t(I_t)
   end for
3. Estimate θ̂= n_b⁻¹∑_t φ_t
4. Estimate variance σ̂² via martingale-variance formula
5. Return θ̂ ± z_{1-α/2}·σ̂/√n_b
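The loop above can be exercised end-to-end on synthetic data. In this sketch the question embeddings $v_j$ are assumed known rather than fit from $H$, the policy uses only the oracle score (i.e., $\alpha_t=0$, $\beta_t=1$), and the martingale variance is approximated by the empirical variance of the $\phi_t$; all sizes and constants are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

# Synthetic bank: known question embeddings v_j and a hidden new-model skill u*.
N_q, k, n_b = 2000, 3, 300
V = rng.normal(size=(N_q, k))
u_star = rng.normal(size=k)
z = (rng.random(N_q) < sigmoid(V @ u_star)).astype(float)
theta = z.mean()                                   # finite-population accuracy

u_hat, Sigma = np.zeros(k), np.eye(k)              # prior N(u^(0), Sigma^(0))
tau = 0.2
phis = []
for t in range(n_b):
    p_hat = sigmoid(V @ u_hat)
    s_o = np.sqrt(p_hat * (1.0 - p_hat))           # oracle-only scores
    q = tau / N_q + (1.0 - tau) * s_o / s_o.sum()  # floored query distribution
    q /= q.sum()
    j = rng.choice(N_q, p=q)
    # PAI estimator phi_t: prediction mean + inverse-propensity correction.
    phis.append(p_hat.mean() + (z[j] - p_hat[j]) / (N_q * q[j]))
    # Laplace rank-one posterior update.
    p, w = p_hat[j], p_hat[j] * (1.0 - p_hat[j])
    Sv = Sigma @ V[j]
    Sigma = Sigma - (w * np.outer(Sv, Sv)) / (1.0 + w * (V[j] @ Sv))
    u_hat = u_hat + Sigma @ ((z[j] - p) * V[j])

phis = np.array(phis)
theta_hat = phis.mean()
sigma_hat = phis.std()                             # simplified variance estimate
ci = (theta_hat - 1.96 * sigma_hat / np.sqrt(n_b),
      theta_hat + 1.96 * sigma_hat / np.sqrt(n_b))
print(f"theta={theta:.3f}, theta_hat={theta_hat:.3f}, CI=({ci[0]:.3f}, {ci[1]:.3f})")
```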

4. Proactive Active Inference and Confidence Interval Validity

FAQ employs Proactive Active Inference (PAI) to ensure unbiasedness and CI coverage under finite-population, adaptive sampling. At round tt, the estimator

$$\phi_t = \frac{1}{N_q}\sum_{j=1}^{N_q} \hat p^{(t-1)}_j + \frac{1}{N_q}\,\frac{z_{I_t} - \hat p^{(t-1)}_{I_t}}{q_t(I_t)}$$

satisfies $\mathbb{E}[\phi_t \mid \mathcal{F}_{t-1}] = \theta$, so $\phi_t - \theta$ is a martingale difference. The accumulated average $\hat\theta_{n_b} = \frac{1}{n_b}\sum_{t=1}^{n_b}\phi_t$ is therefore unbiased for $\theta$ for any $n_b$.
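This unbiasedness is an exact algebraic identity, holding for arbitrary (even badly wrong) predictions $\hat p_j$ and any positive $q_t$, which is the source of the model-free guarantee. A quick numerical check (all inputs here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
N_q = 500
z = (rng.random(N_q) < 0.6).astype(float)
theta = z.mean()
p_hat = rng.uniform(0.1, 0.9, size=N_q)           # arbitrary predictions
q = rng.uniform(0.5, 2.0, size=N_q); q /= q.sum() # arbitrary positive query dist

# E[phi_t | F_{t-1}] = sum_j q(j) * phi_t(j): the q(j) cancels the 1/q(j)
# in the correction term, leaving exactly theta.
phi = p_hat.mean() + (z - p_hat) / (N_q * q)
expected = (q * phi).sum()
print(expected, theta)
```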

The variance estimator is

$$\hat\sigma_{n_b}^2 = \frac{1}{n_b N_q^2} \sum_{t=1}^{n_b} \frac{\big(z_{I_t} - \hat p^{(t-1)}_{I_t}\big)^2}{q_t(I_t)^2} - \frac{1}{n_b N_q^2} \sum_{t=1}^{n_b} \left( \frac{1}{n_b}\sum_{s=1}^{n_b} \frac{z_{I_s} - \hat p^{(s-1)}_{I_s}}{q_s(I_s)} \right)^2.$$

A martingale CLT justifies reporting the Wald-type interval $\hat\theta_{n_b} \pm z_{1-\alpha/2}\,\hat\sigma_{n_b}/\sqrt{n_b}$, yielding asymptotic $(1-\alpha)$ coverage without independence or correct-model assumptions.

5. Theoretical Properties and Effective Sample Size

  • The oracle minimal-variance policy (Theorem 3.1) in a Bernoulli superpopulation is $q^*(j)\propto\sqrt{p_j(1-p_j)}$.
  • Under boundedness and statistical regularity, a martingale CLT (Theorem 4.1) applies:

$$\sqrt{n_b}\big(\hat\theta_{n_b}-\theta\big) \xrightarrow{d} \mathcal{N}(0,\sigma^2).$$

  • Effective sample size is defined as $n_{\mathrm{eff}} = \sigma^2_{\mathrm{uniform}} / \sigma^2_{M}$ for method $M$. FAQ achieves ESS multipliers of up to $5\times$ versus uniform sampling and $2.4\times$ over the strongest active-inference baselines.
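The ESS multiplier is just a ratio of asymptotic variances, so a method with one fifth the variance of uniform sampling matches its CI width with one fifth the queries. With illustrative (made-up) variance values:

```python
# ESS multiplier of a method M relative to uniform sampling: a multiplier of
# n_eff means the same CI width is reached with n_eff-fold fewer queries.
sigma2_uniform = 4.0e-4   # illustrative asymptotic variance, uniform sampling
sigma2_M = 0.8e-4         # illustrative asymptotic variance, method M
n_eff_multiplier = sigma2_uniform / sigma2_M
print(n_eff_multiplier)
```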

6. Benchmarking, Evaluation, and Empirical Results

FAQ was evaluated on historical outcomes for $4.4$K LLMs and $21.6$K questions, partitioned into two suites: MMLU-Pro ($N_q = 12{,}032$) and a combined suite of BBH, GPQA, IFEval, MATH, and MuSR ($N_q = 9{,}574$). The historical-test split assigns $2.2$K models to the historical matrix $H$ and $2.2$K to testing. Experiments simulated various levels of missingness, ranging from fully observed to sparse ($0.1\%$ observed).

Baselines included uniform sampling (Wald intervals) and sequential active inference (AIPW) [Zrnic & Candes, 2024], with oracle-tuned labeling policies. At a query budget of $n_b = 7.5\%\cdot N_q$, the following empirical results were observed:

| Method    | ESS× (MMLU-Pro) | Coverage (MMLU-Pro) | ESS× (BBH+…) | Coverage (BBH+…) |
|-----------|-----------------|---------------------|--------------|------------------|
| Uniform   | 1.00            | 0.95                | 1.00         | 0.95             |
| Best AIPW | 1.8             | 0.94                | 2.2          | 0.94             |
| FAQ       | 4.5             | 0.95                | 4.8          | 0.95             |

FAQ’s CI widths are about one fifth of those of uniform sampling at small budgets, with coverage consistently close to $1-\alpha$. Coverage remains stable across model vintages and true accuracies, without systematic bias. A plausible implication is that FAQ offers substantial practical value in high-throughput LLM evaluation under constrained annotation budgets while maintaining rigorous statistical guarantees.

7. Conclusion and Significance

FAQ integrates a Bayesian logistic factor model, a hybrid Neyman/active-learning adaptive sampling policy, and a martingale-based inference engine via PAI, yielding efficient, robust, and statistically valid benchmarking of LLMs on large question banks. In comprehensive empirical evaluation, FAQ achieves up to $5\times$ cost savings over uniform sampling at equivalent CI width, with minor computational overhead and flexible adaptation to varying levels of historical data availability (Wu et al., 28 Jan 2026). Its methodology and codebase provide a reproducible and extensible framework for principled evaluation in the era of rapidly proliferating LLM variants.
