
BIG-bench Lite for LLM Evaluation

Updated 22 November 2025
  • BIG-bench Lite is an evaluation suite for large language models that uses a strategically selected subset of tasks to approximate full benchmark performance.
  • It employs a four-layer MLP regression model and task embeddings to achieve high predictive fidelity across diverse model families.
  • The methodology leverages data-driven subset selection and clustering to balance evaluation cost with comprehensive performance insights.

BIG-bench Lite (BBL) is an evaluation suite for LLMs consisting of a compact subset of BIG-bench tasks. BIG-bench Lite, along with even smaller variants (“small-bench” suites), is designed to efficiently approximate the predictive fidelity of the full BIG-bench, enabling more scalable and representative evaluation of LLMs. The underlying methodology combines performance prediction using multi-layer perceptrons (MLPs), formal subset-selection optimization, and clustering of learned task embeddings, establishing that LLM capabilities are highly predictable across model families when evaluated on a strategically chosen subset of tasks (Ye et al., 2023).

1. Overview of BIG-bench Lite and “Small-bench” Suites

BIG-bench (Beyond the Imitation Game Benchmark) comprises 313 diverse subtasks for LLM evaluation after filtering. Evaluating a new model across the full suite is costly and often redundant. BIG-bench Lite includes 42 subtasks, striking a balance between evaluative power and economy. Research demonstrates that subsets as small as 8–16 tasks suffice to recover close to full-bench predictive fidelity, provided the tasks are selected with data-driven techniques rather than at random. These compact suites are called “small-bench” in the literature.

2. Performance Prediction Model for Task Subset Selection

Each experiment record is formalized as $(\ell, n_{\mathrm{param}}, t, n_{\mathrm{shot}}, y)$, where:

  • $\ell$ is the model family (BIG-G₀, BIG-G₁, BIG-G sparse, PaLM, GPT-3, Gopher)
  • $n_{\mathrm{param}}$ denotes model size (number of parameters)
  • $t$ specifies the subtask identity (out of 313 subtasks)
  • $n_{\mathrm{shot}}$ is the number of in-context examples (0, 1, 2, 3, 5)
  • $y \in [0,1]$ is the normalized performance metric for that configuration

The regression objective is to learn $\hat{y} = f(\ell, n_{\mathrm{param}}, t, n_{\mathrm{shot}})$, predicting task performance from model and task features.

Featurization involves the following (a code sketch follows the list):

  • One-hot encoding $\ell$ into six binary flags
  • One-hot encoding $(\ell, n_{\mathrm{param}})$ pairs
  • Six numerical features for $n_{\mathrm{param}}$ (total parameters, non-embedding parameters, FLOP-matched parameters, and their logs)
  • One-hot encoding $t$ over 313 subtasks
  • $n_{\mathrm{shot}}$, standardized
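
As a concrete illustration, the following sketch assembles such a feature vector. The (family, size)-pair count, the standardization constants, and all names are hypothetical stand-ins, not the reference implementation.

```python
import numpy as np

N_FAMILIES = 6      # BIG-G0, BIG-G1, BIG-G sparse, PaLM, GPT-3, Gopher
N_PAIRS = 24        # hypothetical count of distinct (family, size) pairs
N_SUBTASKS = 313    # BIG-bench subtasks after filtering

def one_hot(index: int, size: int) -> np.ndarray:
    v = np.zeros(size)
    v[index] = 1.0
    return v

def featurize(family_id: int, pair_id: int, task_id: int,
              n_shot: int, param_counts: np.ndarray) -> np.ndarray:
    """Assemble the input vector x for one (model, task, n_shot) record.

    param_counts: three raw counts (total, non-embedding, FLOP-matched);
    together with their logs they give the six numeric size features.
    """
    size_feats = np.concatenate([param_counts, np.log(param_counts)])
    shot_feat = np.array([(n_shot - 2.2) / 1.7])  # mean/std of {0,1,2,3,5}
    return np.concatenate([
        one_hot(family_id, N_FAMILIES),   # model family flags
        one_hot(pair_id, N_PAIRS),        # (family, size) pair flags
        size_feats,                       # six numeric size features
        one_hot(task_id, N_SUBTASKS),     # subtask identity
        shot_feat,                        # standardized n_shot
    ])
```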

The regression model is a four-layer MLP:

$$
\begin{aligned}
h^1 &= \mathrm{Dropout}(\sigma(W^1 x + b^1)) \\
h^2 &= \mathrm{Dropout}(\sigma(W^2 h^1 + b^2)) \\
h^3 &= \mathrm{Dropout}(\sigma(W^3 h^2 + b^3)) \\
\hat{y} &= \mathrm{sigmoid}(W^4 h^3 + b^4)
\end{aligned}
$$

where $\sigma$ is ReLU, dropout is ≈ 0.1, the hidden sizes are [256, 128, 64, 32], and the sigmoid bounds predictions to $[0,1]$.
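
A minimal PyTorch sketch of this architecture, assuming the stated hidden sizes and dropout (training details in the original work may differ):

```python
import torch
import torch.nn as nn

class PerformancePredictor(nn.Module):
    """MLP with ReLU, dropout 0.1, and a sigmoid output bounded to [0, 1]."""

    def __init__(self, input_dim: int, hidden=(256, 128, 64, 32), p_drop=0.1):
        super().__init__()
        layers, prev = [], input_dim
        for h in hidden:
            layers += [nn.Linear(prev, h), nn.ReLU(), nn.Dropout(p_drop)]
            prev = h
        layers += [nn.Linear(prev, 1), nn.Sigmoid()]  # scalar score in [0, 1]
        self.net = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)
```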

The loss function is mean squared error:

$$L(\theta) = \frac{1}{N}\sum_i (\hat{y}_i - y_i)^2$$

In 10-fold cross-validation, this predictor achieves RMSE ≈ 0.05 and $R^2 > 0.95$, indicating high accuracy and the presence of strong, learnable patterns in the data.
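
A hedged sketch of such a 10-fold evaluation, reusing the PerformancePredictor above and assuming X and y are the assembled features and normalized scores; the optimizer and epoch count are illustrative choices, not taken from the paper:

```python
import numpy as np
import torch
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error, r2_score

def cross_validate(X: np.ndarray, y: np.ndarray, n_epochs: int = 200):
    """Return mean RMSE and R^2 over 10 folds (full-batch Adam training)."""
    rmses, r2s = [], []
    for train_idx, test_idx in KFold(n_splits=10, shuffle=True).split(X):
        model = PerformancePredictor(X.shape[1])
        opt = torch.optim.Adam(model.parameters(), lr=1e-3)
        loss_fn = torch.nn.MSELoss()
        Xtr = torch.tensor(X[train_idx], dtype=torch.float32)
        ytr = torch.tensor(y[train_idx], dtype=torch.float32)
        for _ in range(n_epochs):
            opt.zero_grad()
            loss = loss_fn(model(Xtr), ytr)  # mean squared error objective
            loss.backward()
            opt.step()
        model.eval()
        with torch.no_grad():
            pred = model(torch.tensor(X[test_idx], dtype=torch.float32)).numpy()
        rmses.append(mean_squared_error(y[test_idx], pred) ** 0.5)
        r2s.append(r2_score(y[test_idx], pred))
    return float(np.mean(rmses)), float(np.mean(r2s))
```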

3. Subset Selection as a Constrained Optimization Problem

The search for an informative, compact subset (“small-bench”) is formalized as a subset-selection optimization. When a new model family $\ell_{\mathrm{test}}$ is introduced, the objective is to select a training subset $T_{\mathrm{train}} \subset T$ of size $b$ such that performance on all other tasks for the new family is maximally recoverable:

$$T^* = \arg\max_{T' \subseteq T,\ |T'| = b}\ \frac{1}{K}\sum_{k=1}^{K} R^2\big((T \setminus T') \times \{\ell_k\}\big)$$

where $R^2$ measures variance explained, $K = 6$ is the number of model families, and nested cross-validation (holding out one family at a time) prevents overfitting. The solution $T^*$ defines the optimal “small-bench” for a given budget $b$.
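
The exact argmax over size-$b$ subsets is combinatorially expensive, so in practice it is approximated; one simple approximation consistent with the “Best of 5000” procedure referenced below is random search. In this sketch, score_subset is an assumed callable that retrains the predictor with the held-out family restricted to the candidate subset and returns the averaged held-out $R^2$:

```python
import random

def best_of_n(tasks: list[int], b: int, score_subset, n_candidates: int = 5000):
    """Approximate T* by scoring n_candidates random size-b subsets.

    score_subset(subset) is assumed to evaluate the mean R^2 on the
    remaining tasks, averaged over the K leave-one-family-out folds.
    """
    best_subset, best_score = None, float("-inf")
    for _ in range(n_candidates):
        candidate = random.sample(tasks, b)
        score = score_subset(candidate)
        if score > best_score:
            best_subset, best_score = candidate, score
    return best_subset, best_score
```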

4. Clustering-Based Construction Using MLP Task Embeddings

Task embeddings are extracted from the MLP as the first-layer weight vector $W^1_t$ connecting each subtask's one-hot encoding to the first hidden layer. For $t$'s one-hot vector $e_t \in \{0,1\}^{|T|}$, the embedding is $h_t = W^1 e_t \in \mathbb{R}^d$.
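
Under the PyTorch sketch above, these embeddings are columns of the first Linear layer's weight matrix; task_offset (the position of the task one-hot block within $x$) depends on the feature layout and is an assumption here:

```python
import numpy as np

def task_embeddings(model: PerformancePredictor, task_offset: int,
                    n_subtasks: int = 313) -> np.ndarray:
    """Return an (n_subtasks, d) matrix of embeddings h_t = W^1 e_t.

    Multiplying W^1 by a one-hot vector e_t just selects column t, so we
    slice the columns covering the task one-hot block.
    """
    W1 = model.net[0].weight.detach().numpy()             # (256, input_dim)
    return W1[:, task_offset:task_offset + n_subtasks].T  # (313, 256)
```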

Given a bench size budget $b$, the set $\{h_t\}$ is clustered via $k$-means (a code sketch follows the list):

  • Tasks are grouped into $b$ clusters $C_1, \ldots, C_b$ to minimize intra-cluster $L^2$ distance.
  • For each cluster $C_k$, the task nearest to centroid $\mu_k$ is selected: $t_k = \arg\min_{t \in C_k} \| h_t - \mu_k \|$.
  • The set $\{t_1, \ldots, t_b\}$ forms the cluster-based “small-bench.”
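
A minimal scikit-learn sketch of this selection step, assuming H is the (n_tasks, d) embedding matrix from the previous sketch:

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_small_bench(H: np.ndarray, b: int, seed: int = 0) -> list[int]:
    """Pick b tasks: cluster embeddings into b groups, then take the task
    whose embedding is nearest to each cluster centroid."""
    km = KMeans(n_clusters=b, n_init=10, random_state=seed).fit(H)
    chosen = []
    for k in range(b):
        members = np.where(km.labels_ == k)[0]
        dists = np.linalg.norm(H[members] - km.cluster_centers_[k], axis=1)
        chosen.append(int(members[np.argmin(dists)]))
    return chosen
```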

To improve informativeness, tasks are further ranked by “task value,” defined as the frequency with which a task appears in high-scoring random subsets (“Best of 5000”). Cluster selection is then restricted to the top 25% of tasks by value, ensuring the final subset combines diversity with informativeness.
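
One plausible realization of this value ranking (the top-5% cutoff for “high-scoring” subsets is an illustrative choice; scored_subsets is assumed to be the (subset, $R^2$) pairs collected during the random search above):

```python
from collections import Counter

import numpy as np

def high_value_tasks(scored_subsets, n_tasks: int = 313, top_frac: float = 0.25):
    """Rank tasks by how often they appear in the best-scoring random
    subsets and return the top 25% of tasks by that frequency."""
    ranked = sorted(scored_subsets, key=lambda pair: pair[1], reverse=True)
    best = ranked[: max(1, len(ranked) // 20)]   # keep, e.g., the top 5%
    counts = Counter(t for subset, _ in best for t in subset)
    values = np.array([counts.get(t, 0) for t in range(n_tasks)])
    k = max(1, int(top_frac * n_tasks))
    return [int(t) for t in np.argsort(values)[-k:]]
```

Cluster-based selection can then be run on H restricted to the returned tasks, with the cluster picks mapped back to original task IDs.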

5. Empirical Evaluation and Predictive Fidelity

The following summarizes key empirical findings for “small-bench” construction:

| Suite | Subtasks ($b$) | $R^2$ (averaged over 30 runs) |
|---|---|---|
| Random | 8 | 0.65 |
| Best-of-5000 | 8 | 0.80 |
| $k$-means | 8 | 0.75 |
| BIG-bench Hard | 24 | 0.75 |
| $k$-means+value | 24 | 0.84 |
| BIG-bench Lite | 42 | 0.78 |
| $k$-means+value | 42 | 0.88 |

The “$k$-means+value” variants consistently outperform static subsets like BIG-bench Hard and BIG-bench Lite, reaching $R^2 \approx 0.84$ (24 tasks) and $R^2 \approx 0.88$ (42 tasks), meaning they recover 84–88% of full-bench variance in held-out predictions with $\ll 20\%$ of the tasks.

In performance-ranking comparisons, errors highlight inherent stochasticity: using BIG-bench Hard (24 tasks) to compare BIG-G₁ 2B vs. GPT-3 Large yields the wrong winner in 70% of cases, whereas a 24-task Best-of-5000 bench selects the correct winner 56% of the time and ties in 25%, agreeing more closely with full-bench results.
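
A minimal sketch of such a pairwise comparison, assuming per-task normalized scores for two models; the tie tolerance is a hypothetical parameter, not from the source:

```python
import numpy as np

def subset_winner(scores_a: np.ndarray, scores_b: np.ndarray,
                  subset: list[int], tie_tol: float = 0.01) -> str:
    """Declare a winner by mean normalized score over a task subset."""
    diff = scores_a[subset].mean() - scores_b[subset].mean()
    if abs(diff) < tie_tol:
        return "tie"
    return "A" if diff > 0 else "B"
```

Agreement with the full bench can then be estimated by checking how often subset_winner over a candidate subset matches the winner computed over all tasks.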

6. Implications and Practical Considerations

The methodology demonstrates that LLM performance is highly predictable from configuration variables and can be reliably extrapolated using regression over informative task subsets. This suggests that evaluation suite design can move beyond manual curation and arbitrary subsetting towards principled, data-driven approaches that optimize for predictive sufficiency and efficiency.

A plausible implication is that for most practical large-scale benchmarking, large suites like BIG-bench can be replaced by compact, model-agnostic small benches that retain task diversity and informativeness via embedding-based clustering and value reweighting. This has direct benefits for benchmarking cost, scaling evaluations to new model families, and mitigating overfitting to outdated task baskets.

7. Relation to Other Benchmarks and Future Directions

BIG-bench Lite and “small-bench” methodologies contrast with static evaluation suites (e.g., BIG-bench Hard), which may underrepresent the diversity required for accurate generalization across model families. By leveraging latent task embeddings and cross-family validation, these methods help ensure robustness and minimize error in comparative evaluations.

Future directions include investigating the limits of predictability as model architectures further diversify, extending embedding and clustering workflows to new or evolving task sets, and formally integrating task subset selection into automated LLM evaluation pipelines. Task value estimation and dynamic task basket construction represent ongoing areas of methodological refinement (Ye et al., 2023).

References

  • Ye, Q., Fu, H. Y., Ren, X., & Jia, R. (2023). How Predictable Are Large Language Model Capabilities? A Case Study on BIG-bench. arXiv:2305.14947.