BIG-bench Lite for LLM Evaluation
- BIG-bench Lite is an evaluation suite for large language models that uses a strategically selected subset of tasks to approximate full benchmark performance.
- The underlying methodology trains a four-layer MLP regression model and learned task embeddings to achieve high predictive fidelity across diverse model families.
- The methodology leverages data-driven subset selection and clustering to balance evaluation cost with comprehensive performance insights.
BIG-bench Lite (BBL) is an evaluation suite for LLMs that consists of a compact subset of BIG-bench tasks. BIG-bench Lite, along with even smaller variants (“small-bench” suites), is designed to efficiently approximate the predictive fidelity of the full BIG-bench, enabling more scalable and representative evaluation of LLMs. The underlying methodology combines performance prediction with multi-layer perceptrons (MLPs), formal subset-selection optimization, and clustering of learned task embeddings, establishing that LLM capabilities are highly predictable across model families when evaluated on a strategically chosen subset of tasks (Ye et al., 2023).
1. Overview of BIG-bench Lite and “Small-bench” Suites
BIG-bench (Beyond the Imitation Game Benchmark) comprises 313 diverse subtasks for LLM evaluation after filtering. Evaluating a new model across the full suite is costly and often redundant. BIG-bench Lite includes 42 subtasks, striking a balance between evaluative power and economy. Research demonstrates that task subsets as small as 8–16 are sufficient for recovering close to full-bench predictive fidelity, provided that tasks are selected using data-driven techniques rather than at random. These compact suites are called “small-bench” in the literature.
2. Performance Prediction Model for Task Subset Selection
Each experiment record is formalized as a tuple $(\ell, n, t, k, y)$, where:
- $\ell$ is the model family ($\text{BIG-G}_0$, $\text{BIG-G}_1$, BIG-G sparse, PaLM, GPT-3, Gopher)
- $n$ denotes model size (number of parameters)
- $t$ specifies the subtask identity (out of 313 subtasks)
- $k$ is the number of in-context examples (0, 1, 2, 3, 5)
- $y$ is the normalized performance metric for that configuration
The regression objective is to learn $\hat{y} = g(\ell, n, t, k)$, predicting task performance from model and task features.
Featurization involves (a brief sketch follows this list):
- One-hot encoding the model family $\ell$ into six binary flags
- One-hot encoding $(\ell, n)$ pairs
- Six numerical features for $n$ (total parameters, non-embedding parameters, FLOP-matched parameters, and their logs)
- One-hot encoding the subtask $t$ over 313 subtasks
- The shot count $k$, standardized
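A minimal Python sketch of this featurization is shown below. The record schema and column names are illustrative assumptions rather than the authors' actual pipeline:

```python
import numpy as np
import pandas as pd

# Two illustrative records; column names are assumptions, not the paper's schema.
records = pd.DataFrame({
    "family":   ["BIG-G_0", "GPT-3"],                  # one of six model families
    "subtask":  ["arithmetic", "boolean_expressions"], # one of 313 subtasks
    "n_params": [2e9, 175e9],                          # total parameters
    "n_nonemb": [1.8e9, 174e9],                        # non-embedding parameters
    "n_flop":   [2e9, 175e9],                          # FLOP-matched parameter count
    "shots":    [0, 3],                                # in-context examples k
})

def featurize(df: pd.DataFrame) -> np.ndarray:
    # One-hot flags for family and subtask (in practice the category sets would be
    # fixed to all 6 families and all 313 subtasks, not inferred from the data).
    fam_onehot  = pd.get_dummies(df["family"], prefix="fam")
    task_onehot = pd.get_dummies(df["subtask"], prefix="task")
    # Six numeric size features: three parameter counts and their logs.
    sizes = df[["n_params", "n_nonemb", "n_flop"]].to_numpy(dtype=float)
    size_feats = np.hstack([sizes, np.log(sizes)])
    # Standardized shot count.
    shots = df[["shots"]].to_numpy(dtype=float)
    shots = (shots - shots.mean()) / (shots.std() + 1e-8)
    # (The full featurization also one-hot encodes (family, size) pairs; omitted here.)
    return np.hstack([fam_onehot.to_numpy(dtype=float),
                      task_onehot.to_numpy(dtype=float),
                      size_feats, shots])

X = featurize(records)   # shape: (n_records, n_features)
```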
The regression model is a four-layer MLP:
$$\hat{y} = \sigma\!\left(w^{\top} h_4 + b\right), \qquad h_i = \phi\!\left(W_i h_{i-1} + b_i\right),\quad h_0 = x,$$
where $\phi$ is ReLU, dropout ≈ 0.1 is applied between layers, the hidden sizes are [256, 128, 64, 32], and the output sigmoid $\sigma$ bounds predictions to $[0, 1]$.
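As a concrete illustration, a minimal PyTorch sketch of such a predictor with the hyperparameters stated above (the exact architecture and training setup in Ye et al. (2023) may differ) could look like:

```python
import torch
import torch.nn as nn

class PerformancePredictor(nn.Module):
    """MLP with hidden sizes [256, 128, 64, 32], ReLU, dropout, and a sigmoid head."""

    def __init__(self, in_dim: int, hidden=(256, 128, 64, 32), p_drop: float = 0.1):
        super().__init__()
        layers, prev = [], in_dim
        for h in hidden:
            layers += [nn.Linear(prev, h), nn.ReLU(), nn.Dropout(p_drop)]
            prev = h
        layers += [nn.Linear(prev, 1), nn.Sigmoid()]   # bound predictions to (0, 1)
        self.net = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)

# Example usage with an arbitrary feature dimension (640 is purely illustrative).
model = PerformancePredictor(in_dim=640)
y_hat = model(torch.randn(4, 640))    # four predictions, each in (0, 1)
```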
The loss function is mean squared error:
$$\mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} \left(\hat{y}_i - y_i\right)^2$$
In 10-fold cross-validation, this predictor achieves RMSE ≈ 0.05 and a high $R^2$, indicating high accuracy and the presence of strong, learnable patterns in the data.
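This evaluation protocol can be approximated with a standard k-fold loop. The sketch below uses scikit-learn's MLPRegressor as a stand-in (it omits the dropout and sigmoid head described above), so it should be read as an approximation rather than the authors' exact setup:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error, r2_score

def cross_validate(X: np.ndarray, y: np.ndarray, n_splits: int = 10):
    """10-fold CV returning mean RMSE and mean R^2 of the performance predictor."""
    rmses, r2s = [], []
    for train_idx, test_idx in KFold(n_splits, shuffle=True, random_state=0).split(X):
        reg = MLPRegressor(hidden_layer_sizes=(256, 128, 64, 32),
                           activation="relu", max_iter=500)
        reg.fit(X[train_idx], y[train_idx])
        pred = np.clip(reg.predict(X[test_idx]), 0.0, 1.0)   # mimic the bounded output
        rmses.append(mean_squared_error(y[test_idx], pred) ** 0.5)
        r2s.append(r2_score(y[test_idx], pred))
    return float(np.mean(rmses)), float(np.mean(r2s))
```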
3. Subset Selection as a Constrained Optimization Problem
The search for an informative, compact subset (“small-bench”) is formalized as a subset-selection optimization. When a new model family is introduced, the objective is to select a training subset $S \subset T$ of size $|S| = b$ such that performance on all other tasks for the new family is maximally recoverable:
$$S^{*} = \arg\max_{S \subset T,\; |S| = b} \; \frac{1}{L} \sum_{\ell=1}^{L} R^{2}_{\ell}\!\left(T \setminus S \mid S\right),$$
where $R^{2}_{\ell}(T \setminus S \mid S)$ measures the variance explained on the remaining tasks when family $\ell$ is held out and only its scores on $S$ are available, $L$ is the number of model families, and nested cross-validation (holding out one family at a time) prevents overfitting. The solution $S^{*}$ defines the optimal “small-bench” for a given budget $b$.
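One simple way to approach this objective is random search over candidate subsets, which is also how the “Best of 5000” baseline discussed below can be read. In the sketch, the scoring function heldout_r2 is a hypothetical placeholder for the nested, family-held-out $R^2$ computation, and the exact search procedure in Ye et al. (2023) may differ:

```python
import random

def heldout_r2(subset: set) -> float:
    """Hypothetical placeholder: retrain the performance predictor using the held-out
    family's scores restricted to `subset`, predict the remaining tasks, and return
    the R^2, averaged over held-out families (nested cross-validation)."""
    raise NotImplementedError

def best_of_n_subsets(all_tasks: list, budget: int, n_candidates: int = 5000,
                      seed: int = 0) -> set:
    """Random-search approximation: sample candidate small-benches of size `budget`
    and keep the one with the highest held-out-family R^2."""
    rng = random.Random(seed)
    best_subset, best_score = None, float("-inf")
    for _ in range(n_candidates):
        candidate = set(rng.sample(all_tasks, budget))
        score = heldout_r2(candidate)
        if score > best_score:
            best_subset, best_score = candidate, score
    return best_subset
```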
4. Clustering-Based Construction Using MLP Task Embeddings
Task embeddings are extracted from the MLP as the first-layer weight vector connecting each subtask's one-hot encoding to the first hidden layer. For subtask $t$'s one-hot vector $e_t$, the embedding is $v_t = W_1 e_t$, i.e., the column of $W_1$ corresponding to $t$.
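Under the PyTorch sketch above (an illustrative assumption, not the authors' released code), extracting these embeddings amounts to slicing the columns of the first linear layer's weight matrix that multiply the subtask one-hot block:

```python
import torch

def task_embeddings(first_layer: torch.nn.Linear, task_slice: slice) -> torch.Tensor:
    """Return one embedding per subtask: the columns of the first-layer weights
    that act on the subtask one-hot block of the input features. `task_slice`
    marks where the 313 task indicators sit in the feature vector (illustrative)."""
    W1 = first_layer.weight.detach()     # shape: (hidden_dim, in_dim)
    return W1[:, task_slice].T           # shape: (n_tasks, hidden_dim)
```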
Given a bench size budget $b$, the set of task embeddings $\{v_t\}_{t=1}^{313}$ is clustered via $k$-means with $k = b$:
- Tasks are grouped into $b$ clusters to minimize intra-cluster distance.
- For each cluster $c_j$, the task nearest to its centroid $\mu_j$ is selected as $t_j$.
- The set $\{t_1, \dots, t_b\}$ forms the cluster-based “small-bench.”
To improve informativeness, tasks are further ranked by “task value”—frequency of appearance in high-scoring random subsets (“Best of 5000”). Cluster selection is restricted to the top 25% by value, ensuring the final subset combines both diversity and task informativeness.
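A minimal sketch of this cluster-then-pick construction, assuming the task embeddings and task values have already been computed (function and parameter names are illustrative, and scikit-learn's KMeans stands in for whatever clustering implementation the authors used):

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_value_bench(task_embs: np.ndarray, task_values: np.ndarray,
                       budget: int, top_frac: float = 0.25, seed: int = 0) -> list:
    """Cluster-based small-bench: keep the top `top_frac` of tasks by value, run
    k-means with k = budget on their embeddings, and pick the task nearest to
    each centroid. Returns the selected task indices."""
    n_tasks = task_embs.shape[0]
    keep = np.argsort(task_values)[-int(np.ceil(top_frac * n_tasks)):]  # top 25% by value
    emb = task_embs[keep]
    km = KMeans(n_clusters=budget, n_init=10, random_state=seed).fit(emb)
    chosen = []
    for j in range(budget):
        members = np.where(km.labels_ == j)[0]
        dists = np.linalg.norm(emb[members] - km.cluster_centers_[j], axis=1)
        chosen.append(int(keep[members[np.argmin(dists)]]))  # task nearest centroid j
    return chosen
```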
5. Empirical Evaluation and Predictive Fidelity
The following summarizes key empirical findings for “small-bench” construction:
| Suite | Subtasks ($b$) | $R^2$ (averaged over 30 runs) |
|---|---|---|
| Random | 8 | 0.65 |
| Best-of-5000 | 8 | 0.80 |
| $k$-means | 8 | 0.75 |
| BIG-bench Hard | 24 | 0.75 |
| $k$-means + value | 24 | 0.84 |
| BIG-bench Lite | 42 | 0.78 |
| $k$-means + value | 42 | 0.88 |
The “$k$-means + value” variants consistently outperform static subsets like BIG-bench Hard and BIG-bench Lite, reaching $R^2 = 0.84$ (24 tasks) and $R^2 = 0.88$ (42 tasks), meaning they recover 84–88% of full-bench variance in held-out predictions with only about 8–13% of the 313 tasks.
When small benches are used to rank models, comparison errors highlight the inherent stochasticity of subset-based evaluation: using BIG-bench Hard (24 tasks) to compare BIG-G₁ 2B vs. GPT-3 Large yields the wrong winner in 70% of cases, whereas a 24-task Best-of-5000 bench selects the correct winner 56% of the time and ties 25% of the time, providing closer agreement with full-bench results.
6. Implications and Practical Considerations
The methodology demonstrates that LLM performance is highly predictable from configuration variables and can be reliably extrapolated using regression over informative task subsets. This suggests that evaluation suite design can move beyond manual curation and arbitrary subsetting towards principled, data-driven approaches that optimize for predictive sufficiency and efficiency.
A plausible implication is that for most practical large-scale benchmarking, large suites like BIG-bench can be replaced by compact, model-agnostic small benches that retain task diversity and informativeness via embedding-based clustering and value reweighting. This has direct benefits for benchmarking cost, scaling evaluations to new model families, and mitigating overfitting to outdated task baskets.
7. Relation to Other Benchmarks and Future Directions
BIG-bench Lite and “small-bench” methodologies contrast with static evaluation suites (e.g., BIG-bench Hard), which may underrepresent the diversity required for accurate generalization across model families. By leveraging latent task embeddings and cross-family validation, these methods help ensure robustness and minimize error in comparative evaluations.
Future directions include investigating the limits of predictability as model architectures further diversify, extending embedding and clustering workflows to new or evolving task sets, and formally integrating task subset selection into automated LLM evaluation pipelines. Task value estimation and dynamic task basket construction represent ongoing areas of methodological refinement (Ye et al., 2023).