
Multi-Subject Benchmark Subsets

Updated 17 November 2025
  • Multi-Subject Subsets are a systematic approach that defines and selects domain-specific evaluation segments from large benchmarks using multi-label annotations.
  • They employ stratified, random, and cognitive embedding-based sampling methods to achieve balanced coverage and reduce evaluation costs.
  • These subsets enable detailed performance analysis for LLMs across diverse domains, fostering improved interpretability and fairness in model assessments.

Multi-Subject Subsets of Standard Benchmarks encompass systematic methodologies for defining, constructing, and evaluating portions of large-scale LLM benchmark suites that span multiple subject domains. As LLMs are increasingly tasked with diverse and cross-disciplinary evaluations, this paradigm enables fine-grained, customizable, and cost-effective testing while preserving the interpretability and statistical rigor of aggregate performance metrics. Multi-subject subset principles, now integrated across core benchmark initiatives, underpin both practical efficiency and methodological transparency in model evaluation.

1. Formalization and Taxonomy

A multi-subject subset is formally defined by a selector function applied to a repository of benchmark questions, each labeled with subject, skill, and other metadata. For a unified benchmark B = \{q_1, \dots, q_N\} with subject set D = \{d_1, \dots, d_M\}, and question-level mappings:

  • \mathrm{subject}(q) \subset D (multi-label subject annotation)
  • \mathrm{skill}(q) \in \{\mathrm{knowledge}, \mathrm{reasoning}, \mathrm{value/alignment}\}
  • \mathrm{target}(q) \in \{\mathrm{general}, \mathrm{ko}, \mathrm{us}, \dots\}
  • \mathrm{task\_format}(q) \in \{\mathrm{binary}, \mathrm{MCQA}, \mathrm{short}, \dots\}

The extraction function for a multi-subject subset S \subset B over chosen subjects \{d_1, \dots, d_k\} is:

S(\{d_1, \dots, d_k\}) = \{\, q \in B \mid \mathrm{subject}(q) \cap \{d_1, \dots, d_k\} \neq \emptyset \,\}

For joint filters (subject, skill, target):

S_{\mathrm{subjects,\,skills,\,targets}} = \{\, q \in B \mid \mathrm{subject}(q) \cap S_{\mathrm{sub}} \neq \emptyset \,\land\, \mathrm{skill}(q) \in S_{\mathrm{skill}} \,\land\, \mathrm{target}(q) \in S_{\mathrm{target}} \,\}

BenchHub (Kim et al., 31 May 2025) defines a hierarchy—task, answer format, skill, subject, target—allowing robust indexing and O(1) filtering even for datasets exceeding 300K examples.
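For concreteness, the joint filter above can be realized as a simple predicate-based scan. The sketch below is illustrative (the question-record fields mirror the mappings listed above and are not the BenchHub schema); a helper of this shape is what the pseudocode in the next section assumes as filter_questions.

def filter_questions(B, subject_filter=None, skill_filter=None, target_filter=None):
    """Keep q if subject(q) intersects subject_filter, skill(q) is in skill_filter,
    and target(q) is in target_filter; a filter left as None is unconstrained."""
    selected = []
    for q in B:
        if subject_filter is not None and not (set(q["subject"]) & set(subject_filter)):
            continue
        if skill_filter is not None and q["skill"] not in skill_filter:
            continue
        if target_filter is not None and q["target"] not in target_filter:
            continue
        selected.append(q)
    return selected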

2. Subset Construction and Sampling Algorithms

Construction of multi-subject subsets involves complex filtering and stratification procedures to ensure fair domain coverage, difficulty balancing, and sampling efficiency. Extraction pseudocode typically follows:

def extract_subset(B, subject_filter, skill_filter, target_filter, sampling, n_per_dataset, n_total):
    # Restrict the unified benchmark to the requested subjects, skills, and targets.
    C = filter_questions(B, subject_filter, skill_filter, target_filter)
    S = []
    if sampling == "random":
        # Uniform sample over the filtered pool.
        S = uniform_sample(C, size=n_total)
    elif sampling == "stratified":
        # Equal-sized samples from each constituent dataset.
        for D_i in group_by_dataset(C):
            S.extend(random_sample(D_i, size=n_per_dataset))
    elif sampling == "category_distribution":
        # Match a reference category distribution (MixEval-style proportions).
        for cat_j, p_j in ref_proportions():
            S_j = uniform_sample([q for q in C if cat_j in subject(q)], size=int(p_j * n_total))
            S.extend(S_j)
    return S

BenchHub leverages a multi-task classifier (BenchHub-Cat-7B) for sample-level labeling with empirical accuracies of 87.1% (subject), 96.7% (skill), and 49.4% (target). Subset construction functions expose clear API endpoints and support dynamic expansion and real-time queries.

Sampling mode critically affects evaluation fairness and model ranking stability. For instance, stratified sampling by subject prevents overfitting to dominant domains, while proportion-based strategies (MixEval-like) maintain aggregate distribution fidelity (Kim et al., 31 May 2025).
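As an illustration, a hypothetical invocation of the sketch above (filter values follow the Section 1 taxonomy; the sampling helpers such as group_by_dataset and random_sample are assumed to be defined alongside it):

# Stratified slice over two subjects, reasoning skill, general target.
subset = extract_subset(
    B,
    subject_filter={"law", "medicine"},
    skill_filter={"reasoning"},
    target_filter={"general"},
    sampling="stratified",
    n_per_dataset=50,
    n_total=500,
)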

3. Item-Centric vs Model-Centric Selection: Cognitive Embeddings

Scales++ (Bean et al., 30 Oct 2025) proposes an item-centric methodology for subset selection, eschewing historical model score matrices. Each benchmark item t_i is mapped via LLM prompting to a cognitive-demand vector C_i \in \mathbb{R}^{16}, spanning axes derived from cognitive science (e.g., logical reasoning, knowledge subtype, quantitative reasoning). UMAP and k-means clustering are then used to select k items, ensuring inherent coverage of all subject/task regions with minimal ad hoc balancing.

For a subset size k, the predictors are:

  • Cluster-weighted average estimator:

\hat{S}_m^{(1)} = \sum_{j=1}^{k} \frac{|\mathcal{C}_j|}{N}\, M(\phi_m, c_j)

  • Dimension-wise logistic regression estimator:

\hat{S}_m^{(2)} = \sum_{i=1}^{N} \frac{1}{D} \sum_{d=1}^{D} \Pr_d(\mathrm{score}=1 \mid C_i[d])

The final estimate is a linear combination of the two estimators with optimally determined weights. Scales++ achieves <3% MAE for benchmark score prediction with 0.5% sampling, an 18× reduction in upfront cost compared to IRT++.

This approach guarantees breadth—since selected items naturally represent the cognitive, subject, and task diversity present in the benchmark.
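A minimal sketch of the item-centric selection step and the cluster-weighted estimator, assuming the cognitive-demand vectors C have already been produced by LLM prompting (the use of umap-learn and scikit-learn here is an assumption about tooling, not a description of the Scales++ implementation):

import numpy as np
import umap                      # umap-learn
from sklearn.cluster import KMeans

def select_probe_items(C, k, seed=0):
    """Pick k probe items from cognitive-demand vectors C (shape N x 16)."""
    emb = umap.UMAP(n_components=2, random_state=seed).fit_transform(C)
    km = KMeans(n_clusters=k, random_state=seed, n_init=10).fit(emb)
    chosen, sizes = [], []
    for j in range(k):
        members = np.where(km.labels_ == j)[0]
        centre = km.cluster_centers_[j]
        # Probe item = cluster member closest to the centroid in embedding space.
        chosen.append(members[np.argmin(np.linalg.norm(emb[members] - centre, axis=1))])
        sizes.append(len(members))
    return np.array(chosen), np.array(sizes)

def cluster_weighted_estimate(probe_scores, sizes):
    """S_hat^(1): cluster-size-weighted average of the model's probe scores."""
    sizes = np.asarray(sizes, dtype=float)
    return float(np.dot(sizes / sizes.sum(), probe_scores))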

4. Multi-Subject Heterogeneity and Reasoning Integration

Curated multi-subject subsets are crucial for evaluating systems designed for heterogeneous reasoning, such as S-DAG (Dong et al., 10 Nov 2025). Standard benchmarks often annotate questions with a single subject, obscuring interdisciplinary complexities.

S-DAG constructs multi-subject slices from benchmarks (MMLU-Pro, GPQA, MedMCQA) by prompting LLMs to assign subject relevance weights for each question. Heterogeneity is quantified as

H(q) = 1 - \max_{s \in S} P(s \mid q)

with curated subsets achieving \mathrm{Avg\ \#\ Subjects/Q} \approx 3.9 and \mathrm{Avg}\ H(q) \approx 0.7.

Subject-based DAGs are constructed with edges connecting supporting to dominant subjects, reflecting latent dependency graphs for reasoning. The curated multi-subject slices reduce baseline chain-of-thought accuracy by 15–20 points relative to single-subject evaluation, illustrating their increased complexity. Experimentally, S-DAG achieves higher accuracy and roughly halved inference latency versus unconstrained multi-agent collaboration baselines.
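A small sketch of the per-question bookkeeping, assuming the LLM-assigned subject weights arrive as a dict (names and normalization are illustrative, not the S-DAG implementation):

def heterogeneity(subject_weights):
    """H(q) = 1 - max_s P(s | q), with P obtained by normalizing the weights."""
    total = sum(subject_weights.values())
    return 1.0 - max(w / total for w in subject_weights.values())

def subject_dag_edges(subject_weights):
    """Directed edges from each supporting subject to the dominant subject."""
    dominant = max(subject_weights, key=subject_weights.get)
    return [(s, dominant) for s in subject_weights if s != dominant]

# e.g. heterogeneity({"physics": 0.5, "chemistry": 0.3, "biology": 0.2}) -> 0.5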

5. Difficulty Labeling and Balanced Coverage

Easy2Hard-Bench (Ding et al., 27 Sep 2024) demonstrates multi-subject subset construction across six benchmarks with item-wise numerical difficulty annotated via IRT or Glicko-2 models. Each problem is tagged with a (\mathrm{domain}, \mathrm{diff}) pair. Filtering functions select by subject and difficulty tier (easy, medium, hard) to build mixes such as

multi = pd.concat([select(dom, 'med', 10) for dom in df.domain.unique()])

Balanced multi-subject subsets can be constructed to optimize for uniform domain and tier representation, progressive difficulty curriculum, or Pareto-efficiency. For example, with six domains and three difficulty tiers, a 6–18 item subset can preserve coverage, enable cross-domain generalization checks, and foster curriculum fine-tuning experiments.
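A self-contained sketch of the select helper used in the one-liner above, with a toy item table (the column names domain and diff_tier and the tier labels are assumptions for illustration):

import pandas as pd

# Toy item table; in practice this would hold the Easy2Hard-Bench annotations.
df = pd.DataFrame({
    "qid":       range(6),
    "domain":    ["math", "math", "physics", "physics", "law", "law"],
    "diff_tier": ["easy", "med", "med", "hard", "med", "hard"],
})

def select(dom, tier, n, seed=0):
    """Sample up to n items from one (domain, difficulty-tier) cell of df."""
    cell = df[(df.domain == dom) & (df.diff_tier == tier)]
    return cell.sample(n=min(n, len(cell)), random_state=seed)

# Balanced mix: one medium-difficulty slice per domain.
multi = pd.concat([select(dom, "med", 2) for dom in df.domain.unique()])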

Percentage breakdowns by tier and domain facilitate explicit tuning of coverage versus subset size, critical for both benchmarking and fair generalization measurements.

6. Performance Matrix–Based Subset Discovery and Statistical Fidelity

SimBA (Subramani et al., 20 Oct 2025) abstracts the benchmark as a performance matrix M \in \mathbb{R}^{n \times m} (subjects by models), seeking minimal subject subsets S that achieve threshold coverage:

\mathrm{coverage}(S) = \frac{1}{m} \sum_{j=1}^{m} \max_{i \in S} M_{i,j}

Empirically, for MMLU and HELM, a single subject suffices for \geq 95\% coverage (S^{*} = \{\mathrm{Professional\ Law}\} for MMLU). Greedy representative-subset algorithms exploit pairwise Pearson/Minkowski similarity among subjects to select S. Rank preservation (Spearman \rho > 0.95) and near-zero prediction error on held-out models confirm that judicious subset selection preserves aggregate benchmark orderings, dramatically reducing evaluation and model selection cost.

Benchmarks with higher subject independence (BigBenchLite) require more subjects to achieve similar coverage, underscoring the importance of multi-subject structure analysis for each new dataset.
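A minimal sketch of greedy coverage-based subject selection over a performance matrix, under the simplifying assumption that the target is a fraction of the full-set coverage (an illustrative reading, not the SimBA algorithm verbatim):

import numpy as np

def coverage(M, rows):
    # coverage(S) = (1/m) * sum_j max_{i in S} M[i, j]
    return M[rows].max(axis=0).mean()

def greedy_subject_subset(M, subjects, threshold=0.95):
    """Greedily add subjects until coverage reaches `threshold` times the
    coverage of the full subject set."""
    target = threshold * coverage(M, list(range(len(subjects))))
    chosen = []
    while not chosen or coverage(M, chosen) < target:
        remaining = [i for i in range(len(subjects)) if i not in chosen]
        # Marginal-gain step: pick the subject that raises coverage the most.
        best = max(remaining, key=lambda i: coverage(M, chosen + [i]))
        chosen.append(best)
    return [subjects[i] for i in chosen]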

7. Benchmark-Specific Designs and Cross-Lingual Considerations

LHMKE (Liu et al., 19 Mar 2024) exemplifies the design of holistic, multilingual multi-subject benchmarks. Sampling covers 30 subjects and 75 tasks spanning the entire Chinese education spectrum (primary to professional certification):

  • Standardized score-based scaling for inter-subject comparability.
  • Balanced objective vs subjective question distribution (p_{\mathrm{obj}} \approx 0.75, p_{\mathrm{sub}} \approx 0.25).
  • Per-subject caps and complete-paper inclusion policies maintain authentic exam representation.

Compared to previous Chinese benchmarks focused mainly on multiple-choice formats, LHMKE introduces subjective assessment and broadens coverage, filling gaps in earlier datasets. This multi-subject design enables direct probing of knowledge acquisition and expression capabilities across diverse linguistic and cognitive domains, with statistical breakdowns ensuring defensible group and difficulty balancing.
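If the standardized score-based scaling is read as per-subject z-scoring followed by averaging (an assumption about the intent, not the LHMKE recipe), a minimal sketch looks like:

import numpy as np

def standardized_aggregate(per_subject_scores):
    """per_subject_scores: dict subject -> array of raw scores across models.
    Z-score within each subject, then average the standardized scores per model."""
    z = []
    for subject, raw in per_subject_scores.items():
        raw = np.asarray(raw, dtype=float)
        z.append((raw - raw.mean()) / raw.std())
    return np.mean(z, axis=0)   # one comparable aggregate per model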

8. Interpretability, Statistical Testing, and Practical Recommendations

Empirical findings across benchmarks such as BenchHub, Scales++, S-DAG, Easy2Hard-Bench, and SimBA demonstrate that:

  • Model leaderboard positions and rankings are highly sensitive to subject composition.
  • Sampling strategy and subset definition materially affect measured accuracy, ranking stability, and statistical significance (e.g., Friedman \chi^2 and Wilcoxon signed-rank tests with p < 0.01 (Kim et al., 31 May 2025)).
  • Multi-subject slices facilitate stress testing of routing, mixture-of-experts, and multi-agent debate systems, revealing robustness and fairness properties absent in single-domain slices.
  • Algorithmic recipes (extract_subset, cluster-weighted estimators, coverage-based greedy selection) are generalizable across benchmarks; practitioners should empirically verify rank preservation and predictive fidelity before subset deployment, as in the check sketched below.
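A minimal rank-preservation check, comparing model orderings on the full benchmark and on a candidate subset (the accuracy values are hypothetical):

from scipy.stats import spearmanr

def rank_preservation(full_scores, subset_scores):
    """Spearman correlation between model rankings on the full benchmark and on
    the subset; values near 1 indicate the subset preserves the leaderboard."""
    rho, pval = spearmanr(full_scores, subset_scores)
    return rho, pval

full   = [0.71, 0.64, 0.58, 0.49, 0.44]   # per-model accuracy, full benchmark
subset = [0.69, 0.66, 0.55, 0.50, 0.41]   # per-model accuracy, candidate subset
print(rank_preservation(full, subset))     # rho near 1 => orderings preserved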

A plausible implication is that cross-domain generalization, curriculum design, and adaptive model routing are best studied on carefully constructed multi-subject subsets with explicit balancing, difficulty annotation, and statistical coverage guarantees.

Table: Multi-Subject Subset Construction Method Summary

Benchmark / Method | Multi-Subject Selection Principle | Coverage / Fidelity Summary
BenchHub (Kim et al., 31 May 2025) | Filter on subjects/skills; stratified/random sampling; taxonomy-based indexing | Enables sub-second queries; rankings vary with slice
Scales++ (Bean et al., 30 Oct 2025) | Cluster in cognitive-demand embedding space; item-centric | <3% MAE at 0.5% subset; covers all subject axes
S-DAG (Dong et al., 10 Nov 2025) | LLM-assigned subject weights; heterogeneity cap; DAG construction | Avg. 3.9 subjects per Q; 15–20 pt accuracy drop
Easy2Hard-Bench (Ding et al., 27 Sep 2024) | Domain × difficulty filtering; binning and curriculum strategies | Balanced subsets; curriculum fine-tuning possible
SimBA (Subramani et al., 20 Oct 2025) | Performance-matrix greedy selection for coverage | Single subject yields >95% coverage on MMLU/HELM
LHMKE (Liu et al., 19 Mar 2024) | Score-based scaling; group-level caps; objective/subjective split | Broadest coverage across Chinese educational spectrum

Multi-subject subsets represent a foundational infrastructure for rigorous, interpretable, and cost-effective LLM benchmarking, enabling both wide subject coverage and precise insight into model generalization, robustness, and reasoning capacity across heterogeneous domains.
