
SelectivBench: ML Selectivity Benchmark

Updated 25 January 2026
  • SelectivBench is a comprehensive suite of benchmarking methodologies and datasets designed to rigorously evaluate selectivity mechanisms in machine learning models.
  • It offers controlled experiments for sequence modeling, selective classification, and subset selection, emphasizing reproducible comparisons and modular extensibility.
  • Its protocols include precise metrics and architectural insights, enabling rigorous analysis of memory, context-conditioning, distractor rejection, and transfer learning.

SelectivBench is a term denoting a suite of benchmarking methodologies and datasets designed for rigorous evaluation of selectivity mechanisms in machine learning models. Several recent studies independently introduce frameworks under the "SelectivBench" name, primarily targeting sequence modeling, selective classification, and subset selection in optimization. SelectivBench emphasizes controlled, reproducible comparisons, modularity, and breadth of coverage rather than domain complexity, offering lightweight synthetic constructs, curated natural data, and extensible codebases suited to high-throughput experimentation and architectural dissection (Bouhadjar et al., 18 Jan 2026; Pugnana et al., 2024; Shang et al., 2022).

1. Formal Construction and Task Taxonomy

SelectivBench, as presented for sequence modeling, centers on rule-based artificial grammars implemented via Markov chains over latent states $\mathcal{Z}$, partitioned into observable symbols $\mathcal{S}$, with configurable ambiguity $A(s)=A$. Four primary evaluation tasks are defined (Bouhadjar et al., 18 Jan 2026):

  • Memorization for Disambiguation: The model reconstructs the full latent state trajectory $z_{1:T}$ given ambiguous observable tokens $s_{1:T}$. Accuracy is measured as

$$\mathrm{Acc} = \frac{1}{T}\sum_{t=1}^{T} \mathbf{1}\{y_t = z_t\}.$$

  • Noise Rejection Selectivity: True tokens are interspersed with gaps of dense random noise $e_{t,i} \sim \mathrm{Unif}[0,\gamma]$ between successive tokens $s_t$, requiring the model to ignore gap elements and output persistent labels over them.
  • Context-aware Selectivity: Gaps are populated with tokens that are valid in the vocabulary but forbidden by the transition matrix $\tau$, necessitating context-conditional rejection based on the grammar.
  • Length Generalization: After training on moderate gap lengths, models are tested with gap sizes far exceeding those seen during training, probing architectural capacity for extrapolation.
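The memorization accuracy defined above is a plain token-wise average over the predicted trajectory; a minimal sketch (function name is an assumption, not the benchmark's API):

```python
import numpy as np

def token_accuracy(y_pred, z_true):
    """Token-wise accuracy Acc = (1/T) * sum_t 1{y_t = z_t}."""
    y_pred, z_true = np.asarray(y_pred), np.asarray(z_true)
    return float((y_pred == z_true).mean())
```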

Complexity is parameterized by vocabulary size $|\mathcal{S}|$, partition depth $D$, gap range $(n_{\min}, n_{\max})$, noise amplitude $\gamma$, and gap probability $G$.
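The grammar construction described above can be sketched as a random row-stochastic Markov chain over latent states plus a many-to-one partition map that makes observations ambiguous. This is an illustrative toy, not the benchmark's actual generator; all names and parameters here are assumptions:

```python
import numpy as np

def make_grammar(num_latent, ambiguity, seed=0):
    """Random Markov chain over latent states, plus a partition mapping
    each latent state to an observable symbol. With ambiguity A, A latent
    states share every observable symbol, so a symbol alone does not
    identify the latent state."""
    rng = np.random.default_rng(seed)
    trans = rng.random((num_latent, num_latent))
    trans /= trans.sum(axis=1, keepdims=True)   # row-stochastic transition matrix tau
    emit = np.arange(num_latent) // ambiguity   # partition: latent state -> observable symbol
    return trans, emit

def sample_sequence(trans, emit, length, seed=0):
    """Sample a latent trajectory z_{1:T} and its observable projection s_{1:T}."""
    rng = np.random.default_rng(seed)
    z = [rng.integers(len(trans))]
    for _ in range(length - 1):
        z.append(rng.choice(len(trans), p=trans[z[-1]]))
    z = np.array(z)
    return z, emit[z]
```

Disambiguation then amounts to recovering `z` from `emit[z]`, which is only possible by exploiting the transition structure.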

2. Evaluation Protocols and Metrics

All SelectivBench instantiations adhere to systematic evaluation procedures, balancing scalability with statistical rigor. For the sequence modeling suite, experiments utilize 200,000 training and 1,000 test sequences per grammar, with sequence lengths up to $T_{\max}=200$. Metrics center on token-wise classification accuracy, drop in accuracy upon introduction of noise/distractors, and performance decay as a function of unseen gap length.

The selective classification SelectivBench (Pugnana et al., 2024) implements additional metrics:

  • Coverage: Fraction of test inputs for which predictions are made:

$$\mathrm{Coverage}(\phi) = \frac{1}{n}\sum_{i=1}^{n} \phi(X_i).$$

  • Selective Risk: Error rate on selected predictions:

$$\mathrm{Risk}(\phi) = \frac{\sum_{i=1}^{n} \phi(X_i)\,\mathbf{1}\{f(X_i)\neq Y_i\}}{\sum_{i=1}^{n} \phi(X_i)}.$$

  • Coverage-Violation, Class-Rejection Histograms, OOD Coverage: Refinements to quantify overabstention, class bias in rejection, and robustness to out-of-distribution data.
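The coverage and selective-risk formulas translate directly into a few lines of NumPy; this is a minimal sketch (function names are assumptions, not the suite's API):

```python
import numpy as np

def coverage(accept):
    """Coverage(phi): fraction of inputs on which the model predicts."""
    return float(np.mean(np.asarray(accept, dtype=bool)))

def selective_risk(accept, y_pred, y_true):
    """Risk(phi): error rate restricted to accepted (non-abstained) predictions."""
    accept = np.asarray(accept, dtype=bool)
    if accept.sum() == 0:
        return 0.0  # convention here: no predictions, no risk
    errors = (np.asarray(y_pred) != np.asarray(y_true)) & accept
    return float(errors.sum() / accept.sum())
```

Note the denominator: risk is normalized by the number of accepted inputs, so low coverage can mask high underlying error, which is exactly why both quantities are reported together.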

The subset selection SelectivBench (Shang et al., 2022) quantifies solution quality by hypervolume (HV), IGD/IGD$^+$, uniformity level $U(S)$, and wall-clock runtime.
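In two dimensions the hypervolume indicator admits an exact $O(n \log n)$ sweep; the sketch below (an illustrative implementation, not the benchmark's code) computes, for a minimization problem, the area dominated by the non-dominated points and bounded by a reference point:

```python
def hypervolume_2d(points, ref):
    """Exact 2-D hypervolume for minimization: area dominated by the
    non-dominated subset of `points`, bounded above by `ref`."""
    # Sort by first objective; keep only non-dominated points.
    front = []
    best_f2 = float("inf")
    for f1, f2 in sorted(points):
        if f2 < best_f2:          # not dominated by any point to its left
            front.append((f1, f2))
            best_f2 = f2
    # Sum the staircase of rectangles between consecutive front points.
    hv = 0.0
    for i, (f1, f2) in enumerate(front):
        next_f1 = front[i + 1][0] if i + 1 < len(front) else ref[0]
        hv += (next_f1 - f1) * (ref[1] - f2)
    return hv
```

Dominated candidates contribute nothing, so a subset selection algorithm maximizing HV implicitly favors well-spread non-dominated solutions.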

3. Architectural and Methodological Insights

The sequence SelectivBench framework (Bouhadjar et al., 18 Jan 2026) reveals core principles governing selective processing in linear recurrent models (LRMs):

  • Gating plus rapid forgetting enables effective recall and rejection of irrelevant or stale content, crucial for handling noisy gaps.
  • Complementary gating (as in Mamba/Mamba2) coordinates write and forget operations, central to context-aware distractor suppression.
  • Channel mixing (multi-Householder, low-rank updates) is not necessary for raw selectivity but significantly benefits generalization and memory over longer gaps.
  • Softmax attention outperforms engineered LRMs in short-sequence settings, but its quadratic scaling renders it impractical beyond moderate lengths.
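The gating-plus-forgetting principle above can be made concrete with a toy diagonal linear recurrence in which the forget gate is complementary to the write gate, so strong writes flush stale state during noisy gaps. This is a rough illustration in the spirit of Mamba-style updates, not any model's actual equations; names and shapes are assumptions:

```python
import numpy as np

def gated_linear_recurrence(x, w_gate):
    """Toy complementary-gated recurrence:
        i_t = sigmoid(x_t . w_gate)   # write gate
        f_t = 1 - i_t                 # forget gate, tied to the write
        h_t = f_t * h_{t-1} + i_t * x_t
    Inputs that trigger a write also trigger forgetting of old content."""
    h = np.zeros(x.shape[1])
    out = []
    for x_t in x:
        i_t = 1.0 / (1.0 + np.exp(-(x_t @ w_gate)))  # scalar write gate in (0, 1)
        h = (1.0 - i_t) * h + i_t * x_t              # writing forces forgetting
        out.append(h.copy())
    return np.stack(out)
```

With `w_gate = 0` the gate sits at 0.5 and the state becomes an exponential moving average; learned gates instead let the model hold state through distractors and overwrite it only on grammar-relevant tokens.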

Final test accuracy across major LRM and Transformer baselines on the hardest selectivity tasks is tabulated below, illustrating performance sensitivity to gating, forgetting, and mixing mechanisms:

Model              | Task 1 | Task 2 | Task 3
-------------------|--------|--------|-------
DeltaNet           |  50%   |  41%   |  35%
GLA                |  75%   |  68%   |  50%
Mamba              |  89%   |  71%   |  63%
Mamba2             |  92%   |  67%   |  64%
Gated DeltaNet     |  87%   |  76%   |  57%
Gated DeltaProduct |  87%   |  74%   |  61%
Transformer        |  86%   |  77%   |  67%

4. Practical Implementation and Extensibility

The SelectivBench selective classification suite (Pugnana et al., 2024) comprises 44 datasets (22 image, 22 tabular), 18 baseline methods, and a modular PyTorch codebase extensible via abstract base classes and registries. Coverage calibration, surrogate-risk minimization, conformal prediction, and Monte Carlo dropout are implemented to enable apples-to-apples comparison.

Performance plots, risk curves, coverage satisfaction, rejection bias, and OOD robustness are produced through standardized reporting routines. Practitioners can add new datasets or methods via concise subclass and registry definitions, configure experiments via Hydra, and tune hyperparameters via Optuna. The code is open-source under the MIT license.
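The abstract-base-class-plus-registry pattern described above can be sketched as follows; the class and registry names are hypothetical illustrations, not SelectivBench's actual API:

```python
from abc import ABC, abstractmethod

METHOD_REGISTRY = {}

def register(name):
    """Decorator adding a method class to the registry under `name`."""
    def wrap(cls):
        METHOD_REGISTRY[name] = cls
        return cls
    return wrap

class SelectiveMethod(ABC):
    """Interface every selective classifier must implement."""
    @abstractmethod
    def predict(self, i): ...
    @abstractmethod
    def confidence(self, i): ...

@register("softmax_response")
class SoftmaxResponse(SelectiveMethod):
    """Predict the arg-max class; confidence is the top softmax probability."""
    def __init__(self, probs):
        self.probs = probs  # per-example class-probability lists
    def predict(self, i):
        return max(range(len(self.probs[i])), key=self.probs[i].__getitem__)
    def confidence(self, i):
        return max(self.probs[i])
```

A new baseline then only needs a subclass plus a one-line `@register(...)` decoration; the benchmark loop looks methods up by name.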

5. Connection to Subset Selection and Transfer Learning

SelectivBench also references subset selection benchmarks for evolutionary multi-objective optimization (Shang et al., 2022), wherein candidate solutions (non-dominated archives or sampled Pareto fronts) are distilled via ten representative subset selection algorithms. Metrics probe trade-offs among hypervolume maximization, IGD minimization, uniformity, and computational tractability, supporting recommendations tailored to specific optimization objectives and front shapes.

In transfer learning, SelectivBench (Deshpande et al., 2021) aggregates a model zoo (single-/multi-domain experts) and diverse target tasks to compare model selection proxies, with ranking correlation, selection efficiency, and fine-tuning gain as key criteria. Label–Feature Correlation (LFC) and Label–Gradient Correlation (LGC) outperform prior baselines in accuracy and selection efficiency, especially in low-shot domains.
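The intuition behind a label–feature correlation score is that a source model transfers well when its feature similarities already predict target label agreement. The sketch below is a hedged toy variant of that idea (the exact LFC definition is in Deshpande et al., 2021; the cosine-similarity form and ±1 agreement matrix here are assumptions for illustration):

```python
import numpy as np

def label_feature_correlation(features, labels):
    """Toy label-feature correlation: mean over pairs of
    (cosine feature similarity) * (+1 if labels agree, -1 otherwise).
    High values mean same-class points already cluster in feature space."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = f @ f.T                                    # pairwise cosine similarities
    agree = np.where(labels[:, None] == labels[None, :], 1.0, -1.0)
    mask = ~np.eye(len(labels), dtype=bool)          # exclude self-pairs
    return float(np.mean(sim[mask] * agree[mask]))
```

Because it needs only one forward pass to extract features, such a proxy is far cheaper than fine-tuning every zoo model, which is the efficiency argument made above.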

6. Significance and Context

SelectivBench facilitates dissecting selectivity mechanisms across architectures (LRMs, Transformers, selective classifiers, optimization subset selectors), supports reproducible benchmarking under well-defined complexity controls, and enables fair statistical comparisons free of dataset- or class-specific selection bias. This modularity and transparency foster deeper understanding of memory, context-conditioning, abstention dynamics, and transferability, yielding insights translatable to large-scale tasks while requiring only modest computational resources. The codebases associated with SelectivBench provide reference implementations and extensibility points for broader community adoption (Bouhadjar et al., 18 Jan 2026, Pugnana et al., 2024, Shang et al., 2022, Deshpande et al., 2021).
