Model Capability Scoring
- Model capability scoring is a suite of methodologies that quantitatively assesses machine learning models’ latent abilities using formal definitions, multidimensional taxonomies, and automated testing.
- It integrates mechanism-based metrics such as the Model Utility Index and statistical scoring rules to enable robust, fair, and interpretable inter-model comparisons.
- The framework supports practical applications like representative subset construction, dynamic model routing, and domain-specific evaluations, guiding future research and optimization.
Model capability scoring is the suite of methodologies for quantitatively assessing the breadth, reliability, and character of the latent abilities exhibited by machine learning models, particularly large foundation models and LLMs. The field spans formal operationalizations of what it means for a model to "possess" a capability, multidimensional capability taxonomies, mechanism-driven neural criteria, data-driven and programmatic evaluation procedures, and the development of efficient, fair, and generalizable scoring pipelines. Recent research has also introduced automated discovery, ranking, and probing methodologies that minimize both human effort and data redundancy.
1. Formal Definitions of Model Capability
A rigorous definition of model capability is essential for principled scoring. The Conditional Analysis of Model Abilities (CAMA) provides a formalism: for a model $M$, capability $\phi$, operationalization construct (query distribution $\mathcal{Q}$, evaluation criterion $\varepsilon$), and background conditions $b$ (e.g., prompt template, decoding parameters), $M$ is said to “have” the capability if

$$\Pr_{q \sim \mathcal{Q}}\big[\varepsilon(M(q)) = 1 \,\big|\, \mathrm{Try}(M, \phi, q, b)\big] \;\geq\; \tau$$

for some reliability threshold $\tau$ (e.g., 0.9), where $\mathrm{Try}(M, \phi, q, b)$ indicates that $M$, when producing its output to $q$ under $b$, is best characterized as “trying” to execute $\phi$ (Harding et al., 14 May 2024).
Operationalizing “trying” is realized by stability/sensitivity tests under perturbation: teleological criteria ensure that a model only “tries” if its behavior is robust to $\phi$-irrelevant changes and sensitive to $\phi$-relevant changes. Capability scoring in this setting involves sampling background configurations and reporting the maximum empirical probability (with confidence intervals) of successful “tried” executions.
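A minimal Python sketch of such a scoring loop is shown below. The callables `model(q, b)`, `evaluate(q, output)`, and `tried(...)` (the teleological “trying” test) are hypothetical placeholders, and the Wilson interval is just one reasonable confidence-interval choice; this is not the CAMA authors’ reference implementation.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (95% by default)."""
    if n == 0:
        return 0.0, 1.0
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return max(0.0, center - half), min(1.0, center + half)

def capability_score(model, queries, evaluate, tried, backgrounds, tau=0.9):
    """For each background configuration b, estimate P(success | tried) over queries;
    report the best-supported estimate and whether it clears the reliability threshold tau."""
    best = None
    for b in backgrounds:
        successes, attempts = 0, 0
        for q in queries:
            output = model(q, b)                # hypothetical model call under background b
            if tried(model, q, output, b):      # teleological "trying" test (assumed supplied)
                attempts += 1
                successes += int(evaluate(q, output))
            # queries on which the model is not "trying" do not enter the estimate
        if attempts:
            p_hat = successes / attempts
            if best is None or p_hat > best["p_hat"]:
                best = {"background": b, "p_hat": p_hat,
                        "ci": wilson_interval(successes, attempts), "n": attempts}
    if best is None:
        return {"has_capability": False, "reason": "model never 'tried' the task"}
    best["has_capability"] = best["p_hat"] >= tau
    return best
```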
This framework enables model- and context-conditional scoring, rigorous quantification of capability boundaries, and principled inter-model comparisons that are robust to spurious correlations or spurious robustness.
2. Mechanism-Based Scoring: Model Utility Index
Beyond behavioral success rates, mechanism-centric metrics such as the Model Utility Index (MUI) quantify the coverage of a model’s internal latent structure during evaluation. Formally, for a model $M$ and test sample $x$, MUI is defined as

$$\mathrm{MUI}(M, x) = \frac{|\mathcal{N}_{\mathrm{act}}(M, x)|}{|\mathcal{N}(M)|},$$

where $|\mathcal{N}_{\mathrm{act}}(M, x)|$ is the number of neurons (or latent units) deemed “activated” beyond a threshold during $M$’s processing of $x$, and $|\mathcal{N}(M)|$ is the model’s total neuron count (Wang et al., 13 Aug 2025).
In practical transformer models, the neuron-based MUI employs interpretability scores that track per-layer, per-neuron activations exceeding a top-$k$ percentile threshold. For a test set $\mathcal{T}$,

$$\mathrm{MUI}(M, \mathcal{T}) = \frac{\big|\bigcup_{x \in \mathcal{T}} \mathcal{N}_{\mathrm{act}}(M, x)\big|}{|\mathcal{N}(M)|}$$

quantifies the fraction of the model’s entire neuron inventory exercised by the evaluation set.
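A minimal sketch of this computation, assuming per-sample activation magnitudes are already available as NumPy arrays keyed by layer name; the function names and the top-$k$-percent thresholding convention are illustrative rather than the paper’s exact implementation.

```python
import numpy as np

def activated_neurons(layer_acts: dict[str, np.ndarray], top_k_percent: float = 1.0) -> set[tuple[str, int]]:
    """(layer, neuron_index) pairs whose activation magnitude exceeds the
    per-layer top-k-percent threshold for a single test sample."""
    active: set[tuple[str, int]] = set()
    for layer, acts in layer_acts.items():       # acts: 1-D array of per-neuron scores
        thresh = np.percentile(np.abs(acts), 100.0 - top_k_percent)
        for i in np.nonzero(np.abs(acts) > thresh)[0]:
            active.add((layer, int(i)))
    return active

def mui(per_sample_acts: list[dict[str, np.ndarray]], total_neurons: int, top_k_percent: float = 1.0) -> float:
    """MUI over a test set: fraction of the model's neuron inventory activated
    by at least one sample (union over samples)."""
    union: set[tuple[str, int]] = set()
    for layer_acts in per_sample_acts:
        union |= activated_neurons(layer_acts, top_k_percent)
    return len(union) / total_neurons
```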
Unlike accuracy or loss, MUI is independent of the model’s prediction correctness; statistical testing confirms that MUI distributions are statistically indistinguishable between correct and incorrect outputs. This insulates mechanism-based sampling from bias toward task difficulty or ease.
Representative subset selection is cast as a Maximum Coverage Problem, with a $(1 - 1/e)$-approximate greedy algorithm selecting diverse, capability-maximizing test examples that preserve model rankings with high Spearman/Kendall correlation, even with as little as 5–10% of the original evaluation data (Wang et al., 13 Aug 2025).
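A sketch of the greedy coverage step, under the assumption that each candidate test sample has already been mapped to its set of activated neurons (sample identifiers and the budget parameter are illustrative):

```python
def greedy_max_coverage(sample_neurons: dict[str, set], budget: int) -> list[str]:
    """(1 - 1/e)-approximate greedy selection: repeatedly pick the sample whose
    activated-neuron set adds the most not-yet-covered neurons, up to the budget."""
    covered: set = set()
    chosen: list[str] = []
    remaining = dict(sample_neurons)
    for _ in range(min(budget, len(remaining))):
        best_id, best_gain = None, 0
        for sid, neurons in remaining.items():
            gain = len(neurons - covered)
            if gain > best_gain:
                best_id, best_gain = sid, gain
        if best_id is None:      # no remaining sample adds new coverage
            break
        chosen.append(best_id)
        covered |= remaining.pop(best_id)
    return chosen
```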
3. Multidimensional Capability Taxonomies and Dataset Coverage Metrics
High-fidelity scoring demands fine-grained, structured capability spaces. The CDT (Cognition–Domain–Task) framework constructs a three-axis taxonomy: cognition ($\mathcal{C}$: 18 CHC-inspired cognitive skills), domain ($\mathcal{D}$: 33 subdomains), and task type ($\mathcal{T}$: 16 canonical formats) (Mo et al., 29 Sep 2025). Each instruction is tagged with a triplet $(c, d, t) \in \mathcal{C} \times \mathcal{D} \times \mathcal{T}$.
Capability coverage and balance are quantified as

$$\mathrm{Coverage}(S) = \frac{|U(S)|}{|\mathcal{C} \times \mathcal{D} \times \mathcal{T}|},$$

where $U(S)$ is the set of unique triplets $(c, d, t)$ observed in dataset $S$ and $\mathcal{C} \times \mathcal{D} \times \mathcal{T}$ is the full capability space; balance is measured via the entropy of the observed triplet distribution.
Normalized versions (dividing by the cardinality of each axis) enable comparability across model scales and corpora. Unified scalar scores, combining coverage and normalized entropy terms across the three axes, have been shown to correlate tightly with downstream model performance (Spearman ≈ 0.75–0.85) and to support precise data diversification and focused fine-tuning (Mo et al., 29 Sep 2025).
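As an illustration only, coverage together with an entropy-based balance term over the joint triplet distribution could be computed as below; this collapses the three axes into joint triplets and does not reproduce the CDT paper’s exact per-axis normalization.

```python
import math
from collections import Counter

def cdt_coverage_and_balance(triplets: list[tuple[str, str, str]],
                             space_size: int = 18 * 33 * 16) -> tuple[float, float]:
    """Coverage = unique (cognition, domain, task) triplets over the full space size;
    balance = normalized entropy of the observed triplet distribution (1.0 = uniform)."""
    counts = Counter(triplets)
    coverage = len(counts) / space_size
    n = len(triplets)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    balance = entropy / math.log(len(counts)) if len(counts) > 1 else 0.0
    return coverage, balance
```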
4. Automated and Active Capability Discovery
Automated Capability Discovery (ACD) and Active learning for Capability Evaluation (ACE) frameworks automate the process of discovering and scoring the full capability surface (Lu et al., 11 Feb 2025, Afkanpour et al., 22 May 2025). In both, a powerful LLM (the “scientist”) decomposes target domains into capability descriptors, generates evaluation tasks, and aggregates results.
In ACE, capabilities are embedded into a latent semantic space. Model performance is treated as a function $f(z)$ (where $z$ is the capability embedding), learned and actively sampled via Bayesian optimization (Gaussian process regression with MacKay ALM and Cohn ALC acquisition functions). This enables computationally efficient discovery of unresolved capability regions or failure surfaces, moving beyond static, hand-engineered benchmarks (Afkanpour et al., 22 May 2025).
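A simplified sketch of such a loop with a scikit-learn Gaussian process surrogate and a variance-maximizing (ALM-style) acquisition; the `evaluate_model` callable, the fixed candidate pool of capability embeddings, and all hyperparameters are assumptions for illustration, not the ACE implementation.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def active_capability_evaluation(capability_embeddings: np.ndarray,
                                 evaluate_model,        # callable: candidate index -> score in [0, 1]
                                 n_init: int = 5, n_rounds: int = 20, seed: int = 0):
    """Fit a GP surrogate f(z) over capability embeddings and repeatedly evaluate the
    candidate with the largest predictive uncertainty (ALM-style acquisition)."""
    rng = np.random.default_rng(seed)
    n = len(capability_embeddings)
    evaluated = [int(i) for i in rng.choice(n, size=min(n_init, n), replace=False)]
    scores = [evaluate_model(i) for i in evaluated]
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), normalize_y=True)
    for _ in range(n_rounds):
        gp.fit(capability_embeddings[evaluated], np.asarray(scores))
        _, std = gp.predict(capability_embeddings, return_std=True)
        std[evaluated] = -np.inf                       # never re-query evaluated capabilities
        nxt = int(np.argmax(std))
        if not np.isfinite(std[nxt]):
            break                                      # every candidate already evaluated
        evaluated.append(nxt)
        scores.append(evaluate_model(nxt))
    gp.fit(capability_embeddings[evaluated], np.asarray(scores))
    mean, std = gp.predict(capability_embeddings, return_std=True)
    return mean, std, evaluated                        # surrogate capability surface + query order
```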
ACD embraces open-endedness by evolving an archive of novel, LLM-generated task families, subjecting each to both programmatic and model-based scoring, and clustering observed task successes/failures via t-SNE + HDBSCAN into capability areas. Evaluation leans on binary scoring at both the subtask and cluster levels, providing a per-model vector of capability strengths normalized to $[0, 1]$. Agreement with human raters (F1 ≈ 0.86, >92% validity) supports the reliability of this approach for large-scale model capability mapping (Lu et al., 11 Feb 2025).
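The clustering step might be sketched as follows, assuming task embeddings and binary success indicators are already available (uses `sklearn.cluster.HDBSCAN`, available in scikit-learn ≥ 1.3; all parameters are illustrative):

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.cluster import HDBSCAN   # requires scikit-learn >= 1.3

def cluster_capability_areas(task_embeddings: np.ndarray,
                             task_success: np.ndarray) -> dict[int, float]:
    """Project LLM-generated task embeddings to 2-D with t-SNE, cluster with HDBSCAN,
    and return the mean binary success rate per discovered capability cluster."""
    coords = TSNE(n_components=2, perplexity=min(30, len(task_embeddings) - 1),
                  random_state=0).fit_transform(task_embeddings)
    labels = HDBSCAN(min_cluster_size=5).fit_predict(coords)
    profile: dict[int, float] = {}
    for c in np.unique(labels):
        if c == -1:                   # HDBSCAN noise label: not assigned to any cluster
            continue
        profile[int(c)] = float(task_success[labels == c].mean())
    return profile
```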
5. Domain-Specific and Multimodal Capability Scoring
Specialized domains, particularly safety- and accuracy-critical fields such as medical question answering, require tailored capability scoring pipelines. ACE-$M^3$ exemplifies a modular, multimodal architecture that decomposes evaluation into Expression, Medical Knowledge Correctness, and Patient Question Relevance branches, with composite scoring via a weighted aggregation of sub-domain scores. Training employs reward-token-based direct preference optimization (RTDPO) for efficient, high-granularity preference modeling, achieving close alignment with human raters and outperforming both open- and closed-source baselines in medical assessment (Zhang et al., 16 Dec 2024).
This modular branch-then-merge structure, along with domain-adapted annotation and vision encoders, supports transfer to other specialized contexts and demonstrates the flexibility of capability-centric scoring approaches.
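As a toy illustration of the composite aggregation step only (the branch names and weights below are placeholders, not the values used by ACE-$M^3$):

```python
def composite_score(branch_scores: dict[str, float],
                    weights: dict[str, float]) -> float:
    """Weighted aggregation of per-branch scores (e.g., expression quality,
    knowledge correctness, question relevance) into one composite capability score."""
    total_w = sum(weights[k] for k in branch_scores)
    return sum(weights[k] * v for k, v in branch_scores.items()) / total_w

# Example with placeholder weights: knowledge correctness weighted most heavily.
score = composite_score({"expression": 0.82, "knowledge": 0.91, "relevance": 0.74},
                        {"expression": 0.2, "knowledge": 0.5, "relevance": 0.3})
```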
6. Scoring Rules, Calibration, and Fair Model Comparison
Classical model scoring functions (squared error, Brier, logarithmic, spherical, CRPS) play a central role in capability assessment, especially for probabilistic or distributional predictions (Machete, 2011, Fissler et al., 2022). Proper scoring rules ensure incentive compatibility (eliciting the target functional, e.g., predictive mean or quantile) and are foundational for out-of-sample model comparison.
Key distinctions among standard rules are:
- Logarithmic score: Penalizes overconfidence, biases toward higher-entropy forecasts, suitable when extreme over-certainty is especially costly.
- Spherical score: Penalizes underconfidence, prefers low-entropy/concentrated forecasts.
- Brier/quadratic/CRPS: Symmetric, L2-based, no entropy bias.
Selection of the scoring rule should align with the real-world risk profile of over- vs. under-confidence and the prediction target functional (mean, quantile, class probability, etc.). Calibration assessment—via identification functions, reliability diagrams, and hypothesis testing—complements scoring by probing systematic deviations from truthful reporting and fortifies cross-model reliability (Fissler et al., 2022).
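For concreteness, these standard rules can be written compactly as losses over categorical forecasts (and, for CRPS, over forecast samples); this is a generic sketch rather than any particular benchmark harness’s implementation.

```python
import numpy as np

def log_loss(p: np.ndarray, y: int) -> float:
    """Logarithmic score as a loss (-log p_y): harshly penalizes confident misses."""
    return float(-np.log(p[y]))

def brier_loss(p: np.ndarray, y: int) -> float:
    """Brier/quadratic score: symmetric L2 distance to the one-hot outcome."""
    e = np.zeros_like(p)
    e[y] = 1.0
    return float(np.sum((p - e) ** 2))

def spherical_loss(p: np.ndarray, y: int) -> float:
    """Negatively oriented spherical score, 1 - p_y / ||p||_2; favors concentrated forecasts."""
    return float(1.0 - p[y] / np.linalg.norm(p))

def crps_from_samples(samples: np.ndarray, y: float) -> float:
    """CRPS estimated from forecast samples: E|X - y| - 0.5 * E|X - X'|."""
    term1 = np.mean(np.abs(samples - y))
    term2 = 0.5 * np.mean(np.abs(samples[:, None] - samples[None, :]))
    return float(term1 - term2)
```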
7. Practical Applications and Future Directions
Model capability scoring frameworks underpin a range of applied tasks:
- Representative subset construction for efficient evaluation (Wang et al., 13 Aug 2025).
- Dynamic model routing based on capability profiles (Zhang et al., 24 Feb 2025).
- Data selection and curation via multidimensional coverage and balance (Mo et al., 29 Sep 2025).
- Automated capability and failure mapping for audit, safety, and regulatory compliance (Lu et al., 11 Feb 2025, Afkanpour et al., 22 May 2025).
- Specialization and transfer to domain- or modality-specific evaluators (Zhang et al., 16 Dec 2024).
Challenges remain in defining universally robust “trying” tests for high-level capabilities, integrating mechanism-centric and behavioral perspectives, scaling automated discovery, and efficiently supporting emerging task categories. Ongoing research explores richer capability embeddings, adaptive and disentangled scoring, more interpretable mechanism probes, and further reduction in human oversight.
In summary, model capability scoring is a rapidly maturing pillar of foundation model evaluation, integrating formal semantics, interpretability, automated testing, multidimensional coverage, and fair statistical inference into theoretically grounded and empirically validated pipelines that support both general and specialized model assessment.