Interpretation of the unified capability score

Determine an interpretable semantics for the scalar capability score C_m produced by the benchmark-stitching model defined by score(m, b) = σ(α_b(C_m − D_b)), so that specific values (e.g., C_m = 3) correspond to clearly understood, practical capabilities that can be communicated without reference to the original dataset or fitting procedure.

Background

The paper compresses diverse benchmark results into a single numerical capability score C_m for each model and a difficulty score D_b for each benchmark using a sigmoidal mapping. While this facilitates cross-benchmark and cross-time comparisons, the authors note that users may struggle to interpret what any given capability value actually means in practice.
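To make the interpretation problem concrete, here is a minimal sketch of the mapping score(m, b) = σ(α_b(C_m − D_b)). It assumes σ is the standard logistic function and reads α_b as a per-benchmark slope. The benchmark names and the (D_b, α_b) values are hypothetical, chosen only for illustration; they are not fitted values from the paper.

```python
import math


def predicted_score(capability, difficulty, slope):
    """Evaluate score(m, b) = sigmoid(alpha_b * (C_m - D_b)).

    Assumes sigma is the standard logistic function.
    """
    return 1.0 / (1.0 + math.exp(-slope * (capability - difficulty)))


# Hypothetical benchmarks as (difficulty D_b, slope alpha_b).
# These numbers are illustrative assumptions, not values from the paper.
benchmarks = {
    "easy_bench": (1.0, 2.0),
    "matched_bench": (3.0, 2.0),
    "hard_bench": (5.0, 2.0),
}

capability = 3.0  # the C_m = 3 case discussed in the text
predictions = {
    name: predicted_score(capability, difficulty, slope)
    for name, (difficulty, slope) in benchmarks.items()
}

for name, score in predictions.items():
    print(f"{name}: {score:.3f}")
```

Under these assumptions, a benchmark whose difficulty equals the model's capability is predicted at exactly 0.5, since σ(0) = 0.5. One possible reading of C_m = 3 is therefore "roughly 50% on a benchmark of difficulty 3". That reading still depends on knowing what a difficulty of 3 means, so it restates the interpretation problem rather than solving it.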

The discussion section highlights that intuitive judgments of model capabilities often incorporate factors beyond benchmark accuracy (e.g., speed and cost), and that even proposed mappings (such as time-horizon metrics) remain speculative. Clarifying the semantics of the capability scale would make the framework more useful for forecasting, communication, and decision-making.

References

It is unclear what exactly is meant by a "capabilities score" of 3.

— A Rosetta Stone for AI Benchmarks (2512.00193 - Ho et al., 28 Nov 2025) in Section 5.1 (Discussion: Challenges in interpreting model capability and benchmark difficulty scores)
