Interpretation of the unified capability score
Determine an interpretable semantics for the scalar capability score C_m produced by the benchmark-stitching model, which predicts the performance of model m on benchmark b as score(m, b) = σ(α_b(C_m − D_b)), where σ is the sigmoid function, D_b is the difficulty of benchmark b, and α_b is a benchmark-specific slope. The goal is for specific values (e.g., C_m = 3) to correspond to clearly understood, practical capabilities that can be communicated without reference to the original dataset or fitting procedure.
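One way to see why a raw value such as C_m = 3 carries little meaning on its own is to note that the formula depends only on the product α_b(C_m − D_b). The sketch below is illustrative rather than the paper's procedure: it evaluates the formula for a few hypothetical benchmarks (the names and the D_b, α_b values are invented here), then applies the rescaling C → aC + c, D → aD + c, α → α/a and checks that every predicted score is unchanged. Whether and how the original fit pins down this scale is not stated in this entry, so the example treats it as a free choice.

```python
import math

def predicted_score(capability, difficulty, slope):
    """Sigmoid link from the problem statement: sigma(alpha_b * (C_m - D_b))."""
    return 1.0 / (1.0 + math.exp(-slope * (capability - difficulty)))

# Hypothetical benchmark parameters (D_b, alpha_b), invented for illustration only.
benchmarks = {
    "easy_bench": (1.0, 2.0),
    "mid_bench": (3.0, 1.0),
    "hard_bench": (5.0, 1.5),
}

def profile(capability, params):
    """Predicted score on every benchmark for one capability value."""
    return {name: predicted_score(capability, d, a) for name, (d, a) in params.items()}

def rescale(capability, params, scale, shift):
    """Apply C -> scale*C + shift, D -> scale*D + shift, alpha -> alpha/scale."""
    new_params = {name: (scale * d + shift, a / scale) for name, (d, a) in params.items()}
    return scale * capability + shift, new_params

original = profile(3.0, benchmarks)
new_c, new_params = rescale(3.0, benchmarks, scale=10.0, shift=-7.0)
rescaled = profile(new_c, new_params)

for name in benchmarks:
    print(f"{name}: C=3.0 -> {original[name]:.4f}, C={new_c} -> {rescaled[name]:.4f}")
```

Under this rescaling the same model is described by C_m = 23 instead of C_m = 3 with identical predictions, so the number only becomes meaningful relative to the benchmark difficulties D_b and slopes α_b it was fitted alongside. In this toy setup, C_m = 3 reads as "expected to score 50% on a benchmark whose difficulty is 3", which is one candidate style of interpretation rather than an established one.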
References
It is unclear what exactly is meant by a "capabilities score" of 3.
— A Rosetta Stone for AI Benchmarks
(2512.00193 - Ho et al., 28 Nov 2025) in Section 5.1 (Discussion: Challenges in interpreting model capability and benchmark difficulty scores)