E-Scores: Rigorous Evaluation Metrics
- E-Scores are evaluation metrics defined by rigorous mathematical formulations, used to quantify correctness, performance, or quality across various domains.
- They adapt to application-specific needs, with variants including token-level scores in NLP, e-value based assessments for LLM outputs, and sector-weighted scores in ESG analysis.
- While offering strong interpretability and robust statistical guarantees, E-Scores may obscure detailed diagnostic insights, necessitating complementary span-level or targeted metrics.
E-Scores are a family of evaluation metrics used across computational and quantitative disciplines—machine learning, computational linguistics, finance, social impact, and decision analysis—to quantify correctness, performance, or quality. While the term itself is general and context-dependent, recent research has produced highly technical and domain-specific implementations of E-Scores, ranging from precision/recall/F1 variants for disfluency removal (Teleki et al., 24 Sep 2025), e-value-based guarantees for generative model correctness (Dhillon et al., 29 Oct 2025), and environmental pillar scores in ESG frameworks (Sahin et al., 2021, Chen, 2023, Bax et al., 2021), to scalar annotation protocols (Sakaguchi et al., 2018). The central properties of E-Scores are their rigorous mathematical foundation, interpretability, and adaptability to application-specific requirements.
1. E-Scores as Token-Level Evaluation Metrics in NLP
The archetypal E-Score suite comprises precision ($P$), recall ($R$), and F1 ($F_1$), defined over true positives ($tp$), false positives ($fp$), and false negatives ($fn$) when evaluating predicted versus gold token alignments: $P = \frac{tp}{tp + fp}, \quad R = \frac{tp}{tp + fn}, \quad F_1 = \frac{2PR}{P + R}$. These E-Scores are widely used for word-level system performance assessment. In the context of spoken language disfluency removal, they measure how well a model eliminates disfluent tokens without excessive deletion of fluent material (Teleki et al., 24 Sep 2025). However, these aggregate scores have limited diagnosticity, failing to reveal which linguistic structures models struggle with (e.g., parentheticals or interjections).
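As a concrete illustration, a minimal sketch of these token-level E-Scores in a disfluency-removal setting (the binary label scheme and the example sentence are illustrative assumptions, not drawn from the cited paper):

```python
# Minimal sketch of token-level E-Scores (precision/recall/F1) for
# disfluency removal, assuming binary per-token labels where 1 marks
# a disfluent token that should be removed.

def e_scores(gold: list[int], pred: list[int]) -> dict[str, float]:
    tp = sum(g == 1 and p == 1 for g, p in zip(gold, pred))
    fp = sum(g == 0 and p == 1 for g, p in zip(gold, pred))
    fn = sum(g == 1 and p == 0 for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"P": precision, "R": recall, "F1": f1}

# "I uh I want that": gold marks the restart "I uh" as disfluent.
gold = [1, 1, 0, 0, 0]
pred = [1, 0, 0, 0, 0]       # model removes the first "I" but keeps "uh"
print(e_scores(gold, pred))  # {'P': 1.0, 'R': 0.5, 'F1': 0.666...}
```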
A plausible implication is that relying exclusively on E-Scores may hinder error analysis and targeted downstream improvement. This limitation has prompted the development of span-level, linguistically-grounded metrics such as Z-Scores (Teleki et al., 24 Sep 2025), which report type-specific (EDITED, INTJ, PRN) removal rates and expose failure modes concealed by E-Scores.
| Property | E-Scores | Z-Scores |
|---|---|---|
| Granularity | Token-level, aggregate | Span-level, typed |
| Diagnosticity | Low | High |
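For contrast, a minimal sketch of a span-level, typed removal-rate computation in the spirit of Z-Scores (the full-span removal criterion and the toy spans are illustrative assumptions, not the paper's exact definition):

```python
# Illustrative span-level removal rates by disfluency type (EDITED,
# INTJ, PRN). A span counts as removed only if every token inside it
# was deleted by the model.
from collections import defaultdict

def typed_removal_rates(spans, removed_token_ids):
    """spans: list of (type, set_of_token_ids); removed_token_ids: set."""
    counts, removed = defaultdict(int), defaultdict(int)
    for span_type, token_ids in spans:
        counts[span_type] += 1
        if token_ids <= removed_token_ids:  # full-span removal
            removed[span_type] += 1
    return {t: removed[t] / counts[t] for t in counts}

spans = [("INTJ", {1}), ("EDITED", {3, 4}), ("PRN", {7, 8, 9})]
print(typed_removal_rates(spans, removed_token_ids={1, 3}))
# {'INTJ': 1.0, 'EDITED': 0.0, 'PRN': 0.0} — the aggregate E-Score would
# hide that EDITED and PRN spans were only partially removed.
```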
2. E-Scores Based on E-Values for Generative Model (In)Correctness
A recent and technically distinctive incarnation is the e-value-based E-Score framework for assessing the (in)correctness of LLM and generative model outputs (Dhillon et al., 29 Oct 2025). Unlike p-value-based conformal prediction, which provides coverage only for pre-chosen error tolerance settings, E-Scores constructed from e-values maintain strict post-hoc statistical guarantees—specifically, controlling a quantity termed size distortion even when users set tolerance thresholds ($\alpha$) adaptively after inspecting the responses.
Given calibration data and oracle scores (often estimated via LLM-based verifiers), the E-Score for a response $\mathbf{y}^{n+1}$ to prompt $x^{n+1}$ takes the general form: $s_{\text{e-score}}(x^{n+1}, \mathbf{y}^{n+1}) = \frac{(n+1)\, f(x^{n+1}, \mathbf{y}^{n+1})}{f(x^{n+1}, \mathbf{y}^{n+1}) + \sum_{i=1}^{n} f^*(x^i, \mathds{O}(x^i, g_\pi(x^i)))}$ where $f$ is a correctness proxy and $f^*$ is maximized over calibration errors.
The critical property is: $\mathbb{E} \left[\frac{\mathbf{1}\left\{\exists\, (\mathbf{Y}, \cdot) \in \mathds{S}_{\alpha}(X, g_\pi(X)) \text{ s.t. } o(X, \mathbf{Y}) = 0 \right\}}{\alpha} \right] \leq 1$ for any post-hoc $\alpha$ (Dhillon et al., 29 Oct 2025).
Empirical evaluations on mathematical factuality and property constraint satisfaction benchmarks indicate that e-score-based assessment enables rigorous, post-hoc filtering of LLM outputs—with expected error rates tightly controlled—unlike p-score approaches subject to p-hacking.
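A minimal numeric sketch of the construction above, assuming a scalar correctness proxy whose values come from an LLM-based verifier (the function names, calibration values, and candidate scores are illustrative, not the paper's implementation):

```python
# Illustrative e-score computation: a test response's proxy score is
# compared against calibration terms f*(x_i, O(x_i, ...)) computed from
# known-incorrect responses. Filtering at threshold 1/alpha keeps the
# expected error rate below alpha even when alpha is picked post hoc
# (a Markov-style bound on the e-value).

def e_score(f_test: float, f_star_calibration: list[float]) -> float:
    n = len(f_star_calibration)
    return (n + 1) * f_test / (f_test + sum(f_star_calibration))

# In practice n is large; four calibration prompts keep the sketch short.
f_star_calibration = [0.02, 0.01, 0.03, 0.015]

candidates = {"resp_a": 0.90, "resp_b": 0.12}  # verifier scores f(x, y)
alpha = 0.25  # tolerance chosen *after* inspecting the responses

kept = {name: round(e_score(f, f_star_calibration), 2)
        for name, f in candidates.items()
        if e_score(f, f_star_calibration) >= 1 / alpha}
print(kept)  # {'resp_a': 4.62}: only resp_a clears the 1/alpha = 4 bar
```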
| Guarantee | P-score | E-score (e-value) |
|---|---|---|
| Post-hoc valid | No | Yes |
| Size distortion controlled | No | Yes |
3. Environmental E-Scores in ESG Analysis
In sustainability and financial risk contexts, the "E-Score" (Environmental pillar score) is a component of the ESG (Environmental, Social, Governance) rating system (Sahin et al., 2021, Chen, 2023, Bax et al., 2021). The E-Score is typically formed as a sector-weighted linear aggregation: $E = \sum_{k} w_k \, c_k$, where the $w_k$ are sector-specific weights applied to the respective category scores $c_k$, commonly sourced from reporting data and subject to annual methodological updates.
Refinitiv, for example, produces E-Scores by aggregating category scores for Resource Use, Emissions, and Environmental Innovation; sector weights vary (e.g., Emissions weighted at 0.17 for Energy, but only 0.02 for Banking) (Chen, 2023). Reliability concerns are addressed via assessment of the impact of missing data (the M-pillar), multi-pillar optimization, and robust regressions, with R-squared values approaching 1.0 for the weighted aggregation model.
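A minimal sketch of such a sector-weighted aggregation (the Emissions weights echo the figures cited above; the remaining weights and category scores are made-up placeholders, and the normalization choice is an assumption):

```python
# Illustrative sector-weighted E-Score aggregation over environmental
# category scores in [0, 1], in the style of Refinitiv's E pillar.

SECTOR_WEIGHTS = {
    "Energy":  {"Resource Use": 0.11, "Emissions": 0.17, "Environmental Innovation": 0.04},
    "Banking": {"Resource Use": 0.02, "Emissions": 0.02, "Environmental Innovation": 0.07},
}

def e_score(sector: str, category_scores: dict[str, float]) -> float:
    weights = SECTOR_WEIGHTS[sector]
    total = sum(weights.values())
    # Normalize so the pillar score stays on the same [0, 1] scale.
    return sum(weights[c] * category_scores[c] for c in weights) / total

print(e_score("Energy", {"Resource Use": 0.8,
                         "Emissions": 0.4,
                         "Environmental Innovation": 0.6}))  # 0.5625
```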
A plausible implication is that E-Scores—when deployed as exclusionary portfolio filters—may misclassify companies due to missing disclosures rather than actual poor environmental performance. The ESGM (Environmental, Social, Governance, Missing) scoring methodology further quantifies and penalizes such missing data, improving risk sensitivity (Sahin et al., 2021).
4. E-Scores in Efficient Annotation Protocols
For dataset construction and evaluation, E-Scores also denote scalar labels assigned by human annotators (Sakaguchi et al., 2018). The EASL (Efficient Annotation of Scalar Labels) protocol blends direct assessment (scalar assignment per instance) and pairwise ranking aggregation, representing each item's true score as a posterior Beta distribution: $s_i \sim \mathrm{Beta}(\alpha_i, \beta_i)$, updated online via $\alpha_i \leftarrow \alpha_i + r_i$ and $\beta_i \leftarrow \beta_i + (1 - r_i)$, where $r_i \in [0, 1]$ is an annotator's normalized score.
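A minimal sketch of this online update, assuming a uniform Beta(1, 1) prior and a mean-based score estimate (both assumptions for illustration):

```python
# EASL-style online Beta updates: each item keeps a Beta(alpha, beta)
# posterior over its latent score in [0, 1], and each normalized
# annotator rating r nudges the posterior.

class EaslItem:
    def __init__(self, alpha: float = 1.0, beta: float = 1.0):
        self.alpha, self.beta = alpha, beta  # uniform Beta(1, 1) prior

    def update(self, r: float) -> None:
        """r: annotator's normalized score in [0, 1]."""
        self.alpha += r
        self.beta += 1.0 - r

    @property
    def mean(self) -> float:
        return self.alpha / (self.alpha + self.beta)

item = EaslItem()
for rating in (0.9, 0.7, 0.8):
    item.update(rating)
print(round(item.mean, 3))  # posterior mean after three ratings: 0.68
```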
Empirical results indicate EASL reaches oracle-level correlation at roughly half the annotation cost of pure direct assessment, with additional robustness from its bounded score range and efficient active selection of instances.
5. E-Scores in Decision Analysis and Outranking
Electre-Score methods extend E-Scores to multi-criteria decision analysis (MCDA) by employing outranking relations rather than compensatory aggregation (Figueira et al., 2019). Alternatives are ranked into score intervals based on their relation to reference sets with pre-assigned scores via procedures such as the deck-of-cards technique and Electre Tri-nB. This robustifies evaluations against imperfect knowledge, avoids compensatory masking of poor performance, and yields stable, interpretable intervals rather than fragile point scores.
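A deliberately simplified sketch of interval assignment via outranking against reference profiles; real Electre-Score and Electre Tri-nB procedures also involve discordance indices and veto thresholds, so the concordance-only rule below is an illustrative assumption:

```python
# Toy outranking assignment: an alternative is placed at the highest
# reference level it outranks, where "outranks" here means the share of
# criteria on which it is at least as good (concordance) clears a cutoff.
# Counting criteria, rather than summing scores, is what makes the rule
# noncompensatory: one strong criterion cannot mask a weak one.

def concordance(alt: list[float], ref: list[float]) -> float:
    return sum(a >= r for a, r in zip(alt, ref)) / len(ref)

def assign_interval(alt, reference_profiles, cutoff=0.6):
    """reference_profiles: list of (label, profile), sorted worst to best."""
    assigned = reference_profiles[0][0]
    for label, profile in reference_profiles:
        if concordance(alt, profile) >= cutoff:
            assigned = label  # keep the best reference level outranked
    return assigned

refs = [("low", [0.2, 0.2, 0.2]), ("mid", [0.5, 0.5, 0.5]), ("high", [0.8, 0.8, 0.8])]
print(assign_interval([0.6, 0.7, 0.4], refs))  # 'mid'
```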
6. Domain-Specific Variants and Limitations
While E-Scores offer mathematically precise and interpretable evaluation in varied contexts, domain-specific limitations are recognized:
- Aggregation can obscure category- or linguistically-specific weaknesses (as in NLP disfluency tasks; see Z-Scores comparisons (Teleki et al., 24 Sep 2025)).
- In correctness-filtering for generative models, coverage guarantees can be invalidated by adaptively setting thresholds post-hoc, unless e-value foundations are used (Dhillon et al., 29 Oct 2025).
- For ESG analysis, static sector weights and missing data may compromise risk informativeness and longitudinal comparability (Sahin et al., 2021, Chen, 2023).

A plausible implication is that E-Scores should be complemented by span-level, category-specific, or robustness-enhancing metrics depending on the application.
7. Summary and Impact
E-Scores—a term encompassing a spectrum of rigorous, formally defined evaluation metrics—serve as foundational tools for correctness, quality, and risk assessment in machine learning, finance, linguistics, and decision analysis. Their design benefits from domain-specific adaptation (token-level, span-level, e-value, outranking, etc.) with empirical evidence and theoretical guarantees substantiating their effectiveness. However, awareness of their aggregation-induced blind spots and post-hoc validity constraints remains essential for advanced deployment in research and practice. Researchers are encouraged to supplement E-Scores with orthogonal diagnostics for maximal analytic value.