Counterfactual Evaluation Score (CES)
- A Counterfactual Evaluation Score (CES) is a quantitative metric that evaluates a model or explanation under hypothetical interventions; in the explanation-faithfulness setting it balances prediction flip rate against perturbation cost.
- CES formulations draw on causal and counterfactual inference and cover diverse settings, including abstention, sequential trajectories, and counterfactual risk evaluation.
- Empirical studies report that CES-style metrics track explanation faithfulness more closely than erasure-based scores and align with outcomes of interest in applications such as healthcare and climate analysis.
A Counterfactual Evaluation Score (CES) is any quantitative metric that evaluates the potential performance, quality, or faithfulness of an algorithm or explanation under counterfactual scenarios—that is, situations generated by targeted hypothetical interventions on input variables, policies, or decisions. CES is not a single standardized metric; rather, distinct notions of CES have been introduced in the literature to address model performance under abstention, explanation faithfulness, risk under alternate decisions, or progress toward desired outcomes in sequential or time-series tasks. CES unifies a class of evaluation techniques grounded in causal or counterfactual inference, and typically combines predictions from models with the observed or imputed responses to counterfactual queries.
1. Formal Definitions and Core Principles
CES metrics are generically designed to assess how a model, explanation, or action would have performed if subjected to hypothetical perturbations or alternative decisions. The formal structure of CES varies by context, but always incorporates three main components:
- Counterfactual Scenario Construction: Generation or specification of interventions (e.g., editing important features, forcing predictions on abstained samples, evaluating under alternative treatment assignments, or targeting an explicit counterfactual state).
- Response or Score Evaluation: Model output or quality metric computed under the counterfactual scenario, such as prediction loss, class probability, or distance to a desired outcome.
- Aggregation and Normalization: Population- or trajectory-level summarization, typically via expected loss, ratios, or normalized alignment scores.
The unifying feature is that the evaluation explicitly considers hypothetical, never-actually-observed settings aligned with a causal or policy-relevant counterfactual query.
2. CES for Faithfulness of Explanations
The canonical CES in explainable AI is the faithfulness score for feature-attribution explanations (Ge et al., 2021). It is defined as the fraction of instances for which intervening on the features deemed “important” by an explanation flips the model’s prediction, normalized by the proximity or perturbation cost of those interventions. For a model $f$, input $x$, and explanation $E$ (a subset of features), the hard-label CES compares $f(x)$ with $f(x')$, where $x'$ is a counterfactual that edits only the features in $E$ and $f(x')$ is the model’s output on $x'$.
Proximity is measured by a domain-appropriate distance function (e.g., Euclidean, edit, or cosine distance). Larger CES values indicate that the features named by the explanation can flip the prediction through minimal and plausible edits, and are therefore more likely to be causally responsible for the model’s output.
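As a minimal sketch of this idea, the Python below computes a plain flip rate and a cost-normalized variant for a toy threshold model. The aggregation used here (a flip indicator divided by one plus the edit distance, averaged over instances) is only one plausible reading of "flip rate normalized by perturbation cost" and is not taken from Ge et al.; `model`, `make_counterfactual`, and `ces_hard_label` are hypothetical names introduced for illustration.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def ces_hard_label(model, inputs, explanations, make_counterfactual,
                   distance=euclidean):
    """Return (flip_rate, cost_normalized_score) over a set of instances.

    Assumption: each flip contributes 1 / (1 + distance), so cheaper
    flips count more. This weighting is illustrative only.
    """
    flips = 0
    weighted = 0.0
    for x, feats in zip(inputs, explanations):
        x_cf = make_counterfactual(x, feats)   # edits only features in feats
        if model(x_cf) != model(x):            # hard-label flip
            flips += 1
            weighted += 1.0 / (1.0 + distance(x, x_cf))
    n = len(inputs)
    return flips / n, weighted / n

# Toy model: predicts 1 when the two features sum to more than 1.
def model(x):
    return int(x[0] + x[1] > 1.0)

# Toy counterfactual generator: shift the explained features by a fixed
# step in the direction that opposes the current prediction.
def make_counterfactual(x, feats, step=1.0):
    direction = -1.0 if model(x) == 1 else 1.0
    return [xi + direction * step if i in feats else xi
            for i, xi in enumerate(x)]

inputs = [[0.9, 0.4], [0.2, 0.3], [2.5, 0.1]]
explanations = [{0}, {1}, {1}]   # feature indices named by each explanation

flip_rate, cost_normalized = ces_hard_label(
    model, inputs, explanations, make_counterfactual)
```

In this toy data the third explanation names a feature whose edit does not change the prediction, so it lowers both scores; a real counterfactual generator would search for the smallest plausible edit rather than apply a fixed step.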
This CES improves on erasure-based metrics by constructing realistic counterfactuals and explicitly balancing the flip rate (validity) against intervention cost (proximity). In empirical studies, it exhibits higher correlation with “oracle” ground-truth explanation quality than deletion or masking scores (Ge et al., 2021).
3. CES in Sequential and Trajectory-Based Settings
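For sequential decision processes and time series, the main CES formulation is the Trajectory Counterfactual Explanation (TraCE) score (Clark et al., 2023). Here, CES quantifies local progress toward a user-defined or model-derived counterfactual target at each step in a trajectory. At time $t$:
- Let $x_t$ be the current state, $x_{t+1}$ the next state, and $x'_t$ the counterfactual target.
- Compute the factual step $v_t = x_{t+1} - x_t$ and the desired step $v'_t = x'_t - x_t$.
- Two alignment components:
  - Angle score: $R_1 = \dfrac{v_t \cdot v'_t}{\lVert v_t \rVert \, \lVert v'_t \rVert}$, the cosine of the angle between the factual and desired steps.
  - Landing score: $R_2$, a distance-based term measuring how close the projected landing point of the factual step comes to the target $x'_t$.
- Combine into the TraCE score:
$S_t = \lambda R_1 + (1 - \lambda) R_2$
where $\lambda \in [0, 1]$ weights direction vs. distance. The score is bounded, and higher values indicate movement toward $x'_t$. In practice, multiple counterfactual targets may be averaged per step.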
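The per-step computation can be sketched in a few lines of Python. The angle term is the cosine between the factual and desired steps; the landing term below is assumed to be `1 / (1 + distance)` between the next state and the target, and the two are mixed with a weight `lam`. That landing form and the convex combination are assumptions for illustration rather than a verified transcription of the TraCE paper, and `trace_score` is a hypothetical name.

```python
import math

def _norm(v):
    return math.sqrt(sum(c * c for c in v))

def trace_score(x_t, x_next, x_target, lam=0.5):
    """Score one step of a trajectory against a counterfactual target.

    lam weights the angle (direction) term against the landing
    (distance) term. Both terms are sketched under stated assumptions.
    """
    factual = [b - a for a, b in zip(x_t, x_next)]     # step actually taken
    desired = [b - a for a, b in zip(x_t, x_target)]   # step toward target
    nf, nd = _norm(factual), _norm(desired)

    # Angle term: cosine similarity, defined as 0 when either step is zero.
    if nf == 0.0 or nd == 0.0:
        angle = 0.0
    else:
        angle = sum(f * d for f, d in zip(factual, desired)) / (nf * nd)

    # Landing term (assumed form): closer landing points score higher.
    gap = _norm([b - a for a, b in zip(x_next, x_target)])
    landing = 1.0 / (1.0 + gap)

    return lam * angle + (1.0 - lam) * landing

toward = trace_score([0.0, 0.0], [1.0, 0.0], [2.0, 0.0])    # moves at target
away = trace_score([0.0, 0.0], [-1.0, 0.0], [2.0, 0.0])     # moves away
```

With several counterfactual targets per step, one option is to call `trace_score` once per target and average the results.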
TraCE enables meaningful quantification of alignment with sequential goals across diverse domains, such as ICU-patient health trajectories or climate change indicators (Clark et al., 2023).
4. CES for Counterfactual Risk, Evaluation, and Fairness
In causal evaluation of algorithmic risk scores and decision models, CES refers to the expected counterfactual performance under alternative policies or treatments (Coston et al., 2019, Choe et al., 2023). Suppose each instance $i$ has observed covariates $X_i$, decision $A_i$ (treatment/action), and outcome $Y_i$. The counterfactual outcome $Y_i^{a}$ is the outcome had $A_i = a$ been assigned. The central metrics are estimands such as:
- Counterfactual true positive rate: $P(\hat{Y} = 1 \mid Y^{a} = 1)$, where $\hat{Y}$ is the model's binary prediction.
- Counterfactual precision: $P(Y^{a} = 1 \mid \hat{Y} = 1)$.
Doubly robust (DR) estimators combine outcome regression and propensity-score modeling, and remain consistent when either of the two nuisance models (though not necessarily both) is correctly specified. For abstaining classifiers, the CES is defined as the expected loss if the classifier were forced to predict everywhere, treating abstention as missing data (Choe et al., 2023).
Identification of these CES metrics requires unconfoundedness (conditional exchangeability) and overlap. Under these conditions, and given sufficiently accurate nuisance estimates, DR estimators are asymptotically normal and $\sqrt{n}$-consistent.
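A small worked sketch may help make the doubly robust construction concrete. The Python below estimates counterfactual precision and true positive rate under a chosen decision `a`, using a discrete covariate so that both nuisance functions (the propensity of receiving `a` and the mean outcome under `a`) reduce to per-stratum averages. The row layout `(x, a_obs, y, y_hat)`, the function name `dr_counterfactual_metrics`, and the toy data are all assumptions for illustration; real applications would use fitted models and sample splitting.

```python
from collections import defaultdict

def dr_counterfactual_metrics(rows, a=0):
    """Doubly robust estimates of counterfactual precision and TPR.

    rows: list of (x, a_obs, y, y_hat) with discrete covariate x,
    observed decision a_obs, binary outcome y, binary prediction y_hat.
    """
    n_x = defaultdict(int)        # stratum sizes
    n_xa = defaultdict(int)       # stratum sizes among rows with a_obs == a
    sum_y = defaultdict(float)    # outcome totals among rows with a_obs == a
    for x, a_obs, y, _ in rows:
        n_x[x] += 1
        if a_obs == a:
            n_xa[x] += 1
            sum_y[x] += y

    phi_total = phi_flagged = 0.0
    n_flagged = 0
    for x, a_obs, y, y_hat in rows:
        if n_xa[x] == 0:
            raise ValueError(f"overlap violated in stratum {x!r}")
        pi = n_xa[x] / n_x[x]     # propensity of decision a in this stratum
        mu = sum_y[x] / n_xa[x]   # outcome regression under decision a
        # DR pseudo-outcome: regression plus inverse-weighted residual.
        phi = mu + ((y - mu) / pi if a_obs == a else 0.0)
        phi_total += phi
        if y_hat == 1:
            phi_flagged += phi
            n_flagged += 1

    precision = phi_flagged / n_flagged   # estimates P(Y^a = 1 | y_hat = 1)
    tpr = phi_flagged / phi_total         # estimates P(y_hat = 1 | Y^a = 1)
    return precision, tpr

rows = [
    ("low", 0, 0, 0), ("low", 0, 1, 0), ("low", 1, 0, 0), ("low", 0, 0, 0),
    ("high", 0, 1, 1), ("high", 0, 1, 1), ("high", 1, 0, 1), ("high", 1, 1, 1),
]
cf_precision, cf_tpr = dr_counterfactual_metrics(rows, a=0)
```

In this toy data the observed precision among flagged rows is 0.75, while the estimate under decision `a = 0` is higher, because the flagged rows that received the other decision had their outcomes altered by it. That gap is the kind of discrepancy counterfactual evaluation is meant to expose.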
| CES Context | Counterfactual Query | Key Metric Structure |
|---|---|---|
| Explanation Faithfulness | Model flips via critical edits | Flip rate / intervention cost |
| Sequential Alignment | Progress toward target outcome | TraCE alignment score (angle and landing terms) |
| Risk/Fairness Evaluation | Alternate policy/treatment risk | DR-estimated performance/fairness |
| Abstention | Performance if forced to predict | Population loss under no abstain |
5. Algorithmic Procedures and Computation
The algorithmic realization of CES is context-dependent but follows a common pattern:
- Counterfactual Construction: For each input or time step, produce a minimally perturbed input or trajectory consistent with the counterfactual scenario—using discrete search, gradient descent in embedding space, or temporal interventions.
- Model Evaluation: Apply the model to the counterfactual and record outputs (hard label, probability, or loss).
- Score Aggregation: Compute the primary metric (flip rate, average alignment, expected loss) and normalize by perturbation cost or over the test population.
- Robust Estimation: In risk/fairness evaluation, employ inverse-probability weighting, outcome regression, and doubly robust techniques to adjust for missing data or non-random assignment.
- Time Aggregation (for trajectories): Use instantaneous, cumulative, or time-weighted averages to express overall CES.
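For the time-aggregation step, a short Python sketch shows three ways per-step scores might be summarized. The mode names and the exponential recency weighting are assumptions chosen for illustration; the source does not prescribe a particular weighting scheme.

```python
def aggregate_scores(scores, mode="cumulative", decay=0.9):
    """Summarize a list of per-step CES values for one trajectory.

    mode:
      "instantaneous" - score of the most recent step only
      "cumulative"    - unweighted mean over all steps
      "time_weighted" - mean with weights decay**(steps ago), so recent
                        steps count more (assumed weighting scheme)
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    if mode == "instantaneous":
        return scores[-1]
    if mode == "cumulative":
        return sum(scores) / len(scores)
    if mode == "time_weighted":
        last = len(scores) - 1
        weights = [decay ** (last - t) for t in range(len(scores))]
        return sum(w * s for w, s in zip(weights, scores)) / sum(weights)
    raise ValueError(f"unknown mode: {mode}")

step_scores = [1.0, 0.0, -1.0, 1.0]
latest = aggregate_scores(step_scores, "instantaneous")
overall = aggregate_scores(step_scores, "cumulative")
recent = aggregate_scores(step_scores, "time_weighted", decay=0.5)
```

The choice matters when a trajectory reverses direction: the cumulative mean penalizes an early detour indefinitely, while the time-weighted variant lets recent progress dominate.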
In the presence of high-dimensional data, nearest-neighbor search and regularization are critical. For time series and sequential data, CES-based objectives are often strongly convex in linear cases, enabling analytic solutions; for nonlinear models, gradient-based optimization is used (Kinjo, 2025, Clark et al., 2023).
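To see why a linear setting can admit an analytic solution, consider a generic quadratic objective with a validity term and a proximity term. This is a stand-in chosen for illustration, not the specific formulation of the cited work: $A$ (a linear map from an intervention $z$ to the forecast), $x^{*}$ (the counterfactual target), $z_0$ (a baseline intervention), and $\mu > 0$ (the proximity weight) are all assumed symbols.

```latex
\begin{aligned}
J(z) &= \lVert A z - x^{*} \rVert_2^2 + \mu \lVert z - z_0 \rVert_2^2, \qquad \mu > 0 \\
\nabla J(z) &= 2 A^{\top} (A z - x^{*}) + 2 \mu (z - z_0) = 0 \\
(A^{\top} A + \mu I)\, z &= A^{\top} x^{*} + \mu z_0 \\
z^{\star} &= (A^{\top} A + \mu I)^{-1} \bigl( A^{\top} x^{*} + \mu z_0 \bigr)
\end{aligned}
```

The Hessian is $2(A^{\top} A + \mu I)$, whose eigenvalues are at least $2\mu > 0$, so the objective is strongly convex and $z^{\star}$ is its unique minimizer. When the map from $z$ to the forecast is nonlinear, no such closed form is available in general, which is where gradient-based optimization comes in.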
6. Empirical Studies and Properties
CES-based metrics have been extensively validated in both synthetic and real-world settings:
- Explanation Faithfulness: On benchmark datasets (UCI Adult, movie reviews), CES outperformed erasure-based metrics in agreement with “oracle” explanations, achieving near-perfect Kendall’s $\tau$ and Spearman’s $\rho$ (Ge et al., 2021).
- Trajectory Alignment: In ICU simulation, high TraCE aligned with successful patient discharge, and in climate trajectory analysis, TraCE reliably ranked countries’ alignment with climate scenarios (Clark et al., 2023).
- Risk Assessment: In counterfactual risk evaluation for child welfare, doubly robust CES yielded performance curves matching the true counterfactual outcomes and aligning with domain expert expectations, while standard observational metrics did not (Coston et al., 2019).
- Abstention: On CIFAR-100, CES permitted valid and efficient comparison among abstaining classifiers, enabling performance estimation even on abstained test points (Choe et al., 2023).
- Time-Series Forecasting: For multivariate time series with exogenous interventions, validity (X-loss) and proximity (Z-loss) tracked the efficacy of counterfactual forecasts, and overall total loss provided a convex, interpretable CES (Kinjo, 2025).
These scores are sensitive to choice of intervention, weighting, and corpus—requiring domain-aware parameterization and validation.
7. Limitations and Best Practices
CES scores depend crucially on the construction and realism of counterfactuals: spurious, out-of-distribution, or infeasible interventions can degrade interpretability and trustworthiness. In high-dimensional settings, k-NN-based counterfactuals may lack semantic coherence (“curse of dimensionality”). For trajectory and time-series CES, local (greedy) alignment does not guarantee eventual attainment of global counterfactual targets (Clark et al., 2023).
Model-agnostic CES constructions avoid dependence on specific decision policies but may only measure local alignment. For doubly robust CES estimation under treatment assignment, identification breaks down under unmeasured confounding or deterministic abstention (Choe et al., 2023, Coston et al., 2019).
Recommended practices include:
- Careful selection/tuning of proximity and angle weights (e.g., the weight $\lambda$ in TraCE).
- Sensitivity analyses on the number and type of counterfactual interventions.
- Ensuring counterfactual plausibility (using domain constraints or generative models).
- Using cross-fitting and nonparametric learners for robust estimation of nuisance parameters in doubly robust (DR) estimators.
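The cross-fitting recommendation can be illustrated for the abstention setting. In the Python sketch below, each row records a discrete covariate, whether the classifier predicted (rather than abstained), and the loss when it did. Nuisances (the probability of predicting and the mean observed loss, both per stratum) are fitted on the other folds and applied to the held-out fold. The row layout, the index-based fold assignment, and the names `fit_nuisances` and `crossfit_forced_loss` are assumptions for illustration, not the procedure of Choe et al.

```python
from collections import defaultdict

def fit_nuisances(rows):
    """Per-stratum P(predicted | x) and mean loss among predicted rows."""
    n = defaultdict(int)
    n_pred = defaultdict(int)
    loss_sum = defaultdict(float)
    for x, predicted, loss in rows:
        n[x] += 1
        if predicted:
            n_pred[x] += 1
            loss_sum[x] += loss
    pi = {x: n_pred[x] / n[x] for x in n}
    mu = {x: loss_sum[x] / n_pred[x] for x in n if n_pred[x] > 0}
    return pi, mu

def crossfit_forced_loss(rows, k=2):
    """Cross-fitted DR estimate of the loss if forced to predict everywhere."""
    total = 0.0
    for fold in range(k):
        train = [r for i, r in enumerate(rows) if i % k != fold]
        held_out = [r for i, r in enumerate(rows) if i % k == fold]
        pi, mu = fit_nuisances(train)   # nuisances never see held-out rows
        for x, predicted, loss in held_out:
            if pi.get(x, 0.0) == 0.0:
                raise ValueError(f"no predicted rows to learn from in {x!r}")
            m = mu[x]
            # Regression estimate plus inverse-weighted residual when observed.
            total += m + ((loss - m) / pi[x] if predicted else 0.0)
    return total / len(rows)

# (stratum, predicted?, loss); loss is None where the classifier abstained.
rows = [
    ("hard", True, 1.0), ("hard", True, 1.0),
    ("hard", False, None), ("hard", False, None),
    ("easy", True, 0.0), ("easy", True, 0.0),
    ("easy", True, 1.0), ("easy", True, 0.0),
]
forced_loss = crossfit_forced_loss(rows, k=2)
selective_loss = sum(l for _, p, l in rows if p) / sum(1 for _, p, _ in rows if p)
```

Because the toy classifier abstains only on the hard stratum, the loss measured on predicted rows alone understates what it would incur if forced to predict everywhere. A stratum where the classifier always abstains raises an error, which mirrors the overlap requirement.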
CES metrics provide a principled, causally coherent toolkit for evaluating machine learning models and explanations under hypothetical, policy-relevant scenarios. Despite variant formulations, their shared core is the explicit measurement of performance, progress, or faithfulness with respect to targeted, feasible counterfactual alternatives.
References:
- Ge et al., 2021
- Clark et al., 2023
- Choe et al., 2023
- Coston et al., 2019
- Kinjo, 2025