Rashomon Sets: Evaluating Model Multiplicity

Updated 12 November 2025
  • Rashomon sets are defined as collections of models that meet a specified performance tolerance relative to the optimal model, quantifying model multiplicity in machine learning.
  • They enable aggregation of explanations—such as partial dependence profiles—to capture interpretive uncertainty and identify variability in feature effects.
  • Empirical studies demonstrate that using Rashomon sets in AutoML exposes significant discrepancies in single-model explanations, urging caution in trusting singular interpretations.

A Rashomon set is the collection of all models within a given hypothesis class whose predictive performance is nearly indistinguishable from the optimal model under a specified loss function and tolerance. This concept formalizes the empirical observation, originally articulated by Breiman, that many equally accurate but structurally distinct models can exist for a single dataset, and these models can yield divergent—but equally valid—interpretations or explanations. The Rashomon set has become foundational in research on interpretability, uncertainty quantification, fairness, and robustness in automated machine learning (AutoML) and explainable AI.

1. Formal Definition and Foundational Properties

Let $\mathcal{F}$ be a hypothesis class and $\mathcal{L}(f)$ a loss (or risk) functional on models $f \in \mathcal{F}$. The optimal model $f^*$ is defined as

$$f^* = \arg\min_{f \in \mathcal{F}} \mathcal{L}(f)$$

Given a user-specified tolerance $\epsilon \ge 0$, the $\epsilon$-Rashomon set is

$$\mathcal{R}(\epsilon) = \{ f \in \mathcal{F} \mid \mathcal{L}(f) \le \mathcal{L}(f^*) + \epsilon \}$$

For multiplicative metrics,

$$\mathcal{R}_\epsilon = \{ M \in \mathcal{M} \mid \phi(M) \le \phi(M^*)(1 + \epsilon) \}$$

where $\phi$ denotes a performance metric (e.g. mean squared error) and $M^*$ the best model.

Practical selection of $\epsilon$ is driven by domain-specific error tolerance, e.g., $\epsilon = 0.05$ for a 5% performance gap, or by inspecting the trade-off between model multiplicity and incurred risk.
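
Both constructions reduce to a simple filter over candidate losses. A minimal sketch in Python, assuming only a vector of precomputed loss or metric values for a finite candidate pool (the function names are illustrative, not from the paper):

```python
import numpy as np

def additive_rashomon(losses, epsilon):
    """R(eps): indices of models whose loss is within an additive
    epsilon of the best loss."""
    losses = np.asarray(losses, dtype=float)
    return np.flatnonzero(losses <= losses.min() + epsilon)

def multiplicative_rashomon(scores, epsilon):
    """R_eps: indices of models within a (1 + epsilon) multiplicative
    factor of the best (lower-is-better) metric, e.g. RMSE."""
    scores = np.asarray(scores, dtype=float)
    return np.flatnonzero(scores <= scores.min() * (1.0 + epsilon))
```

For example, `multiplicative_rashomon([0.90, 0.93, 1.10], 0.05)` returns indices `[0, 1]`, since only the first two scores fall below the threshold $0.90 \times 1.05 = 0.945$.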

2. Rashomon Set Identification Algorithms in AutoML

AutoML systems, such as H2O AutoML, output a candidate model set $\{M_1, \dots, M_K\}$. The Rashomon set extraction proceeds as follows (a code sketch appears after the list):

  1. Evaluate $\phi(M_k)$ for each $M_k$ on validation/test data.
  2. Determine $M^* = \arg\min_k \phi(M_k)$.
  3. Select $R_\epsilon = \{ M_k : \phi(M_k) \le \phi(M^*)(1+\epsilon) \}$.
  4. Optionally, prune candidates early if partial validation loss exceeds the Rashomon threshold.
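
A minimal sketch of steps 1–3, using validation RMSE as $\phi$ and assuming scikit-learn-style candidates with a `predict` method (H2O's actual leaderboard API differs, so treat this as illustrative rather than the system's code):

```python
import numpy as np

def extract_rashomon_set(models, X_val, y_val, epsilon=0.05):
    """Return the candidates within (1 + epsilon) of the best validation RMSE."""
    # Step 1: evaluate phi(M_k) for each candidate on held-out data.
    rmse = np.array([
        np.sqrt(np.mean((y_val - m.predict(X_val)) ** 2)) for m in models
    ])
    # Step 2: phi(M*) is the smallest validation error.
    best = rmse.min()
    # Step 3: apply the multiplicative Rashomon threshold.
    selected = [m for m, s in zip(models, rmse) if s <= best * (1.0 + epsilon)]
    return selected, rmse
```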

Computational cost is $O(Kn)$ for validation over $K$ models and $n$ data points. Once $R_\epsilon$ is known, subsequent computations (e.g. partial dependence profiles) scale as $O(|R_\epsilon| \, m \, n)$, where $m$ is the evaluation grid size per feature.

Incremental or early stopping strategies allow efficient pruning of large candidate pools.

3. Aggregation and Quantification of Explanation Uncertainty

To capture interpretive variability, the Rashomon set approach aggregates explanations across all near-optimal models. For partial dependence profiles (PDP), the procedure is as follows:

  • For feature $X_j$, grid $\{x_1, \dots, x_m\}$, and each model $M_k$, compute:

$$\widehat{PDP}_j^{(k)}(x_\ell) = \frac{1}{n} \sum_{i=1}^n \hat f^{(k)}(x_\ell, \mathbf{x}_{i,-j})$$

  • Aggregate via uniform averaging:

$$\overline{PDP}_j(x_\ell) = \frac{1}{|R_\epsilon|} \sum_{M_k \in R_\epsilon} \widehat{PDP}_j^{(k)}(x_\ell)$$

  • Quantify uncertainty with bootstrap confidence intervals $CI_j(x_\ell)$:

$$CI_j(x_\ell) = \left[ Q_{\alpha/2}\{\overline{PDP}_j^{(b)}(x_\ell)\},\; Q_{1-\alpha/2}\{\overline{PDP}_j^{(b)}(x_\ell)\} \right]$$

using $B$ bootstrap samples $R_\epsilon^{(b)}$; a code sketch of this procedure follows below.
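
A minimal sketch of the aggregation, assuming each Rashomon-set member is a callable mapping a feature matrix to predictions and that the bootstrap resamples models (function and argument names are illustrative assumptions):

```python
import numpy as np

def pdp_profile(predict, X, j, grid):
    """PDP-hat of feature j for one model: mean prediction with
    column j clamped to each grid value x_ell."""
    profile = np.empty(len(grid))
    for ell, x in enumerate(grid):
        Xc = X.copy()
        Xc[:, j] = x
        profile[ell] = predict(Xc).mean()
    return profile

def rashomon_pdp(predicts, X, j, grid, alpha=0.05, B=1000, seed=0):
    """Uniform average of per-model PDPs over the Rashomon set, with
    pointwise bootstrap confidence intervals from B model resamples."""
    profiles = np.stack([pdp_profile(p, X, j, grid) for p in predicts])
    mean_pdp = profiles.mean(axis=0)
    rng = np.random.default_rng(seed)
    K = len(profiles)
    # Each bootstrap replicate resamples the model pool with replacement
    # and recomputes the averaged profile.
    boot_means = np.stack([
        profiles[rng.integers(0, K, size=K)].mean(axis=0) for _ in range(B)
    ])
    lo = np.quantile(boot_means, alpha / 2, axis=0)
    hi = np.quantile(boot_means, 1 - alpha / 2, axis=0)
    return mean_pdp, lo, hi
```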

Two metrics characterize the agreement between the best-model PDP and the Rashomon PDP:

  • Coverage Rate (CR):

$$CR_j = \frac{1}{m} \sum_{\ell=1}^m \mathbf{1}\big(PDP_j^*(x_\ell) \in CI_j(x_\ell)\big)$$

  • Mean Width of Confidence Interval (MWCI):

$$MWCI_j = \frac{1}{m}\sum_{\ell=1}^m \Big( Q_{1-\alpha/2}\{\overline{PDP}_j^{(b)}(x_\ell)\} - Q_{\alpha/2}\{\overline{PDP}_j^{(b)}(x_\ell)\} \Big)$$

Low $CR_j$ or high $MWCI_j$ signals high epistemic uncertainty about the feature's effect.
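
Given the best model's PDP and the interval endpoints from the previous sketch, both metrics reduce to a few lines (again an illustrative sketch, not the authors' code):

```python
import numpy as np

def coverage_rate(pdp_best, lo, hi):
    """CR_j: fraction of grid points where the best model's PDP lies
    inside the bootstrap confidence interval."""
    return float(np.mean((pdp_best >= lo) & (pdp_best <= hi)))

def mean_ci_width(lo, hi):
    """MWCI_j: average width of the confidence interval over the grid."""
    return float(np.mean(hi - lo))
```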

4. Empirical Evidence and Observed Patterns

Empirical evaluation was conducted over 35 regression tasks from the OpenML-CTR23 benchmark using H2O AutoML (up to 20 models, 360 s runtime, $\epsilon = 0.05$). The principal findings include:

  • In $\sim$80% of datasets, the Rashomon PDP covered less than 70% of the best model's PDP, i.e., $CR_j < 0.7$ for a typical feature.
  • MWCI spanned several orders of magnitude (e.g., $MWCI \approx 0.0003$ on grid_stability versus $MWCI \approx 8782$ on california_housing), demonstrating substantial heterogeneity in explanation uncertainty across datasets.
  • Strong negative correlation between the Rashomon ratio $RR = |R_\epsilon|/K$ and coverage rate ($\rho = -0.53$, $p = 0.003$): as the number of near-optimal models increases, their explanations diversify and coverage falls.
  • Individual datasets displayed extensive variation in the size of the Rashomon set, the width of the confidence bands, and the degree to which single-model explanations failed to account for plausible alternatives.

The following table summarizes top-line metrics (see Table 1 in the original paper):

| Dataset | Best RMSE | $\lvert R_\epsilon \rvert$ | $RR$ | $MWCI$ | $CR$ |
|---|---|---|---|---|---|
| grid_stability | 0.027 | 13 | 0.65 | 0.0003 | 0.85 |
| california_housing | 1.23 | 14 | 0.70 | 8782 | 0.40 |

This evidence demonstrates that single-model explanations can be misleading; the explanation uncertainty induced by model multiplicity is substantial and varies greatly by task.

5. Trustworthiness, Interpretation, and Broader Impact

The Rashomon set methodology provides a principled foundation for quantifying explanation uncertainty and promoting robust, human-centered explainability:

  • Aggregated Rashomon PDPs expose not only where near-optimal models agree, but crucially, where they disagree, thus making the local uncertainty in variable effects explicit.
  • In high-stakes applications (e.g. risk modeling, healthcare), such transparency helps prevent overconfidence in single-model interpretations and supports more cautious, informed decision-making.
  • Empirically, single best-model explanations often fail to reflect 30–70% of plausible model behavior, as indicated by low coverage rates; Rashomon-based explanations, by contrast, reflect the full spectrum of plausible interpretations admitted by the data and model class.
  • The authors advocate extending the Rashomon-type aggregation and uncertainty-quantification to other explanation modalities (e.g. SHAP values, individual conditional expectation plots) and to classification or clustering tasks.

Empirical results rigorously support the assertion that single-model explanations frequently understate interpretive uncertainty, particularly in AutoML workflows, and that Rashomon set aggregation is a practical, generalizable approach to remedy this for trustworthy explainable AI.

6. Limitations and Future Research Directions

While the Rashomon set approach introduces a rigorous mechanism for interpretive uncertainty assessment, several challenges and open directions remain:

  • The methodology, as instantiated, is tied to the model pool output by a given AutoML system; the Rashomon set is only as broad as the candidate models provided.
  • Computational requirements scale with the candidate pool size $K$, the grid size $m$, and the number of features, but can be mitigated through early stopping and pruning.
  • The framework relies on PDPs, which can themselves be biased in the presence of strongly correlated features or extrapolation outside the data support. Extending to other local- and global-explanation forms is a recommended path.
  • The choice of tolerance $\epsilon$ directly governs the Rashomon set size and the informativeness of the aggregated explanations; principled procedures for selecting $\epsilon$ remain to be standardized.
  • Expanding Rashomon aggregation procedures beyond regression, to classification and unsupervised learning, and to domain-specific interpretability constraints, is a promising avenue for future research.

In summary, the Rashomon set formalism establishes a quantitative, model-agnostic basis for incorporating model multiplicity into explanation generation in AutoML, and its adoption is likely to deepen the reliability and transparency of AI explanations, particularly in sensitive application contexts (Cavus et al., 19 Jul 2025).
