Rashomon Sets: Evaluating Model Multiplicity
- Rashomon sets are defined as collections of models that meet a specified performance tolerance relative to the optimal model, quantifying model multiplicity in machine learning.
- They enable aggregation of explanations—such as partial dependence profiles—to capture interpretive uncertainty and identify variability in feature effects.
- Empirical studies demonstrate that using Rashomon sets in AutoML exposes significant discrepancies in single-model explanations, cautioning against reliance on any single model's interpretation.
A Rashomon set is the collection of all models within a given hypothesis class whose predictive performance is nearly indistinguishable from the optimal model under a specified loss function and tolerance. This concept formalizes the empirical observation, originally articulated by Breiman, that many equally accurate but structurally distinct models can exist for a single dataset, and these models can yield divergent—but equally valid—interpretations or explanations. The Rashomon set has become foundational in research on interpretability, uncertainty quantification, fairness, and robustness in automated machine learning (AutoML) and explainable AI.
1. Formal Definition and Foundational Properties
Let $\mathcal{F}$ be a hypothesis class and $L : \mathcal{F} \to \mathbb{R}$ a loss (or risk) functional on models $f \in \mathcal{F}$. The optimal model is defined as
$$f^* = \arg\min_{f \in \mathcal{F}} L(f).$$
Given a user-specified tolerance $\varepsilon > 0$, the $\varepsilon$-Rashomon set is
$$R(\varepsilon) = \{\, f \in \mathcal{F} : L(f) \le L(f^*) + \varepsilon \,\}.$$
For multiplicative metrics,
$$R(\varepsilon) = \{\, f \in \mathcal{F} : M(f) \le (1 + \varepsilon)\, M(f^*) \,\},$$
where $M$ denotes a performance metric (e.g. mean squared error) and $f^*$ the best model.
Practical selection of $\varepsilon$ is driven by domain-specific error tolerance (e.g., $\varepsilon = 0.05$ for a 5% performance gap) or through inspection of the trade-off between model multiplicity and incumbent risk.
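As a minimal sketch of both membership criteria, the following Python snippet selects the $\varepsilon$-Rashomon set from an array of hypothetical validation losses:

```python
import numpy as np

# Hypothetical validation losses for six candidate models.
losses = np.array([0.102, 0.104, 0.110, 0.131, 0.098, 0.150])
best = losses.min()  # L(f*) = 0.098
eps = 0.05

# Additive criterion: L(f) <= L(f*) + epsilon.
additive = np.flatnonzero(losses <= best + eps)

# Multiplicative criterion: M(f) <= (1 + epsilon) * M(f*), i.e. a 5% gap.
multiplicative = np.flatnonzero(losses <= (1 + eps) * best)

print(additive)        # [0 1 2 3 4] -- an additive slack of 0.05 is generous here
print(multiplicative)  # [0 4]       -- only models within 5% of the best loss
```

The two criteria can yield very different sets for the same $\varepsilon$, which is why the multiplicative (relative-gap) form is the natural default when loss scales vary across tasks.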
2. Rashomon Set Identification Algorithms in AutoML
AutoML systems, such as H2O AutoML, output a candidate model set $\mathcal{C} = \{f_1, \dots, f_K\}$. The Rashomon set extraction proceeds as follows (a code sketch appears after this list):
- Evaluate $M(f_k)$ for each $f_k \in \mathcal{C}$ on validation/test data.
- Determine $M(f^*) = \min_k M(f_k)$.
- Select $R(\varepsilon) = \{\, f_k \in \mathcal{C} : M(f_k) \le (1 + \varepsilon)\, M(f^*) \,\}$.
- Optionally, prune candidates early if partial validation loss exceeds the Rashomon threshold.
Computational cost is $O(Kn)$ for validation over $K$ models and $n$ data points. Once $R(\varepsilon)$ is known, subsequent computations (e.g. partial dependence profiles) scale as $O(|R(\varepsilon)| \cdot G \cdot n)$, $G$ being the evaluation grid size per feature.
Incremental or early stopping strategies allow efficient pruning of large candidate pools.
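A minimal sketch of the extraction step, assuming the leaderboard has been exported to a pandas DataFrame with `model_id` and `rmse` columns (H2O's leaderboard can be converted to such a frame; the data below are hypothetical):

```python
import pandas as pd

def rashomon_set(leaderboard: pd.DataFrame, metric: str = "rmse",
                 eps: float = 0.05) -> pd.DataFrame:
    """Select candidates within a (1 + eps) multiplicative gap of the best score.

    Assumes lower is better for `metric` (true for RMSE/MSE).
    """
    best = leaderboard[metric].min()
    return leaderboard[leaderboard[metric] <= (1 + eps) * best]

# Hypothetical leaderboard from an AutoML run.
lb = pd.DataFrame({
    "model_id": ["gbm_1", "rf_1", "glm_1", "dl_1"],
    "rmse":     [0.027,   0.028,  0.031,   0.045],
})

print(rashomon_set(lb, eps=0.05))  # gbm_1 and rf_1 fall within the 5% gap
```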
3. Aggregation and Quantification of Explanation Uncertainty
To capture interpretive variability, the Rashomon set approach aggregates explanations across all near-optimal models. For partial dependence profiles (PDPs), the procedure is as follows (a code sketch follows the list):
- For feature $j$, a grid of values $z_1, \dots, z_G$, and each model $f \in R(\varepsilon)$, compute
$$\widehat{\mathrm{PD}}_f(z_g) = \frac{1}{n} \sum_{i=1}^{n} f(z_g, \mathbf{x}_{i,-j}).$$
- Aggregate via uniform averaging:
$$\overline{\mathrm{PD}}(z_g) = \frac{1}{|R(\varepsilon)|} \sum_{f \in R(\varepsilon)} \widehat{\mathrm{PD}}_f(z_g).$$
- Quantify uncertainty with bootstrap confidence intervals $[\ell(z_g), u(z_g)]$: resample the model profiles with replacement $B$ times, recompute the average at each grid point, and take the empirical $\alpha/2$ and $1 - \alpha/2$ quantiles.
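The following NumPy sketch implements this aggregation, assuming each model exposes a scikit-learn-style `predict(X)` method and `X` is a feature matrix (all names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def pdp(model, X, j, grid):
    """Partial dependence of `model` on feature j: the mean prediction
    with column j clamped to each grid value in turn."""
    profile = np.empty(len(grid))
    for g, z in enumerate(grid):
        Xz = X.copy()
        Xz[:, j] = z
        profile[g] = model.predict(Xz).mean()
    return profile

def rashomon_pdp(models, X, j, grid, n_boot=1000, alpha=0.05):
    """Uniform-average PDP over the Rashomon set, with bootstrap bands
    obtained by resampling models with replacement."""
    profiles = np.stack([pdp(m, X, j, grid) for m in models])  # (|R|, G)
    mean_profile = profiles.mean(axis=0)
    boot_means = np.stack([
        profiles[rng.integers(0, len(models), len(models))].mean(axis=0)
        for _ in range(n_boot)
    ])
    lower = np.quantile(boot_means, alpha / 2, axis=0)
    upper = np.quantile(boot_means, 1 - alpha / 2, axis=0)
    return mean_profile, lower, upper
```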
Two metrics characterize the agreement between the best-model PDP and the aggregated Rashomon PDP (a code sketch follows their definitions):
- Coverage Rate (CR): the fraction of grid points at which the best model's PDP lies inside the bootstrap band,
$$\mathrm{CR} = \frac{1}{G} \sum_{g=1}^{G} \mathbb{1}\left[\, \widehat{\mathrm{PD}}_{f^*}(z_g) \in [\ell(z_g), u(z_g)] \,\right].$$
- Mean Width of Confidence Interval (MWCI):
$$\mathrm{MWCI} = \frac{1}{G} \sum_{g=1}^{G} \bigl( u(z_g) - \ell(z_g) \bigr).$$
A low CR or a high MWCI signals high epistemic uncertainty about the feature's effect.
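Given the bands from the previous sketch, both metrics reduce to a few lines (continuing the same hypothetical names):

```python
import numpy as np

def coverage_rate(best_profile, lower, upper):
    """Fraction of grid points where the best model's PDP
    falls inside the bootstrap confidence band."""
    inside = (best_profile >= lower) & (best_profile <= upper)
    return inside.mean()

def mean_width_ci(lower, upper):
    """Average width of the confidence band across the grid."""
    return (upper - lower).mean()

# Usage, continuing the earlier sketch:
# mean_prof, lo, hi = rashomon_pdp(models, X, j, grid)
# best_prof = pdp(best_model, X, j, grid)
# print(coverage_rate(best_prof, lo, hi), mean_width_ci(lo, hi))
```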
4. Empirical Evidence and Observed Patterns
Empirical evaluation was conducted over 35 regression tasks from the OpenML-CTR23 benchmark using H2O AutoML (up to 20 models, 360 s runtime budget, and a fixed tolerance $\varepsilon$). The principal findings include:
- In a notable fraction of datasets, the Rashomon band covered less than 70% of the best model's PDP, i.e., $\mathrm{CR} < 0.7$ for a typical feature.
- MWCI spanned several orders of magnitude (e.g., $0.0003$ on grid_stability versus $8782$ on california_housing), demonstrating substantial heterogeneity in explanation uncertainty across datasets.
- A strong negative correlation between the Rashomon ratio (the fraction of candidates falling in $R(\varepsilon)$) and the coverage rate: as the number of near-optimal models increases, their explanations diversify and coverage falls.
- Individual datasets displayed extensive variation in the size of Rashomon sets, the width of the confidence bands, and the degree to which single-model explanations failed to account for plausible alternative explanations.
The following table summarizes top-line metrics (see Table 1 in the original paper):
| Dataset | Best RMSE | $\lvert R(\varepsilon) \rvert$ | Rashomon ratio | MWCI | CR |
|---|---|---|---|---|---|
| grid_stability | $0.027$ | $13$ | $0.65$ | $0.0003$ | $0.85$ |
| california_housing | $1.23$ | $14$ | $0.70$ | $8782$ | $0.40$ |
This evidence demonstrates that single-model explanations can be misleading; the explanation uncertainty induced by model multiplicity is substantial and varies greatly by task.
5. Trustworthiness, Interpretation, and Broader Impact
The Rashomon set methodology provides a principled foundation for quantifying explanation uncertainty and promoting robust, human-centered explainability:
- Aggregated Rashomon PDPs expose not only where near-optimal models agree, but crucially, where they disagree, thus making the local uncertainty in variable effects explicit.
- In high-stakes applications (e.g. risk modeling, healthcare), such transparency helps prevent overconfidence in single-model interpretations and supports more cautious, informed decision-making.
- Observationally, standard single-best explanations often fail to reflect 30–70% of plausible model behavior, as indicated by low coverage rates. In contrast, Rashomon-based explanations expose the spectrum of plausible interpretations supported by the data and the candidate model class.
- The authors advocate extending the Rashomon-type aggregation and uncertainty-quantification to other explanation modalities (e.g. SHAP values, individual conditional expectation plots) and to classification or clustering tasks.
Empirical results rigorously support the assertion that single-model explanations frequently understate interpretive uncertainty, particularly in AutoML workflows, and that Rashomon set aggregation is a practical, generalizable approach to remedy this for trustworthy explainable AI.
6. Limitations and Future Research Directions
While the Rashomon set approach introduces a rigorous mechanism for interpretive uncertainty assessment, several challenges and open directions remain:
- The methodology, as instantiated, is tied to the model pool output by a given AutoML system; the Rashomon set is only as broad as the candidate models provided.
- Computational requirements scale with the candidate pool size $K$, the grid size $G$, and the number of features, but can be mitigated through early stopping and pruning.
- The framework relies on PDPs, which can themselves be biased in the presence of strongly correlated features or extrapolation outside the data support. Extending to other local- and global-explanation forms is a recommended path.
- The choice of tolerance $\varepsilon$ directly governs the Rashomon set size and the informativeness of the aggregated explanations; principled procedures for selecting $\varepsilon$ remain to be standardized.
- Expanding Rashomon aggregation procedures beyond regression, to classification and unsupervised learning, and to domain-specific interpretability constraints, is a promising avenue for future research.
In summary, the Rashomon set formalism establishes a quantitative, model-agnostic basis for incorporating model multiplicity into explanation generation in AutoML, and its adoption is likely to deepen the reliability and transparency of AI explanations, particularly in sensitive application contexts (Cavus et al., 19 Jul 2025).