DeepCAVE: AutoML & 3D Interpretability

Updated 8 December 2025
  • DeepCAVE is a suite of interactive frameworks that enhances transparency by converting complex AutoML and HPO processes into human-interpretable visualizations.
  • Its modular Python architecture, dynamic converters, and plugin-based dashboard enable real-time analysis, debugging, and statistical exploration of trial data.
  • In 3D classification, DeepCAVE employs concept-based frameworks to transform high-dimensional neural features into actionable insights, boosting model robustness and explainability.

DeepCAVE is a suite of interactive frameworks and methodologies designed to provide transparency, interpretability, and rigorous analysis for complex automated machine learning (AutoML) and hyperparameter optimization (HPO) processes, as well as for interpretable 3D neural object classifiers. The term “DeepCAVE” most commonly refers to browser-based tools and analysis engines for visualizing, debugging, and comparing large-scale HPO runs (Sass et al., 2022, Segel et al., 1 Dec 2025), but in recent literature it has also been used to describe a 3D concept-based framework (CAVE) for interpretable and robust neural classification via neural object volumes (Pham et al., 17 Mar 2025). The commonality lies in the ambition to turn opaque, high-dimensional model evaluation and optimization datasets into human-interpretable narratives, actionable insights, or explanations.

1. Architectural Principles and System Overview

The core of DeepCAVE as an AutoML/HPO analysis tool is a fully modular Python codebase centered on a unified “Run” abstraction. Optimizer output (from SMAC3, DEHB, BOHB, Auto-Sklearn, Auto-PyTorch, Optuna, Ray Tune, AMLTK, or generic CSV/Parquet sources) is monitored or ingested via converter modules. These converters continuously watch disk state or respond to direct API calls, translating all relevant trial metadata—configuration vectors, budgets, random seeds, objective vectors, and trial statuses—into standardized Run instances (Sass et al., 2022, Segel et al., 1 Dec 2025).
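
As an illustration of that abstraction (the class and field names below are hypothetical and do not mirror DeepCAVE's actual API), a converter's task reduces to mapping optimizer-specific logs onto a uniform trial record:

from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class Trial:
    # One evaluated configuration at a given budget and seed (illustrative only).
    config: Dict[str, object]      # hyperparameter name -> value
    budget: float                  # fidelity, e.g. epochs or subsample fraction
    seed: int
    objectives: List[float]        # e.g. [validation error, runtime]
    status: str = "SUCCESS"        # SUCCESS | CRASHED | TIMEOUT | ...

@dataclass
class Run:
    # Standardized container that a converter produces from optimizer output.
    name: str
    objectives: List[str]
    trials: List[Trial] = field(default_factory=list)

    def get_trials(self, budget: Optional[float] = None) -> List[Trial]:
        # Return all trials, optionally restricted to a single budget.
        if budget is None:
            return self.trials
        return [t for t in self.trials if t.budget == budget]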

The analysis engine performs all heavy computational tasks in a background worker queue (with hash-based caching), while a web-based frontend (Plotly/Dash stack) exposes an interactive dashboard whose plugins ("cards") provide exploratory, diagnostic, and statistical visualizations. A highly extensible plugin API enables development of new input, filter, and output modules, facilitating rapid adaptation to new AutoML backends or novel visualization needs.
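
A minimal sketch of the hash-based caching idea (the function names, in-memory store, and worker handling below are assumptions for illustration, not DeepCAVE's actual implementation): the cache key is derived from the selected run, plugin, and its inputs, so repeated dashboard interactions reuse earlier results.

import hashlib
import json

_cache = {}  # illustrative in-memory store; a persistent cache would be used in practice

def cache_key(run_id, plugin_name, inputs):
    # Stable hash over run identity, plugin name, and input/filter settings.
    payload = json.dumps({"run": run_id, "plugin": plugin_name, "inputs": inputs}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def compute_cached(run_id, plugin_name, inputs, worker_fn):
    # Run the expensive plugin computation only when no cached result exists.
    key = cache_key(run_id, plugin_name, inputs)
    if key not in _cache:
        _cache[key] = worker_fn()  # in the real system this is submitted to a background worker
    return _cache[key]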

2. Mathematical and Analytical Foundations

DeepCAVE integrates foundational concepts from hyperparameter optimization and AutoML. The configuration space is $\mathcal{X} = X_1 \times \dots \times X_d$ for (possibly mixed) hyperparameters, and $f: \mathcal{X} \rightarrow \mathbb{R}^m$ denotes the multi-objective function (e.g., accuracy, runtime, resource usage). Standard HPO seeks $x^* = \arg\min_{x\in\mathcal{X}} f(x)$. For $m > 1$, non-dominated (Pareto-optimal) configurations are visualized to facilitate multi-objective trade-off analysis (Segel et al., 1 Dec 2025).
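
For example, assuming every objective is to be minimized, the non-dominated set can be extracted with a simple pairwise dominance check (a generic sketch, not DeepCAVE code):

import numpy as np

def pareto_mask(objectives):
    # objectives: array of shape (n, m); every objective is minimized.
    n = objectives.shape[0]
    mask = np.ones(n, dtype=bool)
    for i in range(n):
        # Configuration j dominates i if it is <= in all objectives and < in at least one.
        dominated = np.all(objectives <= objectives[i], axis=1) & np.any(objectives < objectives[i], axis=1)
        mask[i] = not dominated.any()
    return mask

# Toy example with two objectives (error, runtime) for five configurations:
points = np.array([[0.10, 30.0], [0.12, 10.0], [0.08, 60.0], [0.12, 35.0], [0.11, 40.0]])
print(pareto_mask(points))  # [ True  True  True False False]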

When Bayesian optimization is used, the analysis can surface surrogate models $\hat f_n$ and acquisition functions such as Expected Improvement,

$$\alpha_{EI}(x) = \mathbb{E}_{Y\sim\hat{f}_n(x)}\bigl[\max(f_{best} - Y, 0)\bigr].$$

Budget-based and multi-fidelity optimization (e.g., via Hyperband or BOHB) is represented by including budget/fidelity axes in trial histories. Key statistical methods include fANOVA for global hyperparameter importances,

$$I_p = \frac{\mathrm{Var}_{\lambda_p}\left[\mathbb{E}_{\lambda_{\neg p}}[C(\lambda)]\right]}{\mathrm{Var}[C(\lambda)]},$$

and local permutation or ablation-based analyses for fine-grained performance attribution (Sass et al., 2022, Segel et al., 1 Dec 2025).
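
A minimal Monte-Carlo sketch of the variance ratio above (assuming an already-fitted surrogate cost_model that predicts C(λ) for an array of configurations; this illustrates the quantity, not the fANOVA implementation DeepCAVE uses):

import numpy as np

def importance(cost_model, samples, p):
    # Estimate I_p = Var_{λ_p}[ E_{λ_¬p}[C(λ)] ] / Var[C(λ)] from sampled configurations.
    total_var = np.var(cost_model(samples))
    marginal_means = []
    for value in samples[:, p]:
        fixed = samples.copy()
        fixed[:, p] = value              # fix λ_p, marginalize over the remaining dimensions
        marginal_means.append(cost_model(fixed).mean())
    return np.var(marginal_means) / total_var

# Toy surrogate: the cost depends strongly on dimension 0 and weakly on dimension 1.
rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 2))
toy_cost = lambda Z: 5.0 * Z[:, 0] + 0.2 * Z[:, 1]
print(importance(toy_cost, X, p=0), importance(toy_cost, X, p=1))  # dimension 0 carries almost all variance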

3. Dashboard Functionality and Plugin-Based Visualization

The dashboard is plugin-based, organizing analysis modules into discrete, user-configurable cards:

  • Overview and Configuration Inspection: Aggregates meta-information (optimizer, search space, objectives), with tables for per-trial results, budgets, and statuses (success, crashed, timeout, etc.).
  • Exploration Footprint: Applies dimensionality reduction (e.g., non-metric MDS) to visualize sampled configurations, revealing exploration bias, clustering, or coverage gaps (see the sketch following this list).
  • Convergence and Time-Series Analysis: Plots best observed objective versus wall-clock time, differentiating optimization efficiency across runs or parameterizations.
  • Multi-Objective Pareto Fronts: Renders non-dominated solutions in two-objective projections, highlighting trade-offs and enabling direct selection of configurations for detailed breakdowns.
  • Hyperparameter Interactions: Includes parallel coordinate plots, partial dependence plots, and symbolic regression explanations, all linked to trial selection and filter settings.
  • Parameter Importances and Ablations: Global importances (fANOVA), localized importances, and sequential ablation paths are available for both holistic and granular insight.
  • Budget/Fidelity Correlation: Heatmaps and scatterplots of performance at different budgets allow evaluation of fidelity scheduling and resource allocation rationality (Sass et al., 2022, Segel et al., 1 Dec 2025).
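
As a hedged sketch of the exploration-footprint idea referenced above (using scikit-learn's non-metric MDS on a precomputed distance matrix; the plain Euclidean distance and parameter handling are simplifications, not DeepCAVE's pipeline):

import numpy as np
from sklearn.manifold import MDS

def footprint(configs):
    # configs: (n, d) array of numerically encoded configurations; a Gower-style
    # distance would be appropriate for mixed categorical/continuous spaces.
    diffs = configs[:, None, :] - configs[None, :, :]
    distances = np.sqrt((diffs ** 2).sum(axis=-1))
    mds = MDS(n_components=2, metric=False, dissimilarity="precomputed", random_state=0)
    return mds.fit_transform(distances)   # (n, 2) coordinates for plotting

coords = footprint(np.random.default_rng(0).uniform(size=(50, 8)))
print(coords.shape)  # (50, 2)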

Common interactions include hover/click callbacks for tracebacks, exporting code snippets to rerun any configuration, and responsive updates to all analytics as filter controls are applied.

4. Workflow, Dataflow, and Extensibility

The typical workflow proceeds as follows: users point DeepCAVE to one or more optimizer run directories or register runs via the Python API; converters monitor or import trial data into memory; analysis plugins operate in real time, driven by user selection and filters; outputs (text, tables, plots) are rendered interactively in the browser. All modules are extensible via an API oriented around subclassing and plugin registration—new visualizations or import converters can be added and deployed without modification of the core codebase (Sass et al., 2022, Segel et al., 1 Dec 2025).

Illustrative pseudocode for custom output plugins and converters:

from deepcave.plugin import BaseOutputPlugin
from deepcave.converter import BaseConverter


class MyScatterPlugin(BaseOutputPlugin):
    # Output plugin that renders the first two objectives as a 2D scatter plot.
    name = "My 2D Scatter"

    def output(self, run, filter_values):
        # Restrict trials to the budgets currently selected in the dashboard filters.
        trials = run.get_trials(filter_values.budgets)
        xs = [t.objectives[0] for t in trials]
        ys = [t.objectives[1] for t in trials]
        # Return a Plotly-compatible figure dictionary for the dashboard to render.
        return dict(data=[dict(x=xs, y=ys, mode="markers")])


class MyAutoMLConverter(BaseConverter):
    # Converter that ingests a custom optimizer's logs as a standardized Run.
    name = "MyAutoML"

    def can_handle(self, path):
        # Inspect the path to decide whether this converter applies to the run format.
        ...

    def convert(self, path):
        # Parse the log and build a run instance from its metadata, configurations
        # (Lambda), costs (C), budgets (B), and trial history.
        return my_run_instance

5. Analytical Methods and Case Study Results

Core analytical methods supported include:

  • Trial Outcome Summaries: Proportions of successful, crashed, or timed-out evaluations.
  • Exploration Footprint (MDS): Projects trial configurations to $\mathbb{R}^2$ using distance metrics appropriate to hyperparameter type.
  • Pareto and Budget Correlations: Calculates Pearson correlations $\rho(b_i,b_j) = \mathrm{Corr}[C(\lambda,b_i), C(\lambda,b_j)]$ across budgets (see the sketch following this list).
  • Hyperparameter Importances: Both global (fANOVA) and local (LPI) analyses.
  • Particular question-answer workflows: e.g., in an outlier detection example (pendigits, 15% contamination, 39 hyperparameters including AE, VAE, DASVDD, DAGMM), DeepCAVE enabled diagnosis of trial failures (“96.66% successful, 3.24% crashed”), exploration uniformity, efficiency of budget selection, trade-off optima, and critical hyperparameter identification (learning rate, batch size, model choice). MDS and importance plots guided further configuration space exploration and pruning (Sass et al., 2022).
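
A minimal sketch of the budget-correlation computation mentioned above (assuming a cost matrix with one row per configuration and one column per budget; an illustration, not DeepCAVE's plugin code):

import numpy as np

def budget_correlations(costs):
    # costs: (n_configs, n_budgets); entry [k, i] is C(λ_k, b_i).
    # Returns the matrix of pairwise Pearson correlations ρ(b_i, b_j).
    return np.corrcoef(costs, rowvar=False)

# Toy example: higher budgets are increasingly noisy copies of the low-budget costs.
rng = np.random.default_rng(0)
low = rng.uniform(size=200)
costs = np.column_stack([low,
                         low + 0.05 * rng.normal(size=200),
                         low + 0.20 * rng.normal(size=200)])
print(np.round(budget_correlations(costs), 2))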

6. DeepCAVE in Interpretable 3D Neural Classification

In 3D neural classification, the CAVE framework ("Concept Aware Volumes for Explanations"), sometimes colloquially referred to as "DeepCAVE", extends neural object volume (NOVUM)-based classifiers to enhance both robustness and interpretability (Pham et al., 17 Mar 2025). NOVUM represents each class as a fixed cuboid mesh surface with $K$ 3D Gaussian "probes," each associated with a feature vector. High-dimensional surface features $G_y \in \mathbb{R}^{K\times C}$ are dictionary-compressed post hoc into a much smaller set of $D$ "concept vectors" ($H_y\in\mathbb{R}^{D\times C}$, $D\ll K$), typically via K-Means clustering.
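
A hedged sketch of this compression step (assuming the K Gaussian-probe features of a class are available as a K×C array and using scikit-learn's KMeans; the normalization and the choice of D are illustrative, not the paper's released code):

import numpy as np
from sklearn.cluster import KMeans

def compress_to_concepts(G_y, D=16):
    # Cluster K probe features (K x C) into D concept vectors H_y (D x C).
    kmeans = KMeans(n_clusters=D, n_init=10, random_state=0).fit(G_y)
    H_y = kmeans.cluster_centers_
    # L2-normalize so that later matching reduces to dot products.
    return H_y / np.linalg.norm(H_y, axis=1, keepdims=True)

G_y = np.random.default_rng(0).normal(size=(1024, 128))  # K = 1024 probes, C = 128 feature dims
H_y = compress_to_concepts(G_y, D=16)
print(H_y.shape)  # (16, 128)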

Classification is then performed via a bag-of-words style matching of 2D image features against these 3D-aware concepts:

$$s_y = \sum_{t=1}^D c_y^{(t)}, \quad \text{with } c_y^{(t)} = \sum_{i: f_i \to h_y^{(t)}} \bigl(f_i \cdot h_y^{(t)}\bigr).$$

The framework achieves highly competitive OOD and occlusion robustness (e.g., 96.9% at 20–40% occlusion, 81.4% on OOD-CV), and state-of-the-art part-level interpretability (e.g., Part IoU 0.152, Local Coverage 0.259, Global Coverage–object 0.838), compared to direct 2D concept baselines. Ablations confirm advantages from the 3D feature geometry and from clustering in NOVUM's space rather than using 2D activations (Pham et al., 17 Mar 2025).
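
As a sketch of this scoring rule (assuming each 2D image feature f_i is assigned to its most similar concept of class y; the assignment criterion and normalization are assumptions, not the paper's implementation):

import numpy as np

def class_score(F, H_y):
    # F: (N, C) image features f_i; H_y: (D, C) concept vectors of class y.
    # Each feature is matched to one concept; s_y sums the corresponding dot products.
    sims = F @ H_y.T                          # (N, D) dot products f_i · h_y^(t)
    best = sims.argmax(axis=1)                # concept index each feature is assigned to
    return float(sims[np.arange(len(F)), best].sum())

def classify(F, concepts_per_class):
    # Predict the class whose concept set best explains the image features.
    return max(concepts_per_class, key=lambda y: class_score(F, concepts_per_class[y]))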

7. Impact, Limitations, and Research Significance

DeepCAVE, in both its dashboard and 3D classification incarnations, addresses longstanding deficiencies in transparency and interpretability for high-dimensional AutoML and neural models. It enables direct diagnosis of optimization inefficiencies, run failures, search space exploitation/exploration balance, budget allocations, and sensitivity to hyperparameters—fostering trust and actionable iteration in both academic research and industrial machine learning workflows (Sass et al., 2022, Segel et al., 1 Dec 2025). In the concept-aware neural framework, DeepCAVE bridges robustness and post hoc explainability, supporting concept bottlenecks with formal geometric and statistical grounding (Pham et al., 17 Mar 2025).

Identified limitations include dependency on optimizer logging detail, requirement for object-pose annotations and geometry proxies in the 3D setting, and no strong theoretical guarantees on the interpretability of concept assignments beyond empirical metrics. Nonetheless, DeepCAVE's modularity and statistical scope have made it broadly adopted for next-generation AutoML and explainable AI research.
