Nested Performance Profile Evaluation

Updated 11 March 2026

The nested performance profile is a methodology that evaluates algorithm, software, and hardware performance in hierarchical and multi-dimensional contexts.
It recursively removes top performers to yield unbiased rankings and stable comparative metrics across nested partitions.
The method applies to solver benchmarking, hierarchical timing in code/hardware, and exhaustive cross-validation, offering practical insights for optimization.

A nested performance profile is a general methodology and set of algorithmic tools for evaluating or profiling the performance of algorithms, software, or hardware in hierarchically structured or multi-level contexts. The approach targets cases where performance evaluation is inherently multi-dimensional: either because multiple solvers or predictors are to be ranked in a fair and robust way, or because execution/time profiling must resolve deeply nested regions (such as loops, function calls, or hardware modules). Nested performance profiles generalize and correct standard performance profiling by successively eliminating top performers and extending evaluation over nested partitions or levels, yielding unbiased global rankings and informative metrics for non-best contenders or components (Hekmati et al., 2018, Karademir et al., 2 Mar 2026, Kim et al., 4 Apr 2025, Gauran et al., 2024).

1. Motivation and Scope

Classic performance profiling, such as Dolan–Moré performance profiles, provides an effective way to visualize and summarize how a collection of solvers or algorithms performs across a suite of benchmark problems relative to the “best” in each case. However, these profiles fail to offer a stable or consistent ranking of solvers beyond the top performer: once the current best is removed, the relative ordering of the remaining competitors can change, leading to an incoherent hierarchy and loss of information about the second- and third-best options (Hekmati et al., 2018).

Nested performance profiles address this structural limitation through recursive decomposition: whether in benchmarking, hierarchical function timing, or high-dimensional predictive performance assessment, performance is evaluated in a layered or "peeling" fashion. The methodology has been instantiated formally in software benchmarking, hardware and HPC timing frameworks, and model selection workflows using exhaustive cross-validation.

2. Mathematical Formalism in Algorithmic Benchmarking

Let $S$ denote a set of $n_s$ solvers, $P$ a set of $n_p$ test problems, and $t_{p,s}$ the time taken by solver $s$ on problem $p$. The performance ratio is defined as

$r_{p,s} = \frac{t_{p,s}}{ \min_{s' \in S} t_{p,s'} },$

and the classic profile function for solver $s$ is

$\rho_s(\tau) = \frac{1}{n_p} \left| \left\{ p \in P : r_{p,s} \leq \tau \right\} \right|, \quad \tau \geq 1.$

This measures the fraction of problems on which $s$ is within a factor $\tau$ of the best solver.
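The two definitions above can be sketched in a few lines of Python. This is a minimal illustration, assuming timings are held in a nested dictionary indexed first by problem and then by solver; the function names and the sample timings are hypothetical and are not taken from the cited work.

```python
def performance_ratios(times):
    """Map times[p][s] to r[p][s] = t_{p,s} / min over s' of t_{p,s'}."""
    ratios = {}
    for problem, row in times.items():
        best = min(row.values())
        ratios[problem] = {solver: t / best for solver, t in row.items()}
    return ratios


def profile(times, solver, tau):
    """rho_s(tau): fraction of problems whose ratio for `solver` is <= tau."""
    ratios = performance_ratios(times)
    hits = sum(1 for problem in ratios if ratios[problem][solver] <= tau)
    return hits / len(ratios)


# Hypothetical timings in seconds (illustrative only).
times = {
    "p1": {"A": 1.0, "B": 2.0, "C": 4.0},
    "p2": {"A": 3.0, "B": 1.5, "C": 6.0},
    "p3": {"A": 2.0, "B": 2.0, "C": 2.5},
    "p4": {"A": 1.0, "B": 5.0, "C": 1.2},
}

print(profile(times, "A", 1.0))  # A is (tied) fastest on 3 of 4 problems: 0.75
print(profile(times, "B", 2.0))  # B is within a factor 2 on 3 of 4 problems: 0.75
```

At $\tau = 1$ the profile reports how often a solver is the fastest (ties included), which is why it is a common single-number summary.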

The nested performance profile extends this by running $k$ “waves” of profiling, each time removing the current best solver and recomputing ratios and profiles for the reduced solver set. Formally, at wave $i$, let $S'$ denote the set of still-active solvers, recompute the ratios as above with the minimum taken over $S'$, and store the resulting profile $\rho^{i}_s(\tau)$. The final profile for each solver is given by the averaged curve

$\rho_s^{\mathrm{Overall}}(\tau) = \frac{1}{k} \sum_{i=1}^k \rho^{i}_s(\tau).$

This process continues up to $k = n_s - 1$ to achieve a full ranking. Key theoretical properties established include:

Elimination-insensitivity: the global ranking does not change if the best solver is removed and the procedure rerun.
Stability: perturbations in a small number of problem instances affect overall ranking by at most $1/n_p$ in profile value.
$\mathrm{L}_1$ robustness: if all ratios change by $\epsilon$, then the $\mathrm{L}_1$ distance between original and perturbed profiles is bounded by $\epsilon$.

An illustrative example demonstrates that, for three solvers (A, B, C), ordinary performance profiles can invert the B/C ranking when A is removed, whereas nested profiles yield a stable ordering for all $\tau$ (Hekmati et al., 2018).
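The wave procedure of this section might be sketched as follows. Two choices in the code are assumptions rather than part of the cited method: the "current best" in each wave is taken to be the solver with the largest summed profile value on a fixed grid of $\tau$ values, and a solver's overall curve is averaged only over the waves in which it was still active, since the averaging formula does not say how an eliminated solver contributes to later waves. All names and timings are hypothetical.

```python
def wave_profiles(times, active, taus):
    """Classic profiles on the tau grid, restricted to the solvers in `active`."""
    curves = {s: [] for s in active}
    for tau in taus:
        for s in active:
            hits = 0
            for row in times.values():
                best = min(row[a] for a in active)  # best among active solvers only
                if row[s] / best <= tau:
                    hits += 1
            curves[s].append(hits / len(times))
    return curves


def nested_profiles(times, taus, k=None):
    """Run k waves; return per-wave curves, removal order, and unranked solvers."""
    active = sorted(next(iter(times.values())))
    k = len(active) - 1 if k is None else k
    waves, removed = [], []
    for _ in range(k):
        curves = wave_profiles(times, active, taus)
        waves.append(curves)
        # Assumption: the wave winner has the largest summed profile value.
        best = max(active, key=lambda s: sum(curves[s]))
        removed.append(best)
        active = [s for s in active if s != best]
    return waves, removed, active


def overall_curve(waves, solver):
    """Average a solver's curves over the waves in which it took part (assumption)."""
    own = [w[solver] for w in waves if solver in w]
    return [sum(vals) / len(own) for vals in zip(*own)]


# Hypothetical timings in seconds (illustrative only).
times = {
    "p1": {"A": 1.0, "B": 2.0, "C": 4.0},
    "p2": {"A": 3.0, "B": 1.5, "C": 6.0},
    "p3": {"A": 2.0, "B": 2.0, "C": 2.5},
    "p4": {"A": 1.0, "B": 5.0, "C": 1.2},
}
taus = [1.0, 1.25, 1.5, 2.0, 4.0, 5.0]

waves, removed, remaining = nested_profiles(times, taus)
print(removed + remaining)  # full ranking after n_s - 1 waves
```

With three solvers and the default $k = n_s - 1 = 2$, the removal order followed by the last remaining solver gives the full ranking.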

3. Hierarchical Timing and Profiling in Software and Hardware

Nested profiling at the execution level is essential for understanding and optimizing deeply nested routines or hierarchically organized code or hardware regions. SPACE-Timers (Karademir et al., 2 Mar 2026) exemplifies a stack-based approach in C++ HPC codes:

Each timing scope is represented as a TimerNode in a tree structure, corresponding to hierarchical code regions (from coarse program phases down to fine-grained loops).
Timing metrics are recursively aggregated:
- Inclusive time: $T_{\mathrm{incl}}(i) = \sum_{k=1}^{K_i} (t_{\mathrm{pop},k}^{(i)} - t_{\mathrm{push},k}^{(i)})$.
- Child time: $T_{\mathrm{child}}(i) = \sum_{c \in \mathrm{children}(i)} T_{\mathrm{incl}}(c)$.
- Exclusive time: $T_{\mathrm{excl}}(i) = T_{\mathrm{incl}}(i) - T_{\mathrm{child}}(i)$.
- Overhead as seen by parent $p$: $T_{\mathrm{overhead}}(i) = T_{\mathrm{incl}}(p) - \sum_c T_{\mathrm{incl}}(c)$.
Reports output a tree-structured summary of all timings, sorted and annotated at each nesting level, with “Unaccounted” time reported for completeness.
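The push/pop bookkeeping and the inclusive, child, and exclusive metrics listed above can be illustrated with a small stack-based timer. SPACE-Timers itself is a C++ framework; the Python sketch below only mirrors the idea, and every class, method, and scope name in it is hypothetical rather than part of that library's API. A fake clock is injected so the numbers are deterministic.

```python
import time


class TimerNode:
    """One timing scope in the tree (illustrative, not the SPACE-Timers API)."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = {}
        self.inclusive = 0.0  # sum of (pop - push) over all visits
        self.calls = 0

    def child_time(self):
        return sum(c.inclusive for c in self.children.values())

    def exclusive(self):
        return self.inclusive - self.child_time()


class TimerStack:
    """Tracks the currently open scope and the start time of each open scope."""

    def __init__(self, clock=time.perf_counter):
        self.clock = clock
        self.root = TimerNode("root")
        self.current = self.root
        self._starts = []

    def push(self, name):
        node = self.current.children.get(name)
        if node is None:
            node = TimerNode(name, self.current)
            self.current.children[name] = node
        self.current = node
        self._starts.append(self.clock())

    def pop(self):
        elapsed = self.clock() - self._starts.pop()
        self.current.inclusive += elapsed
        self.current.calls += 1
        self.current = self.current.parent

    def report(self):
        """Indented lines, children sorted by inclusive time at each level."""
        lines = []

        def walk(node, depth):
            ordered = sorted(node.children.values(),
                             key=lambda c: c.inclusive, reverse=True)
            for child in ordered:
                lines.append(f"{'  ' * depth}{child.name}: "
                             f"incl={child.inclusive:.3f} "
                             f"excl={child.exclusive():.3f}")
                walk(child, depth + 1)

        walk(self.root, 0)
        return lines


# Deterministic demo with a fake clock (hypothetical timestamps, in seconds).
ticks = iter([0.0, 1.0, 3.0, 4.0, 6.0, 10.0])
timers = TimerStack(clock=lambda: next(ticks))
timers.push("solve")
timers.push("assemble")
timers.pop()
timers.push("factor")
timers.pop()
timers.pop()
solve = timers.root.children["solve"]
print("\n".join(timers.report()))
```

In this demo the outer scope spans 10 s, its two children account for 4 s, and the remaining 6 s is the outer scope's exclusive time.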

Similar hierarchical instrumentation applies on hardware. RealProbe (Kim et al., 4 Apr 2025) operates at the RTL level for high-level synthesis (HLS) design profiling:

Automated pragma-driven instrumentation propagates through all function and loop boundaries, constructing a tree of control signals mapping source-level regions to synthesized FSM modules.
Each hierarchical module and loop is timestamped on entry/exit, with cycle counts collected, externally offloaded, and reconstructed as a nested performance report spanning the entire call and loop tree.
Extensive DSE routines balance signal selection, queue depth, and resource overhead to ensure practical deployability at all nesting levels.

Both approaches enable precise attribution of performance hotspots, facilitate bottleneck identification, and support cross-level optimization.

4. Nested Performance Estimation via Exhaustive Cross-Validation

In high-dimensional predictive modeling, an analogous principle underlies exhaustive nested cross-validation for model comparison, often referred to as building a “nested performance profile” across all possible data splits (Gauran et al., 2024).

Given $N$ i.i.d. data points $(\mathbf{x}_n, y_n)$ and two modeling algorithms (e.g., intercept-only and ridge-regularized regression), the procedure is as follows:

Outer loop (“assessment”): Enumerate all possible outer test/train splits of the data (e.g., all leave-one-out or leave-two-out subsets). Each outer fold provides an assessment error for each model.
Inner loop (“tuning”): For each outer train set, enumerate all possible inner train/test splits to tune regularization or model hyperparameters, with efficient closed-form formulas (especially for ridge regression).
Performance summary: For each outer split $\ell$, record the difference $D_\ell = T^{(0)}_\ell - T^{(1)}_\ell$ in predictive error, compute the sample mean $\overline{D}$ and its estimated standard error $S_D$, and assemble paired $t$-statistics or confidence intervals.
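The performance-summary step can be sketched as below, given one assessment error per outer split for each model. The standard error here is the naive one, which treats the $D_\ell$ as independent; overlapping splits are not independent, so this is an assumption made for illustration, and the cited work may use a different estimator. The function name and the sample errors are hypothetical.

```python
import math
import statistics


def paired_summary(err_model0, err_model1):
    """Return (mean difference, naive standard error, t-statistic)."""
    diffs = [a - b for a, b in zip(err_model0, err_model1)]
    d_bar = statistics.fmean(diffs)
    # Assumption: naive standard error that ignores dependence between splits.
    se = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return d_bar, se, d_bar / se


# Hypothetical per-split errors for the baseline (0) and the candidate (1).
errors0 = [1.0, 2.0, 3.0, 4.0]
errors1 = [0.5, 1.0, 2.0, 3.5]
d_bar, se, t_stat = paired_summary(errors0, errors1)
print(d_bar, se, t_stat)
```

A positive mean difference indicates that the second model has the lower error on average under this sign convention.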

Key technical results:

Closed-form expressions using the hat matrix $H(\lambda)$ and residual vectors $r(\lambda)$ allow efficient, tractable computation of all split-based CV errors for linear models.
Simulation studies confirm that full or hybrid (e.g., LOOCV plus L2OOCV) exhaustive nested CV controls Type I error and maximizes statistical power for difference-in-error testing in high-dimensional regimes.
In RNA-Seq datasets, this approach was shown to produce robust, reproducible, and interpretable inference about the incremental predictive value of different molecular signatures (Gauran et al., 2024).
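The hat-matrix shortcut in the first result above can be illustrated in the simplest setting: ridge regression with a single feature, no intercept, and a fixed penalty $\lambda$. There the diagonal of $H(\lambda)$ is $h_{ii} = x_i^2 / (\sum_j x_j^2 + \lambda)$, and the standard leave-one-out identity $e_i = r_i / (1 - h_{ii})$ gives every held-out residual from one fit. The sketch compares it against explicit refitting; the function names are hypothetical, and the leave-two-out and nested formulas of the cited work are more general and are not reproduced here.

```python
def ridge_loo_closed_form(x, y, lam):
    """Leave-one-out residuals from a single fit via the hat-matrix diagonal."""
    sxx = sum(v * v for v in x)
    sxy = sum(a * b for a, b in zip(x, y))
    beta = sxy / (sxx + lam)  # one-feature ridge coefficient
    residuals = []
    for xi, yi in zip(x, y):
        h_ii = xi * xi / (sxx + lam)  # diagonal entry of H(lambda)
        residuals.append((yi - xi * beta) / (1.0 - h_ii))
    return residuals


def ridge_loo_brute_force(x, y, lam):
    """Reference version: refit with each point held out in turn."""
    residuals = []
    for i in range(len(x)):
        xs = x[:i] + x[i + 1:]
        ys = y[:i] + y[i + 1:]
        beta = sum(a * b for a, b in zip(xs, ys)) / (sum(v * v for v in xs) + lam)
        residuals.append(y[i] - x[i] * beta)
    return residuals
```

The closed form needs one pass over the data instead of one refit per held-out point, which is the kind of saving that makes exhaustive enumeration tractable for linear models.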

5. Algorithmic and Practical Implementation

The operational workflow for constructing a nested performance profile typically involves:

For Benchmarking Solvers (Hekmati et al., 2018):

For $k$ desired ranks, recursively remove the top solver and recompute performance profiles on the remaining set.
Aggregate each solver’s performance over all $k$ waves to produce overall nested performance curves.
Rank solvers by the area under their overall curve, or by their value at $\tau = 1$ or another threshold of practical interest.
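The final ranking step might look like the following, assuming each solver's overall curve has already been sampled on a common grid of $\tau$ values. Profiles are step functions, so the trapezoidal rule used here is only an approximation of the area; the function names and sample curves are hypothetical.

```python
def area_under_curve(taus, values):
    """Trapezoidal approximation of the area under a sampled profile curve."""
    return sum(
        (taus[i + 1] - taus[i]) * (values[i] + values[i + 1]) / 2.0
        for i in range(len(taus) - 1)
    )


def rank_solvers(taus, curves):
    """Solver names ordered from largest to smallest area under the curve."""
    return sorted(curves, key=lambda s: area_under_curve(taus, curves[s]),
                  reverse=True)


# Hypothetical overall curves sampled at tau = 1, 2, 4.
taus = [1.0, 2.0, 4.0]
curves = {
    "A": [0.75, 1.0, 1.0],
    "B": [0.5, 0.75, 1.0],
    "C": [0.0, 0.5, 1.0],
}
print(rank_solvers(taus, curves))
```

Ranking by the value at a single threshold instead amounts to sorting on one entry of each curve, for example the first entry when the grid starts at $\tau = 1$.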

For Hierarchical Timing (Karademir et al., 2 Mar 2026, Kim et al., 4 Apr 2025):

Instrument code or hardware at all desired control points, capturing entry and exit to nested regions.
Maintain runtime data structures (e.g., stack of timer nodes or mapped control signals).
Recursively aggregate inclusive, exclusive, and overhead times or cycle counts post-execution.
Output structured tree-based reports for analysis and optimization guidance.

For Exhaustive Nested CV (Gauran et al., 2024):

Enumerate all possible training/test splits for both outer (assessment) and inner (tuning) CV layers.
Use closed-form algebra (for ridge regression) to avoid full retraining in every fold.
Compute performance gaps, paired $t$-statistics, and confidence intervals.
Interpret results with robust control of error rates and statistical power.

The computational cost of all three approaches grows with the number of nested partitions or regions, but closed-form identities and recursive aggregation keep otherwise brute-force profiling feasible in practice.

6. Impact, Recommended Practice, and Limitations

Relative to flat or single-level methods, nested performance profiling improves the fairness of rankings, the attribution of cost across hierarchy levels, and statistical reproducibility.

Key advantages include: