Time-Partitioned Benchmarking Paradigm
- The Time-Partitioned Benchmarking Paradigm is a framework that segments time to isolate evaluation phases, ensuring reproducible and interpretable metrics.
- It is applied in real-time systems, time-series forecasting, and metaheuristic optimization to prevent cross-partition interference and information leakage.
- The paradigm uses strict scheduling protocols, defined temporal splits, and uniform wall-clock budgets to deliver fair and actionable performance assessments.
The Time-Partitioned Benchmarking Paradigm is a foundational framework for evaluating systems or algorithms under temporally segmented constraints, with rigorous protocols ensuring measurement fidelity, comparability, and isolation. It has prominent instantiations in real-time partitioned systems for avionics and aerospace (notably via SFPBench and ARINC-653), time-series forecasting with foundation models, and metaheuristic optimization under fixed wall-clock budgets. The paradigm enforces well-defined temporal or sequential data splits—or equal wall-clock execution intervals—upon which all measurement and comparison are based. This approach mitigates cross-partition interference, leakage, or confounds, and yields interpretable, reproducible performance metrics relevant to the underlying temporal properties of the systems under test.
1. Formal Definitions and Domain-Specific Instantiations
The time-partitioned benchmarking paradigm is anchored in rigorous, repeatable partitioning of time (or time-indexed data) to ensure that evaluation, training, and operation are temporally isolated.
- Partitioned Real-Time Systems: In ARINC-653-based platforms, time is partitioned into a cyclically repeating major time frame, with individual minor time frames $\tau_i$ within it exclusively allocated to distinct partitions (Magalhaes et al., 2020). Performance and schedulability analysis are performed per partition, ensuring temporal and spatial isolation.
- Time Series Forecasting: Time-indexed datasets are split at cut indices $t_1 < t_2$: training on $\{1,\dots,t_1\}$, optional validation on $\{t_1+1,\dots,t_2\}$, and evaluation on $\{t_2+1,\dots,T\}$. Benchmarks are specified as tuples of dataset, split indices, and evaluation metric, with strict prohibition on test-period leakage into model selection or training (Meyer et al., 15 Oct 2025); a minimal split construction is sketched after this list.
- Metaheuristics/Optimization: Every algorithm is allocated a strict wall-clock budget $T$ per problem instance, within which it may execute any sequence of restarts, early terminations, or adaptivity. All comparisons are indexed to this interval $[0, T]$, rendering per-evaluation computational effort transparent and fair (Lian, 10 Sep 2025).
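For the forecasting instantiation above, the following minimal sketch (Python; the `temporal_split` helper and its cut-point arguments are illustrative, not taken from any cited benchmark) shows how disjoint train/validation/test index windows can be derived from two cut-points and checked for leakage:

```python
from typing import Tuple

def temporal_split(n_steps: int, t1: int, t2: int) -> Tuple[range, range, range]:
    """Split indices 0..n_steps-1 at cut-points t1 < t2 into disjoint
    train / validation / test ranges (hypothetical helper)."""
    assert 0 < t1 < t2 < n_steps, "cut-points must be strictly ordered"
    train = range(0, t1)        # observations available for fitting
    val = range(t1, t2)         # optional model-selection window
    test = range(t2, n_steps)   # held-out evaluation window
    # Core invariant of the paradigm: the windows are pairwise disjoint,
    # so no test-period information can leak into training or tuning.
    assert set(train).isdisjoint(val) and set(val).isdisjoint(test) and set(train).isdisjoint(test)
    return train, val, test

train_idx, val_idx, test_idx = temporal_split(n_steps=1000, t1=700, t2=850)
```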
2. Scheduling, Isolation, and Fairness Properties
A central goal is to ensure that measurements in one temporal segment are not contaminated by operations or data from another.
- Real-Time Systems: Scheduling is two-level:
- Global scheduler: Strictly cyclic allocation of CPU time slices to partitions.
- Local scheduler: Fixed-priority preemptive scheduling within a partition’s minor frame.
- Strict temporal isolation: At partition switch boundaries, the RTOS may flush/invalidate caches, save and restore core contexts, and mask interrupts.
- Formal schedulability criterion (utilization test): $\sum_i C_i / T_i \le 1$, where $C_i$ is the WCET bound and $T_i$ the period of task $i$ (Magalhaes et al., 2020); a simplified check is sketched after this list.
- Time Series Benchmarks: No elements from the test set may be used for training, model selection, or hyperparameter tuning. Careful enforcement prevents information leakage, global pattern memorization, and dataset reuse artifacts (Meyer et al., 15 Oct 2025).
- Metaheuristics: Time-fair benchmarking (restart-fairness) gives all solvers the same wall-clock resource $T$ and applies no limit on the number of runs, seeds, or internal adaptation per solver, as long as the overall elapsed-time constraint is honored (Lian, 10 Sep 2025).
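As a minimal sketch of the utilization-style criterion above, assuming per-task WCET bounds and periods are already known; the `Task` and `partition_schedulable` names are hypothetical, and the `cpu_share` parameter is an added assumption accounting for the partition's fraction of the major time frame (real ARINC-653 analyses are more involved):

```python
from dataclasses import dataclass

@dataclass
class Task:
    wcet: float    # C_i: measured/analysed worst-case execution time (seconds)
    period: float  # T_i: activation period (seconds)

def partition_schedulable(tasks: list[Task], cpu_share: float = 1.0) -> bool:
    """Simplified utilization test: sum(C_i / T_i) must not exceed the
    CPU share the partition receives within the major time frame."""
    utilization = sum(t.wcet / t.period for t in tasks)
    return utilization <= cpu_share

# Example: a partition holding 40% of the major time frame.
tasks = [Task(wcet=0.002, period=0.020), Task(wcet=0.001, period=0.010)]
print(partition_schedulable(tasks, cpu_share=0.4))  # True: utilization 0.2 <= 0.4
```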
3. Protocols, Workflows, and Measurement Methodologies
Protocols in time-partitioned benchmarking are precise, supporting per-partition or per-window isolation, structured measurement, and data integrity.
| Domain | Partitioning Unit | Enforced Isolation |
|---|---|---|
| Real-Time | Minor time frame ($\tau_i$) | Temporal & spatial (ARINC-653) |
| Time Series | Index cut-points ($t_1, t_2$) | Disjoint train/val/test |
| Metaheuristics | Wall-clock budget ($T$) | Equal time budget, no overlap |
- SFPBench Workflow:
  1. Configure the partitions (partition periods and window durations in the ARINC-653 configuration table) and assign benchmark applications per partition.
  2. At partition start, initialize timing variables via the macro abstraction layer (DECLARE_TIME_MEASURE, START_TIME_MEASURE, END_TIME_MEASURE).
  3. For $N$ iterations, measure each target code region or primitive.
  4. Aggregate and validate timing data; store it via OS trace/logging facilities.
  5. Employ a trusted hardware tracer for cross-validation (2% error rate on BCET/WCET) (Magalhaes et al., 2020).
- Time-Series Benchmarks:
- Use explicit, immutable split indices; ensure no train/test/validation overlap.
- Recommended: rolling-origin evaluation (cross-validation by windows), truly future out-of-sample testing (all pre-training data up to a fixed cut-off date), and domain-level splits (masking entire sources for zero-shot transfer).
- Public documentation of splits, benchmarks, and model provenance is required for transparency and reproducibility (Meyer et al., 15 Oct 2025).
- Metaheuristic Protocol:
- Algorithmic recipe: for each problem instance $p$ and solver $s$, repeat within the wall-clock budget $T$ (see the sketch after this list):
- Start from a fresh seed/configuration; run the algorithm until success, stagnation, or the budget is exhausted.
- Track and aggregate anytime performance, success rates, ERT, and profiles.
- No restriction on restarts, adaptivity, or internal heuristics other than total time (Lian, 10 Sep 2025).
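A minimal sketch of the restart-fair protocol, assuming a hypothetical `run_once(seed)` solver interface: the loop restarts from fresh seeds under a fixed wall-clock budget and records the best-so-far value for anytime analysis.

```python
import random
import time

def time_fair_benchmark(run_once, budget_s: float):
    """Run a solver repeatedly under a fixed wall-clock budget (restart-fair).

    run_once(seed) is a hypothetical callable performing one independent run
    (until its own success/stagnation criterion) and returning the best
    objective value found.  Restarts are unlimited; only elapsed time counts."""
    start = time.monotonic()
    best = float("inf")
    history = []   # (elapsed seconds, best-so-far) pairs: the anytime curve f*(t)
    restarts = 0
    while time.monotonic() - start < budget_s:
        value = run_once(seed=random.randrange(2**32))
        best = min(best, value)
        history.append((time.monotonic() - start, best))
        restarts += 1
    return best, restarts, history
```

A complete protocol would also pass the remaining budget into `run_once` so that a single long run cannot overshoot the interval, and would repeat the whole loop over independent trials for confidence intervals.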
4. Performance Metrics and Statistical Analysis
Performance in time-partitioned paradigms is evaluated through metrics grounded in time segmentation or fair time allocation.
- Real-Time Metrics (SFPBench):
- Best-Case Execution Time (BCET): $\min_{k=1,\dots,N} t_k$ over the $N$ measured iterations.
- Worst-Case Execution Time (WCET): $\max_{k=1,\dots,N} t_k$.
- Average: $\bar{t} = \frac{1}{N}\sum_{k=1}^{N} t_k$; Std-Dev: $\sigma = \sqrt{\frac{1}{N}\sum_{k=1}^{N}(t_k - \bar{t})^2}$.
- Fine-grained: context-switch times, partition-switch, synchronization primitives, APEX-API latency, full application WCET, partition-interference (Magalhaes et al., 2020).
- Time-Series Metrics:
- Rolling RMSE and MAPE accumulated over all forecast origins and horizons, e.g. $\mathrm{RMSE} = \sqrt{\tfrac{1}{n}\sum_{t}(y_t - \hat{y}_t)^2}$ and $\mathrm{MAPE} = \tfrac{100}{n}\sum_{t}\bigl|\tfrac{y_t - \hat{y}_t}{y_t}\bigr|$, with the sums taken over all origin-horizon pairs.
- Sensitivity to split design and forecast horizon is scrutinized; statistical robustness is bolstered by rolling evaluation (Meyer et al., 15 Oct 2025).
- Metaheuristic Optimization:
- Anytime performance curves $f^{*}(t)$: best-so-far objective value at each elapsed time $t \in [0, T]$.
- Expected Running Time (ERT) to a target: $\mathrm{ERT} = \dfrac{\sum_{r} t_r}{\#\,\text{successful runs}}$, i.e., total time spent across all runs divided by the number of runs reaching the target, or $\infty$ otherwise.
- Time-based performance profiles $\rho_s(\tau)$: fraction of instances solver $s$ solves within $\tau$ times the best time achieved on each instance.
- Repeated independent trials for confidence intervals and robust reporting (Lian, 10 Sep 2025).
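The two aggregate metrics above can be computed from recorded run data roughly as follows; this is a sketch with hypothetical data layouts, not the reference implementation of (Lian, 10 Sep 2025):

```python
import math

def expected_running_time(run_times: list[float], successes: list[bool]) -> float:
    """ERT to a target: total time spent across all runs divided by the number
    of successful runs; infinite if no run reached the target."""
    n_success = sum(successes)
    return sum(run_times) / n_success if n_success else math.inf

def performance_profile(times: dict[str, list[float]], tau: float) -> dict[str, float]:
    """rho_s(tau): fraction of instances solver s finishes within tau times the
    best solver's time on that instance (times[s][i] = wall-clock time of
    solver s on instance i; math.inf marks a failure)."""
    solvers = list(times)
    n_instances = len(next(iter(times.values())))
    solved = {s: 0 for s in solvers}
    for i in range(n_instances):
        best = min(times[s][i] for s in solvers)
        for s in solvers:
            if math.isfinite(best) and times[s][i] <= tau * best:
                solved[s] += 1
    return {s: count / n_instances for s, count in solved.items()}
```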
5. Illustrative Case Studies and Quantitative Results
Concrete deployments reveal the diagnostic and optimization implications of the paradigm.
- SFPBench on the P2020RDB-PC with an ARINC-653 RTOS:
- NXP QorIQ P2020 (dual-core Power Architecture e500, 1.2 GHz).
- Process-switch time: BCET = 1.50 μs, WCET = 17.20 μs.
- Partition-switch time: BCET = 22.16 μs, WCET = 40.74 μs.
- Full-application (Sobel) WCET: 5.25 ms; ADPCM algorithm: 391 μs (Magalhaes et al., 2020).
- Time-Series Foundation Models:
- Evaluation pitfalls documented for Monash “Australian Electricity Demand” and ETTh1 datasets, where pre-training/test split confusion led to inflated model performance via leakage (Meyer et al., 15 Oct 2025).
- Metaheuristics (PSO example):
- Under a wall-clock budget of $T = 50$ s, standard PSO completing five restarts (10 s each) can outperform a more computationally burdensome variant able to run only once within the same 50 s, demonstrating the practical tradeoff made visible by the time-partitioned (wall-clock) protocol (Lian, 10 Sep 2025).
6. Best Practices, Pitfalls, and Reporting Guidelines
- Best Practices:
- Enforce strict temporal/data disjointness between train, validation, and test partitions; publish explicit split indices and partition membership.
- Employ rolling and domain cross-validation to assess both temporal and domain generalization (Meyer et al., 15 Oct 2025); a rolling-origin sketch follows this list.
- Use time-based, not only count-based, budgets for metaheuristics; report number of restarts and solution quality via anytime curves and ERT (Lian, 10 Sep 2025).
- Maintain user-level measurement with minimal system intrusion (e.g., via macro abstraction layers for real-time systems).
- Validate measurements with trusted external methods (e.g., a hardware tracer with a 2% error rate) and integrate measured WCET values directly into schedulability analysis (Magalhaes et al., 2020).
- Pitfalls:
- Overlapping data windows or non-disjoint splits induce information leakage.
- Unreported computational costs and asymmetries in overheads generate unfair comparisons.
- Spurious pattern memorization across partition boundaries in time series modeling can yield misleading gains (Meyer et al., 15 Oct 2025).
- Reporting Guidelines:
- Budget specification (wall-clock time, FE/iteration limits), restart policies, target and metric definitions, complete environment disclosure, tuning/reporting overhead accounting, and open artifact provision are all prescribed (Lian, 10 Sep 2025).
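As referenced from the best-practices list, rolling-origin evaluation can be sketched as below; `fit_and_forecast` is a hypothetical model interface (not a specific library API), and the RMSE accumulation follows the rolling definition from Section 4:

```python
import math

def rolling_origin_rmse(series: list[float], fit_and_forecast,
                        first_origin: int, horizon: int, step: int = 1) -> float:
    """Rolling-origin evaluation: for each origin t >= first_origin, fit on
    series[:t], forecast the next `horizon` points, and accumulate squared
    errors.  fit_and_forecast(history, horizon) is a hypothetical callable
    returning `horizon` forecasts; only strictly past data is passed to it."""
    sq_errors = []
    for origin in range(first_origin, len(series) - horizon + 1, step):
        history = series[:origin]                  # data up to the origin only
        forecasts = fit_and_forecast(history, horizon)
        actuals = series[origin:origin + horizon]
        sq_errors.extend((a - f) ** 2 for a, f in zip(actuals, forecasts))
    return math.sqrt(sum(sq_errors) / len(sq_errors))
```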
7. Extensibility, Scalability, and Cross-Domain Relevance
The time-partitioned benchmarking paradigm is extensible and adaptable across domains:
- New “grey-box” test cases (e.g., memory-allocation, interrupt latency) can be added in real-time frameworks by bracketing code regions with timing macros (Magalhaes et al., 2020); a minimal analogue is sketched after this list.
- Spatiotemporal and domain-level splits generalize the paradigm for multi-source or panel data scenarios in time series (Meyer et al., 15 Oct 2025).
- Scalability is facilitated by future-live benchmarking (forecasting each new day as its data arrives) and by the ability to incorporate new solver strategies, adaptive restarts, or hybrid evaluation criteria (Lian, 10 Sep 2025).
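Outside an RTOS, the “bracket a code region with timing macros” pattern from the list above can be mimicked with an ordinary performance-counter bracket; the sketch below is only a Python analogue for illustration, whereas the actual SFPBench abstraction layer relies on the C macros and OS trace facilities described in Section 3.

```python
import time
from contextlib import contextmanager

@contextmanager
def time_measure(samples: list):
    """Bracket a code region and record its elapsed time in seconds, loosely
    mirroring the START_TIME_MEASURE / END_TIME_MEASURE bracketing pattern."""
    start = time.perf_counter_ns()
    try:
        yield
    finally:
        samples.append((time.perf_counter_ns() - start) / 1e9)

samples: list = []
for _ in range(1000):                          # repeat the target region N times
    with time_measure(samples):
        sum(i * i for i in range(10_000))      # stand-in for the measured region

bcet, wcet = min(samples), max(samples)        # observed best- and worst-case times
mean = sum(samples) / len(samples)
```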
By integrating static temporal partitioning, wall-clock equality, and rigorous statistical methods, the time-partitioned benchmarking paradigm provides the infrastructure for reproducible, interpretable, and domain-agnostic system evaluation.