Runtime-Aware Empirical Risk Minimization

Updated 17 May 2026

Runtime-aware ERM is an approach that integrates statistical accuracy with quantifiable runtime measures to optimize resource usage in machine learning.
It leverages adaptive sample sizing, truncated Hessian inversion, and streaming algorithms to achieve near-optimal risk at minimal computational cost.
These strategies enable scalable, efficient solutions by balancing trade-offs between statistical precision and computation in large-scale learning.

Runtime-aware empirical risk minimization (ERM) refers to algorithmic and theoretical frameworks that jointly consider statistical generalization objectives and explicit computational cost in the solution of ERM problems. The focus is on designing ERM procedures that achieve near-optimal risk guarantees with minimal and quantifiable runtime—often matching the statistical accuracy at the lowest computational expense permitted by worst-case complexity theory or data-dependent instance structure. This area encompasses adaptive optimization methods, data selection and coreset approaches, and streaming/online algorithms that provably balance excess risk, sample usage, and wall-clock efficiency.

1. Empirical Risk Minimization: Statistical Accuracy and Runtime Objectives

ERM seeks a parameter vector that minimizes the expected risk $L(w) = \mathbb{E}_z[f(w;z)]$ over a parameter space, given only a finite sample $z_1,\dots,z_n$. The standard surrogate is the empirical risk $L_n(w) = \frac{1}{n}\sum_{i=1}^n f(w;z_i)$. The statistical accuracy of ERM is typically captured by a uniform deviation bound of the form

$\sup_w \big|L(w) - L_n(w)\big| \leq V_n,$

with $V_n = O(n^{-1/2})$ or $O(n^{-1})$ depending on problem regularity. In runtime-aware ERM, the goal is an estimator $\hat{w}_n$ with $L(\hat{w}_n) - L(w^*) \lesssim V_n$, where $w^*$ minimizes $L$, obtained with as few computational resources as possible. Resources are measured, for example, in gradient/Hessian evaluations, data passes, time complexity, or memory (Eisen et al., 2017; Frostig et al., 2014).

Strong convexity is often imposed by a vanishing Tikhonov regularization term (of order $V_n$ ) to ensure a well-conditioned empirical problem:

$R_n(w) = L_n(w) + \frac{c V_n}{2} \|w\|^2,$

making $R_n$ $cV_n$-strongly convex whenever $L_n$ is convex. The runtime-aware goal is to return $\hat{w}_n$ such that $R_n(\hat{w}_n) - \min_w R_n(w) \leq V_n$, that is, to solve to statistical accuracy but not significantly beyond, avoiding over-computation relative to the data's inherent uncertainty (Eisen et al., 2017).
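A minimal sketch of "solve to statistical accuracy, then stop" for the regularized risk $R_n$ is given here. It assumes a logistic loss, $V_n = n^{-1/2}$, $c = 1$, and plain gradient descent; none of these choices come from the cited papers, and the function names are hypothetical. The stopping rule uses the standard consequence of $\mu$-strong convexity, $R_n(w) - \min R_n \leq \|\nabla R_n(w)\|^2 / (2\mu)$, with $\mu = cV_n$.

```python
import numpy as np

def regularized_risk_grad(w, X, y, reg):
    """Logistic empirical risk plus (reg/2)*||w||^2; labels y are in {-1, +1}."""
    margins = y * (X @ w)
    risk = np.mean(np.logaddexp(0.0, -margins)) + 0.5 * reg * (w @ w)
    sig_neg = 0.5 * (1.0 - np.tanh(0.5 * margins))  # sigmoid(-margin), stable form
    grad = -(X.T @ (y * sig_neg)) / len(y) + reg * w
    return risk, grad

def solve_to_statistical_accuracy(X, y, c=1.0, max_iter=10_000):
    n, d = X.shape
    V_n = 1.0 / np.sqrt(n)   # assumed accuracy level V_n = n^{-1/2}
    mu = c * V_n             # strong-convexity constant of R_n
    # Logistic curvature is at most 1/4, so the Hessian norm is bounded by
    # 0.25 * max_i ||x_i||^2 + mu; its inverse is a safe step size.
    smooth = 0.25 * np.max(np.sum(X * X, axis=1)) + mu
    w = np.zeros(d)
    for it in range(max_iter):
        _, g = regularized_risk_grad(w, X, y, mu)
        # Certificate: ||g||^2 <= 2*mu*V_n implies R_n(w) - min R_n <= V_n.
        if g @ g <= 2.0 * mu * V_n:
            return w, it, V_n
        w = w - g / smooth
    return w, max_iter, V_n

# Synthetic data (hypothetical): 500 samples, 5 features, noisy linear labels.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 5))
y = np.where(X @ np.ones(5) + 0.5 * rng.standard_normal(500) >= 0, 1.0, -1.0)
w_hat, iters, V_n = solve_to_statistical_accuracy(X, y)
```

The loop exits as soon as the gradient certificate holds, so no iterations are spent driving the optimization error below the level that the sample itself can resolve.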

2. Adaptive and Truncated Second-Order Methods

In large-scale ERM regimes, exact second-order (Newton) solvers are often infeasible because inverting a dense $p \times p$ Hessian costs $O(p^3)$, where $p$ is the parameter dimension. Runtime-aware solutions employ two key strategies:

Adaptive sample size: Start ERM with a small data subset, solve to its statistical accuracy, then geometrically increase the sample size (from $m$ to $\alpha m$ for some $\alpha > 1$, for example doubling). Each larger problem is warm-started from the previous solution, tracking the statistical accuracy required at each scale.
Truncated Hessian inversion: Instead of inverting the full Hessian, approximate it with a truncated eigenvalue decomposition, yielding a low-rank plus diagonal structure. When only the $k$ dominant eigenvalues are retained, inverting this structure is substantially cheaper than the cubic-in-dimension cost of a full dense inversion.
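The low-rank plus diagonal inverse can be sketched as follows. This is an illustration under stated assumptions, not the exact rule from Eisen et al.: the tail of the spectrum is replaced by a single constant equal to the $(k+1)$-th eigenvalue, the helper names are hypothetical, and a full `numpy.linalg.eigh` call is used for clarity where a practical implementation would compute only the top $k$ eigenpairs.

```python
import numpy as np

def top_k_eigenpairs(H, k):
    """Top-k eigenpairs of a symmetric matrix, plus the constant used for the tail."""
    evals, evecs = np.linalg.eigh(H)          # eigh returns ascending eigenvalues
    order = np.argsort(evals)[::-1]           # reorder to descending
    evals, evecs = evals[order], evecs[:, order]
    tail = evals[k] if k < len(evals) else evals[-1]   # assumed tail constant
    return evals[:k], evecs[:, :k], tail

def apply_truncated_inverse(g, lam_k, V_k, tail):
    """Apply the inverse of H_k = V_k diag(lam_k) V_k^T + tail * (I - V_k V_k^T) to g."""
    coeff = V_k.T @ g                         # coordinates of g in the top-k subspace
    return V_k @ (coeff / lam_k) + (g - V_k @ coeff) / tail

# Synthetic Hessian (hypothetical) with a fast-decaying spectrum.
rng = np.random.default_rng(0)
p, k = 50, 5
Q, _ = np.linalg.qr(rng.standard_normal((p, p)))
spectrum = 1.0 / (1.0 + np.arange(p)) ** 2 + 1e-3
H = (Q * spectrum) @ Q.T                      # Q diag(spectrum) Q^T
g = rng.standard_normal(p)

lam_k, V_k, tail = top_k_eigenpairs(H, k)
step = apply_truncated_inverse(g, lam_k, V_k, tail)
```

Once the $k$ eigenpairs are available, each application of the truncated inverse needs only two products with the $p \times k$ matrix `V_k`, so the dense inverse is never formed.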

The k-Truncated Adaptive Newton (k-TAN) method (Eisen et al., 2017) formalizes this process. At each stage, the algorithm:

Constructs the sample-specific gradient and Hessian.
Performs eigendecomposition to identify the $k$ large-eigenvalue ("signal") directions.
Truncates the Hessian at level $k$ and inverts efficiently.
Takes a single Newton-like step using this truncated inverse.
Verifies statistical accuracy; otherwise, adapts the truncation threshold and repeats.

This staged approach ensures that each subproblem is solved only to within the sample's statistical accuracy. Because the sample size grows geometrically, the number of stages is logarithmic in the ratio of the full sample size to the initial one, and the method reaches full-sample statistical accuracy after approximately two effective data passes in total (Eisen et al., 2017).
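The staged procedure can be sketched as below. This is a simplified illustration, not the k-TAN algorithm as published: it assumes a logistic loss, $V_m = m^{-1/2}$, sample doubling, a tail constant equal to the $(k+1)$-th eigenvalue, and a fixed $k$ rather than an adapted threshold. The step-halving safeguard is an addition for this sketch only, and all function names are hypothetical.

```python
import numpy as np

def risk_grad_hess(w, X, y, reg):
    """Regularized logistic risk, gradient, and Hessian; labels in {-1, +1}."""
    m = y * (X @ w)
    s = 0.5 * (1.0 - np.tanh(0.5 * m))                  # sigmoid(-margin)
    risk = np.mean(np.logaddexp(0.0, -m)) + 0.5 * reg * (w @ w)
    grad = -(X.T @ (y * s)) / len(y) + reg * w
    hess = (X.T * (s * (1.0 - s))) @ X / len(y) + reg * np.eye(len(w))
    return risk, grad, hess

def truncated_newton_direction(grad, hess, k):
    """Newton direction using the top-k eigenpairs and a constant tail."""
    evals, evecs = np.linalg.eigh(hess)                 # ascending order
    lam, V = evals[-k:], evecs[:, -k:]
    tail = evals[-k - 1] if k < len(evals) else evals[0]
    coeff = V.T @ grad
    return V @ (coeff / lam) + (grad - V @ coeff) / tail

def staged_truncated_newton(X, y, m0=64, k=5, c=1.0):
    n, d = X.shape
    w, m, history = np.zeros(d), min(m0, n), []
    while True:
        reg = c / np.sqrt(m)                            # c * V_m with V_m = m^{-1/2} (assumed)
        Xm, ym = X[:m], y[:m]
        before, grad, hess = risk_grad_hess(w, Xm, ym, reg)
        step = truncated_newton_direction(grad, hess, k)
        t, after = 1.0, before
        for _ in range(20):                             # safeguard: halve until risk does not increase
            trial = risk_grad_hess(w - t * step, Xm, ym, reg)[0]
            if trial <= before:
                w, after = w - t * step, trial
                break
            t *= 0.5
        history.append((m, before, after))
        if m == n:
            return w, history
        m = min(2 * m, n)                               # geometric growth, warm start kept in w

# Synthetic data (hypothetical) with decaying feature scales.
rng = np.random.default_rng(0)
X = rng.standard_normal((4096, 10)) / (1.0 + np.arange(10))
y = np.where(X @ np.ones(10) + 0.5 * rng.standard_normal(4096) >= 0, 1.0, -1.0)
w_hat, history = staged_truncated_newton(X, y)
```

With doubling, the per-stage sample sizes sum to less than twice the full sample, which is the accounting behind the "approximately two effective data passes" figure.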

3. Streaming and Single-Pass Stochastic Algorithms

Linear-time, single-pass algorithms such as Streaming SVRG (Frostig et al., 2014) provide explicit runtime-aware guarantees, achieving ERM-level excess risk rates of order $1/n$ in $O(n)$ time with memory linear in the size of a single sample. The Streaming SVRG algorithm:

Alternates between anchor-gradient computation on mini-batches and variance-reduced SGD-style updates.
Sequentially updates the stage parameters, leveraging fresh i.i.d. samples for both components.
Delivers an excess risk guarantee at the ERM rate, with the dependence on the initial error controlled by the algorithm's fixed parameters, including the batch-growth rate.

The approach matches classical ERM, with the contribution of the initial error decaying super-polynomially in the number of samples $n$, and benefits from parallelizable anchor gradient steps. Such streaming methods are runtime-optimal under strong convexity and smoothness assumptions (Frostig et al., 2014).
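A simplified single-pass sketch of the alternation described above follows, for streaming least squares. It is an illustration under assumptions rather than the algorithm as specified by Frostig et al.: the inner loop has a fixed length, and the step size, initial batch size, and growth factor are arbitrary choices. `sample_stream` is a hypothetical data source, and every sample is used exactly once.

```python
import numpy as np

def sample_stream(rng, w_star, noise=0.1):
    """Hypothetical i.i.d. stream: yields (x, y) with y = <x, w_star> + noise."""
    while True:
        x = rng.standard_normal(w_star.shape[0])
        yield x, x @ w_star + noise * rng.standard_normal()

def grad(w, x, y):
    """Gradient of the per-sample squared loss 0.5 * (<x, w> - y)^2."""
    return (x @ w - y) * x

def streaming_svrg(stream, dim, stages=6, k0=50, growth=2, inner_steps=200, eta=0.02):
    w_anchor = np.zeros(dim)
    used = 0
    for s in range(stages):
        k_s = k0 * growth ** s                  # anchor batch grows geometrically
        mu = np.zeros(dim)
        for _ in range(k_s):                    # anchor gradient from fresh samples
            x, y = next(stream)
            mu += grad(w_anchor, x, y)
        mu /= k_s
        w = w_anchor.copy()
        for _ in range(inner_steps):            # variance-reduced updates on fresh samples
            x, y = next(stream)
            w -= eta * (grad(w, x, y) - grad(w_anchor, x, y) + mu)
        w_anchor = w                            # the next stage anchors at the latest iterate
        used += k_s + inner_steps
    return w_anchor, used

rng = np.random.default_rng(0)
w_star = np.ones(5)
w_hat, samples_used = streaming_svrg(sample_stream(rng, w_star), dim=5)
initial_err = np.linalg.norm(w_star)            # the run starts from the zero vector
final_err = np.linalg.norm(w_hat - w_star)
```

Only the anchor, the current iterate, and the running anchor gradient are stored, so memory stays proportional to the dimension of one sample regardless of how many samples are consumed.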

4. Data Selection and Coreset Principles

A complementary runtime-aware paradigm is sample selection: subsampling or weighting the data before running ERM, aiming for the smallest subset size that preserves the generalization performance attained by the full dataset. Theoretical results specify minimax (or near-minimax) sample requirements for several canonical ERMs (Hanneke et al., 2025):

Mean estimation: a prescribed number of selected points suffices to incur at most a fixed multiplicative loss inflation.
Linear regression: a prescribed number of points suffices to retain the optimal loss, and points chosen by volume sampling incur at most a fixed multiplicative factor (Hanneke et al., 2025).
Linear classification: a sufficiently large selection yields zero error in realizable settings, while an insufficient selection can do no better than a nonzero error level.
General stochastic convex optimization (SCO): strict convexity, together with a sufficient selection size, ensures no loss inflation.

Algorithmically, efficient Carathéodory-type methods, Steinitz-style gradient selection, and volume sampling provide practical, runtime-aware procedures to downsample, enabling orders-of-magnitude training speedups for expensive ERMs when the selected subset is much smaller than the full dataset (Hanneke et al., 2025). The preprocessing cost is justified when ERM itself scales superlinearly in the sample size or the dimension.
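As an illustration of the Carathéodory idea for mean estimation, the sketch below reduces $n$ uniformly weighted points in $\mathbb{R}^d$ to at most $d+1$ weighted points with exactly the same weighted mean. Since the weighted squared-loss ERM for mean estimation is the weighted mean, the reduced set gives the same minimizer. This is the classical reduction written for clarity, not the efficient procedure of Hanneke et al.; the function name is hypothetical, and the repeated SVD makes it slow for large $n$.

```python
import numpy as np

def caratheodory_subset(P):
    """Return (indices, weights): at most d+1 points whose weighted mean equals P.mean(0)."""
    n, d = P.shape
    w = np.full(n, 1.0 / n)
    active = np.arange(n)
    while len(active) > d + 1:
        # Rows are the d coordinates plus a row of ones, so A @ v = 0 means that
        # moving the weights along v changes neither the weighted sum nor the total weight.
        A = np.vstack([P[active].T, np.ones(len(active))])
        v = np.linalg.svd(A)[2][-1]             # null-space vector (A has more columns than rows)
        pos = np.flatnonzero(v > 0)
        ratios = w[active][pos] / v[pos]
        j = ratios.argmin()                     # first weight to reach zero
        new_w = np.clip(w[active] - ratios[j] * v, 0.0, None)
        new_w[pos[j]] = 0.0                     # force the exact zero that rounding may miss
        w[active] = new_w
        active = active[new_w > 0]              # drop points whose weight vanished
    return active, w[active]

rng = np.random.default_rng(0)
P = rng.standard_normal((200, 3))
idx, weights = caratheodory_subset(P)
subset_mean = weights @ P[idx]
```

Each pass removes at least one point while keeping the weights nonnegative and summing to one, so the loop ends after at most $n - d - 1$ passes.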

5. Fine-Grained Lower Bounds and Complexity Barriers

Fine-grained complexity analysis establishes lower bounds on the runtime required to reach specific statistical targets in kernel and neural network ERMs (Backurs et al., 2017):

Under the Strong Exponential Time Hypothesis (SETH), exact or high-accuracy solutions for kernel SVMs and kernel ridge regression (specifically, $(1+\varepsilon)$-multiplicative approximations with very small $\varepsilon$) require essentially quadratic time, $n^{2-o(1)}$, for $n$ data points.
Gradient computation for multilayer networks requires time essentially proportional to the product of the number of weights and the number of samples.
These results imply that subquadratic algorithms for general-case kernel SVM/KRR to high precision are unlikely unless strong distributional structure is exploited.
Consequently, runtime-aware practitioners are driven toward:
- Stochastic or streaming schemes whose cost scales polynomially in the inverse target error, rather than aiming for high precision.
- Low-rank kernel approximations (e.g., Nyström, random Fourier features) to trade accuracy for computational cost.
- Data-dependent techniques (coresets, data selection) when exploitable structure is present (Backurs et al., 2017).

This delineates the “no free lunch” boundary for runtime-aware ERM, motivating approximate or instance-structured methods in large-scale machine learning.
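The accuracy-for-cost trade made by low-rank kernel approximations can be seen in a short random Fourier features sketch for the Gaussian (RBF) kernel $k(x, x') = \exp(-\gamma \|x - x'\|^2)$. The construction is the standard one (frequencies drawn from a Gaussian with variance $2\gamma$ per coordinate, uniform phases); the function names, data, and the choice of 4000 features are assumptions for illustration.

```python
import numpy as np

def rbf_kernel(X, gamma):
    """Exact Gaussian kernel matrix, costing O(n^2 d) time and O(n^2) memory."""
    sq = np.sum(X * X, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def random_fourier_features(X, num_features, gamma, rng):
    """Feature map Z with Z @ Z.T approximating the RBF kernel matrix."""
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, num_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=num_features)
    return np.sqrt(2.0 / num_features) * np.cos(X @ W + b)

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
gamma = 0.5
Z = random_fourier_features(X, 4000, gamma, rng)
max_err = np.max(np.abs(Z @ Z.T - rbf_kernel(X, gamma)))
```

A linear model trained on `Z` touches each sample once per feature, so the cost grows linearly in $n$; the price is an approximation error in the kernel that shrinks only as more random features are added.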

6. Trade-offs, Assumptions, and Practical Implications

Runtime-aware ERM involves a fundamental balance between statistical precision, computational tractability, and data complexity. Notable trade-offs and considerations include:

Precision vs. runtime: Solving only to the statistical accuracy $V_n$ prevents over-computation and enables large step sizes or aggressive computational shortcuts (e.g., Hessian truncation, coreset reduction). Aggressive eigenvalue truncation in second-order methods leads to significant savings when the Hessian's spectrum decays sufficiently fast (Eisen et al., 2017).
Assumptions: Most runtime-aware guarantees require convexity, strong convexity (possibly via adaptive regularization), self-concordance, and bounded gradient-difference or Lipschitz-type conditions (Eisen et al., 2017; Frostig et al., 2014). Some coreset constructions require strict convexity or realizability; in their absence, multiplicative loss inflation is unavoidable (Hanneke et al., 2025).
Limitations: The worst-case dimension-dependent or spectrum-structure-dependent costs can render otherwise efficient algorithms no better than naive batch approaches. For non-decaying Hessian spectra, truncated second-order updates lose their computational advantage.
Empirical results: Runtime-aware methods such as k-TAN and Streaming SVRG outperform standard stochastic or full-batch Newton-type methods in both wall-clock and sample complexity at large scale, particularly when the aforementioned structure is present (Eisen et al., 2017, Frostig et al., 2014).

7. Open Directions and Frontier Challenges

Open research directions in runtime-aware ERM include:

Sharper bounds for data selection: Refining worst-case ratios for unweighted linear regression in the regimes left open by current results; pinning down intermediate classification regret rates between the known extremes (Hanneke et al., 2025).
Additive vs. multiplicative guarantees: Developing fine-grained additive excess risk bounds under smoothness and convexity conditions.
Relaxed continuity and nonconvex analysis: Determining whether “mildly discontinuous” ERM rules can break lower-bound barriers; extending coreset and single-pass results to nonconvex objectives such as deep networks.
Instance-optimal and data-centric ERM: Integrating runtime-aware data selection with distributed, streaming, and active learning pipelines for modern large-scale and distributed systems.
Empirical validation: Large-scale experiments to measure real-world speed-accuracy trade-offs of modern runtime-aware ERM techniques.

By unifying theoretical complexity, algorithmic design, and statistical learning perspectives, runtime-aware ERM continues to shape the landscape of scalable and efficient machine learning practice.


References

Eisen et al. (2017). Large Scale Empirical Risk Minimization via Truncated Adaptive Newton Method.

Frostig et al. (2014). Competing with the Empirical Risk Minimizer in a Single Pass.

Hanneke et al. (2025). Data Selection for ERMs.

Backurs et al. (2017). On the Fine-Grained Complexity of Empirical Risk Minimization: Kernel Methods and Neural Networks.
