Coordinate-Wise Online Mini-Batch SGD

Updated 17 October 2025
  • Coordinate-wise online mini-batch SGD is a family of optimization algorithms that update model parameters using adaptive per-coordinate learning rates and diagonal preconditioning.
  • These methods achieve improved convergence and tighter regret bounds by dynamically scaling updates based on the cumulative geometry of past gradients.
  • The algorithm is computationally efficient, making it well suited to high-dimensional online convex optimization and large-scale streaming-data applications.

Coordinate-wise online mini-batch stochastic gradient descent (SGD) comprises a family of optimization algorithms that update model parameters in high-dimensional spaces by drawing and aggregating mini-batches online—often performing updates to individual coordinates or blocks of coordinates, with adaptive stepsizes and sometimes with problem-specific conditioning. These algorithms are grounded in rigorous regret analysis, are amenable to parallelization, and have provable benefits for online convex optimization and large-scale learning (Streeter et al., 2010). Below, the essential aspects are outlined, including adaptive per-coordinate updates, diagonal preconditioning, improved regret bounds, computational and deployment considerations, and practical applications.

1. Adaptive Per-Coordinate Updates

The fundamental innovation in coordinate-wise online mini-batch SGD lies in the use of adaptive, per-coordinate learning rates. Rather than a single global stepsize, the update rule for each coordinate $i$ exploits the cumulative geometry of past gradients. Specifically, at time $t$, the update for coordinate $i$ is

$$x_{t+1,i} = x_{t,i} - \eta\,\frac{g_{t,i}}{d_{t,i}}$$

where

$$d_{t,i} = \sqrt{\sum_{s=1}^{t} g_{s,i}^2}$$

is a per-coordinate accumulator tracking the historical squared gradient magnitudes. In vector form, the update is

$$\mathbf{x}_{t+1} = \mathbf{x}_t - \eta\, D_t^{-1} \mathbf{g}_t$$

with $D_t = \operatorname{diag}(d_{t,1}, \ldots, d_{t,n})$. This adaptive scaling ensures that frequently updated (high-variance) coordinates are stepped more conservatively, while rarely active (low-variance) coordinates receive larger updates (Streeter et al., 2010).
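
As a concrete illustration, the following minimal NumPy sketch implements the per-coordinate update above for an online mini-batch stream. The class name `AdaptiveCoordinateSGD`, the small `eps` guard, and the default step size are illustrative choices, not prescribed by the source.

```python
import numpy as np

class AdaptiveCoordinateSGD:
    """Minimal sketch of coordinate-wise adaptive SGD with a diagonal accumulator."""

    def __init__(self, dim, eta=0.1, eps=1e-8):
        self.x = np.zeros(dim)            # parameter vector x_t
        self.sq_grad_sum = np.zeros(dim)  # per-coordinate sum of squared gradients
        self.eta = eta                    # global step size eta
        self.eps = eps                    # guards coordinates that have not yet seen a gradient

    def step(self, grad):
        """One update x_{t+1} = x_t - eta * D_t^{-1} g_t, with D_t = diag(sqrt(sum_s g_{s,i}^2))."""
        self.sq_grad_sum += grad ** 2
        d = np.sqrt(self.sq_grad_sum) + self.eps
        self.x -= self.eta * grad / d
        return self.x
```

Here `grad` is the mini-batch gradient $\mathbf{g}_t$ evaluated at the current iterate; the accumulator `sq_grad_sum` stores $d_{t,i}^2$.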

2. Diagonal Preconditioning and Conditioning

By introducing a diagonal preconditioner $D_t$, the algorithm realizes online conditioning, analogous to batch algorithms that precondition with the Hessian or its diagonal. This is crucial in high-dimensional problems where coordinate scales vary drastically, or when the objective geometry is anisotropic:

  • Coordinates with large accumulated squared gradients (large $d_{t,i}$) are associated with directions of high curvature or frequent activity, and are updated cautiously.
  • Directions with small $d_{t,i}$ (inactive or flat) receive proportionally larger steps.

This scheme leverages coordinate-wise historical information and yields a form of normalization that adapts continually to the data stream.
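
To make the normalization effect concrete, the toy computation below (with hypothetical gradient histories, not data from the source) contrasts the effective step sizes of a coordinate that is active at every round with one that fires only occasionally.

```python
import numpy as np

eta = 0.1
# Hypothetical gradient histories over 100 rounds.
g_freq = np.ones(100)          # coordinate active every round, |g| = 1
g_rare = np.zeros(100)
g_rare[::20] = 1.0             # coordinate active only every 20th round

d_freq = np.sqrt(np.sum(g_freq ** 2))  # = 10.0
d_rare = np.sqrt(np.sum(g_rare ** 2))  # ~ 2.24

print(f"effective step, frequent coordinate: {eta / d_freq:.4f}")  # ~ 0.0100
print(f"effective step, rare coordinate:     {eta / d_rare:.4f}")  # ~ 0.0447
```

The rarely active coordinate receives roughly four times the effective step size of the frequently active one, which is exactly the data-driven normalization described above.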

3. Regret Bounds and Theoretical Guarantees

The regret of the adaptive coordinate-wise online mini-batch SGD is bounded as

$$\text{Regret} = O\!\left(\sum_{i=1}^{n} X_i \sqrt{\sum_{t=1}^{T} g_{t,i}^2}\right)$$

where $X_i$ is an upper bound on the range of the $i$-th coordinate. This bound is never worse (up to constant factors) than the standard online gradient descent (OGD) bound, which scales as $O(\sqrt{T})$, and it is substantially tighter when the accumulated squared gradients $\sum_{t=1}^{T} g_{t,i}^2$ are highly non-uniform across coordinates. The per-coordinate regret reflects the geometry of the problem, and in regimes where many coordinates are rarely activated it leads to dramatically tighter bounds than non-adaptive algorithms (Streeter et al., 2010).
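
A short way to see why the per-coordinate bound is never worse (up to constants) than the non-adaptive one is a Cauchy–Schwarz argument; the symbols $X_2 = \sqrt{\sum_i X_i^2}$ and $G_2 \ge \max_t \|\mathbf{g}_t\|_2$ below are notational shorthands introduced here for the sketch:

$$\sum_{i=1}^{n} X_i \sqrt{\sum_{t=1}^{T} g_{t,i}^2} \;\le\; \sqrt{\sum_{i=1}^{n} X_i^2}\,\sqrt{\sum_{i=1}^{n}\sum_{t=1}^{T} g_{t,i}^2} \;=\; X_2 \sqrt{\sum_{t=1}^{T} \|\mathbf{g}_t\|_2^2} \;\le\; X_2\, G_2 \sqrt{T}$$

The inequality is loose precisely when gradient mass is concentrated on a few coordinates, which is the regime where the per-coordinate bound gains the most.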

4. Empirical Performance and Computational Efficiency

Empirical benchmarks demonstrate several key properties:

  • For problems with very high-dimensional feature spaces, adaptive per-coordinate SGD leads to faster convergence and improved generalization compared to global rate approaches.
  • Updates are efficient, requiring only $O(n)$ time and storage per iteration due to the diagonal structure of $D_t$ (see the memory sketch after this list).
  • When compared to state-of-the-art online learning algorithms (including both OGD and more exotic adaptive methods), the adaptive approach yields comparable or superior predictive accuracy, robustly handling varying feature scales in streaming or internet-scale applications.
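
The back-of-the-envelope comparison below (the dimension and dtype are arbitrary assumptions) illustrates why the diagonal structure matters at scale: the accumulator needs one scalar per coordinate, whereas a full-matrix preconditioner would need $O(n^2)$ storage.

```python
import numpy as np

n = 1_000_000                      # hypothetical sparse, high-dimensional feature space

diag_bytes = np.zeros(n).nbytes    # diagonal accumulator: O(n) float64 values (~8 MB)
full_bytes = n * n * 8             # full-matrix preconditioner: O(n^2) values (~8 TB)

print(f"diagonal accumulator: {diag_bytes / 1e6:.1f} MB")
print(f"full preconditioner:  {full_bytes / 1e12:.1f} TB")
```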

5. Implementation Considerations and Deployment

To deploy coordinate-wise online mini-batch SGD:

  • Track a running sum of squared gradients (the square of $d_{t,i}$) for each coordinate; updating it costs $O(n)$ per iteration.
  • Design the mini-batch mechanism to compute unbiased stochastic gradient vectors $\mathbf{g}_t$ at each time step (a minimal end-to-end sketch follows this list).
  • Normalize updates by $D_t^{-1}$ as outlined above.
  • The step size $\eta$ can be treated as a global hyperparameter, typically tuned for robust performance; in some contexts, further per-coordinate tuning or adaptive strategies may be beneficial.
  • The algorithm is suitable for streaming, distributed, or parallelized environments, supporting large-scale deployment with minimal per-iteration resource requirements.
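
Putting the pieces together, the self-contained sketch below runs the coordinate-wise update on a synthetic streaming least-squares problem; the data generator, batch size, and step size are illustrative assumptions rather than a setup taken from the source.

```python
import numpy as np

rng = np.random.default_rng(0)
n, batch, eta, eps = 50, 32, 0.5, 1e-8
w_true = rng.normal(size=n) * (rng.random(n) < 0.2)   # sparse ground-truth weights

w = np.zeros(n)              # current iterate x_t
sq_grad_sum = np.zeros(n)    # per-coordinate accumulator of squared gradients

for t in range(2000):
    # Draw a mini-batch from the (synthetic) stream: sparse features, noisy linear targets.
    X = rng.normal(size=(batch, n)) * (rng.random((batch, n)) < 0.1)
    y = X @ w_true + 0.01 * rng.normal(size=batch)

    # Unbiased mini-batch gradient of the average squared loss.
    g = X.T @ (X @ w - y) / batch

    # Coordinate-wise adaptive update: x_{t+1} = x_t - eta * D_t^{-1} g_t.
    sq_grad_sum += g ** 2
    w -= eta * g / (np.sqrt(sq_grad_sum) + eps)

print("parameter error:", np.linalg.norm(w - w_true))
```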

6. Practical Applications and Extensions

Adaptive coordinate-wise online mini-batch SGD is well-suited for:

  • Large-scale online convex optimization
  • Sparse high-dimensional machine learning problems
  • Scenarios with non-uniform feature activation or heavy-tailed gradient distributions
  • Internet-scale and streaming data applications where feature relevance and scales evolve over time

It enables both improved statistical efficiency—via regret minimization tailored to observed data geometry—and scalable, high-throughput learning in computationally constrained settings.

The approach is related to online conditioning, Adagrad-style adaptivity, and diagonal preconditioning frameworks. Its per-coordinate adaptation can be viewed as a forerunner of the adaptive learning-rate schemes that later proliferated in deep learning and large-scale optimization, and its theoretical analysis underpins a broad class of modern stochastic optimization algorithms.


Summary Table: Properties of Coordinate-Wise Online Mini-Batch SGD

| Aspect | Technique/Formula | Benefit/Implication |
| --- | --- | --- |
| Per-coordinate update | $x_{t+1,i} = x_{t,i} - \eta\, g_{t,i} / d_{t,i}$ | Adaptive step size per coordinate |
| Diagonal preconditioner | $D_t = \operatorname{diag}(d_{t,1}, \ldots, d_{t,n})$ | Data-driven normalization of gradient updates |
| Regret bound | $O\left(\sum_i X_i \sqrt{\sum_t g_{t,i}^2}\right)$ | Tighter theoretical guarantees than standard OGD |
| Empirical efficiency | $O(n)$ time and memory per iteration | Suitable for large-scale, high-dimensional data |

This algorithmic class, as developed and analyzed in (Streeter et al., 2010), provides both the theoretical and practical foundation for efficient and reliable online learning strategies in high-dimensional and streaming contexts.
