Coordinate-Wise Online Mini-Batch SGD

Updated 17 October 2025
  • Coordinate-wise online mini-batch SGD is a family of optimization algorithms that update model parameters using adaptive per-coordinate learning rates and diagonal preconditioning.
  • These methods achieve improved convergence and tighter regret bounds by dynamically scaling updates based on the cumulative geometry of past gradients.
  • The algorithm is computationally efficient, making it well suited to high-dimensional online convex optimization and large-scale streaming-data applications.

Coordinate-wise online mini-batch stochastic gradient descent (SGD) comprises a family of optimization algorithms that update model parameters in high-dimensional spaces by drawing and aggregating mini-batches online—often performing updates to individual coordinates or blocks of coordinates, with adaptive stepsizes and sometimes with problem-specific conditioning. These algorithms are grounded in rigorous regret analysis, are amenable to parallelization, and have provable benefits for online convex optimization and large-scale learning (Streeter et al., 2010). Below, the essential aspects are outlined, including adaptive per-coordinate updates, diagonal preconditioning, improved regret bounds, computational and deployment considerations, and practical applications.

1. Adaptive Per-Coordinate Updates

The fundamental innovation in coordinate-wise online mini-batch SGD lies in the use of adaptive, per-coordinate learning rates. Rather than a single global stepsize, the update rule for each coordinate $i$ exploits the cumulative geometry of past gradients. Specifically, at time $t$, the update for coordinate $i$ is

$$x_{t+1,i} = x_{t,i} - \eta\,\frac{g_{t,i}}{d_{t,i}}$$

where

$$d_{t,i} = \sqrt{\sum_{s=1}^{t} g_{s,i}^2}$$

is a per-coordinate accumulator tracking the historical squared gradient magnitudes. In vector form, the update is

$$\mathbf{x}_{t+1} = \mathbf{x}_t - \eta\, D_t^{-1} \mathbf{g}_t$$

with $D_t = \operatorname{diag}(d_{t,1}, \ldots, d_{t,n})$. This adaptive scaling ensures that frequently updated (high-variance) coordinates are stepped more conservatively, while rarely active (low-variance) coordinates receive larger updates (Streeter et al., 2010).
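
As a concrete illustration, the following minimal NumPy sketch implements the per-coordinate update above for an online mini-batch stream. The class name `AdaptiveCoordinateSGD`, the small `eps` guard, and the default step size are illustrative choices, not prescribed by the source.

```python
import numpy as np

class AdaptiveCoordinateSGD:
    """Minimal sketch of coordinate-wise adaptive SGD with a diagonal accumulator."""

    def __init__(self, dim, eta=0.1, eps=1e-8):
        self.x = np.zeros(dim)            # parameter vector x_t
        self.sq_grad_sum = np.zeros(dim)  # per-coordinate sum of squared gradients
        self.eta = eta                    # global step size eta
        self.eps = eps                    # guards coordinates that have not yet seen a gradient

    def step(self, grad):
        """One update x_{t+1} = x_t - eta * D_t^{-1} g_t, with D_t = diag(sqrt(sum_s g_{s,i}^2))."""
        self.sq_grad_sum += grad ** 2
        d = np.sqrt(self.sq_grad_sum) + self.eps
        self.x -= self.eta * grad / d
        return self.x
```

Here `grad` is the mini-batch gradient $\mathbf{g}_t$ evaluated at the current iterate; the accumulator `sq_grad_sum` stores $d_{t,i}^2$.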

2. Diagonal Preconditioning and Conditioning

By introducing a diagonal preconditioner $D_t$, the algorithm realizes online conditioning, analogous to batch algorithms that precondition with the Hessian or its diagonal. This is crucial in high-dimensional problems where coordinate scales vary drastically, or when the objective geometry is anisotropic:

  • Coordinates with large accumulated squared gradients (large $d_{t,i}$) are associated with directions of high curvature or frequent activity, and are updated cautiously.
  • Directions with small $d_{t,i}$ (inactive or flat) receive proportionally larger steps.

This scheme leverages coordinate-wise historical information and yields a form of normalization that adapts continually to the data stream.
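
To make the normalization effect concrete, the toy computation below (with hypothetical gradient histories, not data from the source) contrasts the effective step sizes of a coordinate that is active at every round with one that fires only occasionally.

```python
import numpy as np

eta = 0.1
# Hypothetical gradient histories over 100 rounds.
g_freq = np.ones(100)          # coordinate active every round, |g| = 1
g_rare = np.zeros(100)
g_rare[::20] = 1.0             # coordinate active only every 20th round

d_freq = np.sqrt(np.sum(g_freq ** 2))  # = 10.0
d_rare = np.sqrt(np.sum(g_rare ** 2))  # ~ 2.24

print(f"effective step, frequent coordinate: {eta / d_freq:.4f}")  # ~ 0.0100
print(f"effective step, rare coordinate:     {eta / d_rare:.4f}")  # ~ 0.0447
```

The rarely active coordinate receives roughly four times the effective step size of the frequently active one, which is exactly the data-driven normalization described above.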

3. Regret Bounds and Theoretical Guarantees

The regret of the adaptive coordinate-wise online mini-batch SGD is bounded as

$$\text{Regret} = O\!\left(\sum_{i=1}^{n} X_i \sqrt{\sum_{t=1}^{T} g_{t,i}^2}\right)$$

where $X_i$ is an upper bound on the range of the $i$-th coordinate. This bound is never worse (up to constant factors) than the standard online gradient descent (OGD) bound, which scales as $O(\sqrt{T})$, and it is substantially tighter when the accumulated squared gradients $\sum_{t=1}^{T} g_{t,i}^2$ are highly non-uniform across coordinates. The per-coordinate regret reflects the geometry of the problem, and in regimes where many coordinates are rarely activated it leads to dramatically tighter bounds than non-adaptive algorithms (Streeter et al., 2010).
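
A short way to see why the per-coordinate bound is never worse (up to constants) than the non-adaptive one is a Cauchy–Schwarz argument; the symbols $X_2 = \sqrt{\sum_i X_i^2}$ and $G_2 \ge \max_t \|\mathbf{g}_t\|_2$ below are notational shorthands introduced here for the sketch:

$$\sum_{i=1}^{n} X_i \sqrt{\sum_{t=1}^{T} g_{t,i}^2} \;\le\; \sqrt{\sum_{i=1}^{n} X_i^2}\,\sqrt{\sum_{i=1}^{n}\sum_{t=1}^{T} g_{t,i}^2} \;=\; X_2 \sqrt{\sum_{t=1}^{T} \|\mathbf{g}_t\|_2^2} \;\le\; X_2\, G_2 \sqrt{T}$$

The inequality is loose precisely when gradient mass is concentrated on a few coordinates, which is the regime where the per-coordinate bound gains the most.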

4. Empirical Performance and Computational Efficiency

Empirical benchmarks demonstrate several key properties:

  • For problems with very high-dimensional feature spaces, adaptive per-coordinate SGD leads to faster convergence and improved generalization compared to global rate approaches.
  • Updates are efficient, requiring only $O(n)$ time and storage per iteration due to the diagonal structure of $D_t$ (see the memory sketch after this list).
  • When compared to state-of-the-art online learning algorithms (including both OGD and more exotic adaptive methods), the adaptive approach yields comparable or superior predictive accuracy, robustly handling varying feature scales in streaming or internet-scale applications.
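
The back-of-the-envelope comparison below (the dimension and dtype are arbitrary assumptions) illustrates why the diagonal structure matters at scale: the accumulator needs one scalar per coordinate, whereas a full-matrix preconditioner would need $O(n^2)$ storage.

```python
import numpy as np

n = 1_000_000                      # hypothetical sparse, high-dimensional feature space

diag_bytes = np.zeros(n).nbytes    # diagonal accumulator: O(n) float64 values (~8 MB)
full_bytes = n * n * 8             # full-matrix preconditioner: O(n^2) values (~8 TB)

print(f"diagonal accumulator: {diag_bytes / 1e6:.1f} MB")
print(f"full preconditioner:  {full_bytes / 1e12:.1f} TB")
```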

5. Implementation Considerations and Deployment

To deploy coordinate-wise online mini-batch SGD:

  • Track a running sum of squared gradients (the square of $d_{t,i}$) for each coordinate; updating it costs $O(n)$ per iteration.
  • Design the mini-batch mechanism to compute unbiased stochastic gradient vectors $\mathbf{g}_t$ at each time step (a minimal end-to-end sketch follows this list).
  • Normalize updates by $D_t^{-1}$ as outlined above.
  • The step size $\eta$ can be treated as a global hyperparameter, typically tuned for robust performance; in some contexts, further per-coordinate tuning or adaptive strategies may be beneficial.
  • The algorithm is suitable for streaming, distributed, or parallelized environments, supporting large-scale deployment with minimal per-iteration resource requirements.
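
Putting the pieces together, the self-contained sketch below runs the coordinate-wise update on a synthetic streaming least-squares problem; the data generator, batch size, and step size are illustrative assumptions rather than a setup taken from the source.

```python
import numpy as np

rng = np.random.default_rng(0)
n, batch, eta, eps = 50, 32, 0.5, 1e-8
w_true = rng.normal(size=n) * (rng.random(n) < 0.2)   # sparse ground-truth weights

w = np.zeros(n)              # current iterate x_t
sq_grad_sum = np.zeros(n)    # per-coordinate accumulator of squared gradients

for t in range(2000):
    # Draw a mini-batch from the (synthetic) stream: sparse features, noisy linear targets.
    X = rng.normal(size=(batch, n)) * (rng.random((batch, n)) < 0.1)
    y = X @ w_true + 0.01 * rng.normal(size=batch)

    # Unbiased mini-batch gradient of the average squared loss.
    g = X.T @ (X @ w - y) / batch

    # Coordinate-wise adaptive update: x_{t+1} = x_t - eta * D_t^{-1} g_t.
    sq_grad_sum += g ** 2
    w -= eta * g / (np.sqrt(sq_grad_sum) + eps)

print("parameter error:", np.linalg.norm(w - w_true))
```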

6. Practical Applications and Extensions

Adaptive coordinate-wise online mini-batch SGD is well-suited for:

  • Large-scale online convex optimization
  • Sparse high-dimensional machine learning problems
  • Scenarios with non-uniform feature activation or heavy-tailed gradient distributions
  • Internet-scale and streaming data applications where feature relevance and scales evolve over time

It enables both improved statistical efficiency—via regret minimization tailored to observed data geometry—and scalable, high-throughput learning in computationally constrained settings.

The approach is related to online conditioning, Adagrad-style adaptivity, and diagonal preconditioning frameworks. Its per-coordinate adaptation can be viewed as a forerunner of the adaptive learning-rate schemes that later proliferated in deep learning and large-scale optimization, and its theoretical analysis underpins a broad class of modern stochastic optimization algorithms.


Summary Table: Properties of Coordinate-Wise Online Mini-Batch SGD

| Aspect | Technique/Formula | Benefit/Implication |
| --- | --- | --- |
| Per-coordinate update | $x_{t+1,i} = x_{t,i} - \eta\, g_{t,i} / d_{t,i}$ | Adaptive step size per coordinate |
| Diagonal preconditioner | $D_t = \operatorname{diag}(d_{t,1}, \ldots, d_{t,n})$ | Data-driven normalization of gradient updates |
| Regret bound | $O\left(\sum_i X_i \sqrt{\sum_t g_{t,i}^2}\right)$ | Tighter theoretical guarantees than standard OGD |
| Empirical efficiency | $O(n)$ time and memory per iteration | Suitable for large-scale, high-dimensional data |

This algorithmic class, as developed and analyzed in (Streeter et al., 2010), provides both the theoretical and practical foundation for efficient and reliable online learning strategies in high-dimensional and streaming contexts.
