Coordinate-Wise Online Mini-Batch SGD

Updated 17 October 2025
  • Coordinate-wise online mini-batch SGD is a family of optimization algorithms that update model parameters using adaptive per-coordinate learning rates and diagonal preconditioning.
  • These methods achieve improved convergence and tighter regret bounds by dynamically scaling updates based on the cumulative geometry of past gradients.
  • The algorithm is computationally efficient, making it well suited to high-dimensional online convex optimization and large-scale streaming-data applications.

Coordinate-wise online mini-batch stochastic gradient descent (SGD) comprises a family of optimization algorithms that update model parameters in high-dimensional spaces by drawing and aggregating mini-batches online—often performing updates to individual coordinates or blocks of coordinates, with adaptive stepsizes and sometimes with problem-specific conditioning. These algorithms are grounded in rigorous regret analysis, are amenable to parallelization, and have provable benefits for online convex optimization and large-scale learning (Streeter et al., 2010). Below, the essential aspects are outlined, including adaptive per-coordinate updates, diagonal preconditioning, improved regret bounds, computational and deployment considerations, and practical applications.

1. Adaptive Per-Coordinate Updates

The fundamental innovation in coordinate-wise online mini-batch SGD lies in the use of adaptive, per-coordinate learning rates. Rather than a single global stepsize, the update rule for each coordinate $i$ exploits the cumulative geometry of past gradients. Specifically, at time $t$, the update for coordinate $i$ is

$$x_{t+1,i} = x_{t,i} - \eta\, \frac{g_{t,i}}{d_{t,i}}$$

where

$$d_{t,i} = \sqrt{\sum_{s=1}^{t} g_{s,i}^2}$$

is a per-coordinate accumulator tracking the historical squared gradient magnitudes. In vector form, the update is

$$\mathbf{x}_{t+1} = \mathbf{x}_t - \eta\, D_t^{-1} \mathbf{g}_t$$

with $D_t = \operatorname{diag}(d_{t,1}, \ldots, d_{t,n})$. This adaptive scaling ensures that frequently active (high-variance) coordinates are stepped more conservatively, while rarely active (low-variance) coordinates receive larger updates (Streeter et al., 2010).
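
The following is a minimal NumPy sketch of this update; the function name, argument layout, and the small `eps` guard against division by zero are illustrative assumptions rather than details taken from the source.

```python
import numpy as np

def per_coordinate_step(x, g, accum_sq, eta=0.1, eps=1e-12):
    """One adaptive step: x_i <- x_i - eta * g_i / sqrt(sum of past g_i^2)."""
    accum_sq += g ** 2                  # running per-coordinate sum of squared gradients
    d = np.sqrt(accum_sq)               # d_{t,i} = sqrt(sum_{s<=t} g_{s,i}^2)
    x = x - eta * g / (d + eps)         # eps guards coordinates that have never received a gradient
    return x, accum_sq
```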

2. Diagonal Preconditioning and Conditioning

By introducing a diagonal preconditioner $D_t$, the algorithm realizes online conditioning, analogous to batch algorithms that precondition with the Hessian or its diagonal. This is crucial in high-dimensional problems where coordinate scales vary drastically, or when the objective geometry is anisotropic:

  • Coordinates with large accumulated squared gradients (large $d_{t,i}$) are associated with directions of high curvature or frequent activity, and are updated cautiously.
  • Directions with small $d_{t,i}$ (inactive or flat) receive proportionally larger steps.

This scheme leverages coordinate-wise historical information and yields a form of normalization that adapts continually to the data stream.
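
As a quick self-contained check (with arbitrarily chosen numbers) that the coordinate-wise division used in practice is exactly the preconditioned vector update $D_t^{-1}\mathbf{g}_t$, at only $O(n)$ cost:

```python
import numpy as np

d = np.array([10.0, 0.5, 3.0])       # accumulated per-coordinate magnitudes d_{t,i}
g = np.array([1.0, 1.0, 1.0])        # current mini-batch gradient
D_inv = np.diag(1.0 / d)             # explicit diagonal preconditioner D_t^{-1}

# Elementwise form matches the matrix form without materializing D_t.
assert np.allclose(D_inv @ g, g / d)
print(g / d)                         # [0.1, 2.0, 0.333...]: small d_i -> larger step
```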

3. Regret Bounds and Theoretical Guarantees

The regret of the adaptive coordinate-wise online mini-batch SGD is bounded as

$$\text{Regret} = O\!\left(\sum_{i=1}^n X_i \sqrt{\sum_{t=1}^T g_{t,i}^2}\right)$$

where $X_i$ is an upper bound on the range of the $i$-th coordinate. This is never worse (up to constants) than standard online gradient descent (OGD) bounds that scale as $O(\sqrt{T})$, and it is substantially tighter when the accumulated squared gradients $\sum_{t=1}^T g_{t,i}^2$ are highly non-uniform across coordinates. The per-coordinate regret reflects the geometry of the problem, and in regimes where many coordinates are rarely activated it yields dramatically tighter bounds than non-adaptive algorithms (Streeter et al., 2010).
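
For intuition, a standard Cauchy–Schwarz step (an illustration, not a bound quoted from the source in this form) shows that the per-coordinate quantity never exceeds the norm-based quantity that drives non-adaptive analyses:

$$\sum_{i=1}^n X_i \sqrt{\sum_{t=1}^T g_{t,i}^2} \;\le\; \sqrt{\sum_{i=1}^n X_i^2}\,\sqrt{\sum_{t=1}^T \lVert \mathbf{g}_t \rVert_2^2}$$

Near-equality holds only when gradient mass is spread evenly across coordinates; when most coordinates are rarely active, the left-hand side is far smaller.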

4. Empirical Performance and Computational Efficiency

Empirical benchmarks demonstrate several key properties:

  • For problems with very high-dimensional feature spaces, adaptive per-coordinate SGD leads to faster convergence and improved generalization compared to global rate approaches.
  • Updates are efficient, requiring only $O(n)$ time and storage per iteration due to the diagonal structure of $D_t$.
  • When compared to state-of-the-art online learning algorithms (including both OGD and more exotic adaptive methods), the adaptive approach yields comparable or superior predictive accuracy, robustly handling varying feature scales in streaming or internet-scale applications.

5. Implementation Considerations and Deployment

To deploy coordinate-wise online mini-batch SGD (a minimal end-to-end sketch follows this list):

  • Track a running accumulator $d_{t,i}$ of squared gradients for each coordinate; updating it costs $O(n)$ per iteration.
  • Design the mini-batch mechanism to compute unbiased stochastic gradient vectors $\mathbf{g}_t$ at each time step.
  • Normalize updates by $D_t^{-1}$ as outlined above.
  • The step size $\eta$ can be treated as a global hyperparameter, typically tuned for robust performance; in some contexts, further per-coordinate tuning or adaptive strategies may be beneficial.
  • The algorithm is suitable for streaming, distributed, or parallelized environments, supporting large-scale deployment with minimal per-iteration resource requirements.
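
Below is a minimal end-to-end sketch of such a deployment for a linear least-squares model on a synthetic stream; the stream interface, batch size, and hyperparameter values are assumptions made for illustration, not prescriptions from the source.

```python
import numpy as np

def train_online(stream, n_features, eta=0.1, eps=1e-12):
    x = np.zeros(n_features)            # model parameters
    accum_sq = np.zeros(n_features)     # per-coordinate sum of squared gradients
    for A, y in stream:                 # A: (batch, n_features), y: (batch,)
        g = A.T @ (A @ x - y) / len(y)  # unbiased mini-batch gradient of the average squared loss
        accum_sq += g ** 2
        x -= eta * g / (np.sqrt(accum_sq) + eps)   # coordinate-wise D_t^{-1} scaling
    return x

# Illustrative usage: 200 mini-batches of size 32 from a noisy linear model.
rng = np.random.default_rng(0)
true_w = rng.normal(size=20)

def synthetic_stream(n_batches=200, batch=32):
    for _ in range(n_batches):
        A = rng.normal(size=(batch, 20))
        yield A, A @ true_w + 0.01 * rng.normal(size=batch)

w_hat = train_online(synthetic_stream(), n_features=20)
```

The only per-iteration state beyond the parameters is the accumulator vector, which is what keeps memory and compute at $O(n)$ and makes the loop straightforward to shard or parallelize.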

6. Practical Applications and Extensions

Adaptive coordinate-wise online mini-batch SGD is well-suited for:

  • Large-scale online convex optimization
  • Sparse high-dimensional machine learning problems
  • Scenarios with non-uniform feature activation or heavy-tailed gradient distributions
  • Internet-scale and streaming data applications where feature relevance and scales evolve over time

It enables both improved statistical efficiency—via regret minimization tailored to observed data geometry—and scalable, high-throughput learning in computationally constrained settings.

The approach is related to online conditioning, Adagrad-style adaptivity, and diagonal preconditioning frameworks. Its per-coordinate adaptation can be viewed as a forerunner of the adaptive learning-rate schemes that have since proliferated in deep learning and large-scale optimization, and its theoretical rigor underpins a broad class of modern stochastic optimization algorithms.


Summary Table: Properties of Coordinate-Wise Online Mini-Batch SGD

| Aspect | Technique/Formula | Benefit/Implication |
|---|---|---|
| Per-coordinate update | $x_{t+1,i} = x_{t,i} - \eta\, g_{t,i} / d_{t,i}$ | Adaptive step size per coordinate |
| Diagonal preconditioner | $D_t = \operatorname{diag}(d_{t,1}, \ldots, d_{t,n})$ | Data-driven normalization of gradient updates |
| Regret bound | $O\left(\sum_i X_i \sqrt{\sum_t g_{t,i}^2}\right)$ | Tighter theoretical guarantees than standard OGD |
| Per-iteration cost | $O(n)$ time and storage | Suitable for large-scale, high-dimensional data |

This algorithmic class, as developed and analyzed in (Streeter et al., 2010), provides both the theoretical and practical foundation for efficient and reliable online learning strategies in high-dimensional and streaming contexts.
