
Mini-Batch Stochastic Gradient Descent

Updated 15 December 2025
  • Mini-Batch SGD is an optimization algorithm that approximates the full gradient at each step with a randomly sampled subset of examples, reducing gradient-estimate variance relative to single-example SGD and enabling parallel updates.
  • The method controls the bias-variance tradeoff by adjusting batch size, where larger batches lower gradient variance and smaller ones allow more frequent updates.
  • Advanced sampling techniques such as stratified, antithetic, and DPP sampling further enhance convergence speed and improve generalization in various applications.

Mini-batch Stochastic Gradient Descent (SGD) is a foundational optimization algorithm in large-scale machine learning, extensively adopted for both convex and nonconvex problems due to its capacity to balance computational efficiency and statistical accuracy. The core idea is to approximate the full gradient at each step using a randomly sampled mini-batch of training examples, enabling both variance reduction and parallelization.

1. Formalization and Core Algorithm

At each iteration $t$, mini-batch SGD forms the update

$$\theta_{t+1} = \theta_t - \eta_t\, g_t, \qquad g_t = \frac{1}{b_t}\sum_{i\in B_t} \nabla f_i(\theta_t),$$

where $B_t$ is a batch of $b_t$ indices sampled (usually uniformly) from $\{1,\dots,n\}$, $\eta_t$ is the step size, and $f_i$ is the per-example loss (Qian et al., 2013, Qian et al., 2020).

The mini-batch gradient $g_t$ is an unbiased estimator of the full gradient when $B_t$ is sampled uniformly, with variance scaling inversely in $b_t$:

$$\mathrm{Var}(g_t) = \frac{1}{b_t}\,\mathrm{Var}_i[\nabla f_i(\theta_t)].$$

In practical variants, $b_t$ can be fixed, scheduled, or adapted based on optimization signals (Umeda et al., 7 Aug 2025, Sievert et al., 2019).
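
The update above reduces to a few lines of code. Below is a minimal NumPy sketch of the basic loop, assuming a user-supplied `grad_fn(theta, X_batch, y_batch)` that returns the average per-example gradient over the batch; the function names and the constant step size are illustrative, not taken from any of the cited papers.

```python
import numpy as np

def minibatch_sgd(grad_fn, theta0, X, y, batch_size=32, eta=0.1, n_iters=1000, seed=0):
    """Plain mini-batch SGD: theta <- theta - eta * (average gradient over a random batch)."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float).copy()
    n = X.shape[0]
    for _ in range(n_iters):
        idx = rng.choice(n, size=batch_size, replace=False)  # B_t: uniform, without replacement
        g = grad_fn(theta, X[idx], y[idx])                    # g_t = (1/b_t) sum of per-example grads
        theta -= eta * g
    return theta

# Example per-example loss: f_i(theta) = 0.5 * (x_i @ theta - y_i)^2
def least_squares_grad(theta, Xb, yb):
    return Xb.T @ (Xb @ theta - yb) / len(yb)
```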

2. Variance, Convergence, and the Role of Batch Size

The variance of the stochastic gradient estimator is strictly decreasing in the batch size $b$, as rigorously established for linear and deep linear networks (Qian et al., 2020). For any fixed $t$, the variance is a polynomial in $1/b$ with no constant term, implying monotonic decay:

$$\mathrm{Var}(g_t^b) = \beta_1\frac{1}{b} + \beta_2\frac{1}{b^2} + \cdots$$

This analytic dependence underpins two key operational effects:

  • Variance reduction via larger $b$: Increasing the batch size reduces stochasticity in the update, allowing for more aggressive learning rates and lower steady-state error in convex problems (Jain et al., 2016, Zhang et al., 2015).
  • Bias–variance tradeoff: While larger $b$ reduces variance, it also decreases the number of parameter updates per epoch, potentially slowing practical convergence for excessively large batches (Qian et al., 2013, Qian et al., 2020).

In convex and smooth settings, the convergence rate of mini-batch SGD with a fixed batch size $b$ and constant step size $\eta$ is $O(1/T) + O(1/b)$ (Qian et al., 2013), while for nonconvex objectives a rate of $O(1/\sqrt{T})$ is typical. Schedulers that increase the batch size during training can asymptotically reduce the variance to zero, matching full gradient descent in the limit as $b \to n$ (Umeda et al., 13 Sep 2024, Sievert et al., 2019).
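
The $1/b$ scaling is easy to check numerically. The sketch below estimates the variance of the mini-batch gradient for a toy least-squares problem at several batch sizes; the data-generation details are arbitrary and only serve to illustrate the trend.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5000, 10
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.5 * rng.normal(size=n)
theta = np.zeros(d)                                # evaluate the estimator at a fixed point

per_example = X * (X @ theta - y)[:, None]         # rows: grad f_i(theta) for the squared loss

for b in (1, 8, 64, 512):
    draws = np.array([per_example[rng.choice(n, b, replace=False)].mean(axis=0)
                      for _ in range(2000)])
    print(b, draws.var(axis=0).sum())              # total variance shrinks roughly like 1/b
```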

3. Mini-Batch Sampling Strategies and Advanced Variance Control

Beyond uniform sampling, several sampling and batch construction strategies have been developed to further accelerate convergence by reducing the variance of the mini-batch gradient estimator.

  • Stratified Sampling: The dataset is partitioned into $k$ clusters or strata with low within-cluster variance, and the mini-batch is drawn proportionally to cluster size and intra-cluster dispersion (see the sketch after this list). This gives a strictly lower upper bound on gradient variance than uniform sampling and yields empirically faster convergence (30–100% fewer epochs) (Zhao et al., 2014).
  • Antithetic Sampling: Pairs of examples with negatively correlated gradients are assigned to batches, introducing negative covariance in the gradient sum. This leads to a dramatic reduction in variance, particularly notable in binary classification with precomputed antithetic tables (Liu et al., 2018).
  • Determinantal Point Processes (DPP): DPP sampling generates diverse (in feature space) mini-batches by maximizing mutual dissimilarity. DM-SGD provably reduces the off-diagonal covariance contributions in the gradient estimator and empirically yields improved generalization and faster convergence, especially in imbalanced settings (Zhang et al., 2017).
  • Typicality Sampling: Samples are ranked and batches are biased towards regions of high data-density (typicality), as determined by, e.g., a t-SNE embedding followed by density estimation. This strict variance reduction is realized even at the potential cost of introducing minor bias, leading to improved linear convergence in convex settings (Peng et al., 2019).
  • Variance-Reduced Mini-Batch Methods: Semi-stochastic and variance-reduced mini-batch algorithms (e.g., PS2GD, SVRG) alternate full-gradient computation epochs with inner mini-batch steps using control variates, yielding linear convergence under weak strong convexity, with the per-iteration gradient variance scaling as $O(1/b)$ (Liu et al., 2016).
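
As a concrete illustration of the first strategy, the sketch below builds a stratified mini-batch by clustering the data once with k-means and drawing from each cluster in proportion to its size. This is a simplified stand-in for the scheme of Zhao et al. (2014), which also weights strata by intra-cluster dispersion; the cluster count and helper names are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def make_strata(X, n_clusters=10, seed=0):
    """Cluster the data once; the resulting strata are reused for every batch draw."""
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(X)
    return [np.flatnonzero(labels == c) for c in range(n_clusters)]

def stratified_batch(strata, n_total, batch_size, rng):
    """Draw from each stratum in proportion to its size (at least one index per stratum)."""
    idx = []
    for members in strata:
        quota = max(1, round(batch_size * len(members) / n_total))
        idx.extend(rng.choice(members, size=min(quota, len(members)), replace=False))
    idx = np.array(idx)
    rng.shuffle(idx)
    return idx[:batch_size]
```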

4. Dynamic Scheduling and Adaptation of Batch Size/Learning Rate

Optimal performance in mini-batch SGD is sensitive to the scheduling of batch size and learning rate.

  • Critical Batch Size: There exists a theoretically optimal batch size $b_\epsilon^* \sim O(1/\epsilon^2)$ minimizing the stochastic first-order oracle (SFO) complexity for reaching a target gradient norm $\epsilon$. Adaptively tuning batch size and learning rate, either linearly or exponentially (jointly increasing both), accelerates convergence and can achieve geometrically rapid decay in the gradient norm (Umeda et al., 7 Aug 2025, Umeda et al., 13 Sep 2024).
  • Adaptive Schedulers: Algorithms that grow the batch size in proportion to the decreasing training loss (or gradient norm), e.g., $b_k = c/(F(w_k)-F^*)$, match the iteration complexity of gradient descent and the total cost of standard SGD (Sievert et al., 2019). Practical implementations can use rolling-average proxies for the loss to reduce overhead (a minimal sketch of this rule appears after the table below).
  • Training Time Optimization: Empirically, the total number of SGD updates required to reach an error threshold scales as N(b)=N+α/bN(b) = N_\infty + \alpha/b. Minimizing wall-clock time involves selecting bb to saturate hardware throughput, typically at the "knee" of the t(b)t(b) curve (Perrone et al., 2019).

| Scheduler | LR Policy | Batch Policy | Convergence Rate |
|---|---|---|---|
| Constant $b$, decaying LR | $\eta_t \downarrow$ | $b$ constant | $O(1/\sqrt{T})$ |
| Increasing $b$, decaying LR | $\eta_t \downarrow$ | $b_t \uparrow$ | $O(1/\sqrt{T})$, lower variance |
| Jointly increasing $b$ and LR | $\eta_t \uparrow$ | $b_t \uparrow$ | Geometric (per schedule) |

Schedulers that jointly increase both quantities, or that warm up the learning rate before decaying it, accelerate the decay of the gradient norm and reduce the final risk (Umeda et al., 13 Sep 2024, Umeda et al., 7 Aug 2025).
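
To make the scheduling policies concrete, here is a minimal sketch of a loss-proportional batch-size rule in the spirit of Sievert et al. (2019). The proportionality constant, the rolling-average window, and the use of the current mini-batch loss in place of $F(w_k)-F^*$ are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np
from collections import deque

def adaptive_batch_sgd(loss_fn, grad_fn, theta0, X, y,
                       eta=0.05, c=64.0, b_min=8, b_max=4096,
                       n_iters=500, window=10, seed=0):
    """Grow the batch size roughly as c / (rolling-average loss), clipped to [b_min, b_max]."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float).copy()
    n = X.shape[0]
    recent = deque(maxlen=window)
    for _ in range(n_iters):
        loss_proxy = np.mean(recent) if recent else loss_fn(theta, X, y)
        b = int(np.clip(c / max(loss_proxy, 1e-8), b_min, min(b_max, n)))
        idx = rng.choice(n, size=b, replace=False)
        theta -= eta * grad_fn(theta, X[idx], y[idx])
        recent.append(loss_fn(theta, X[idx], y[idx]))   # cheap mini-batch loss as proxy
    return theta
```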

5. Parallelization, Model Averaging, and Hardware Considerations

Mini-batch SGD is inherently parallelizable. Averaging gradients over larger batches enables synchronous parallel updates, and tail-averaging (suffix-averaging) of iterates can further reduce variance in the output solution.

  • Parallel Speedup: For least squares regression, mini-batching combined with an appropriate step size yields a near-linear reduction in the number of serial updates up to a problem-dependent threshold batch size $b^* = 1 + R^2/\|H\|_2$. Beyond this, further increases in $b$ yield only sublinear speedup and diminishing returns in variance reduction, motivating a careful tradeoff (Jain et al., 2016).
  • Model Averaging: Parameter mixing/averaging across $P$ independent learners achieves the ideal $1/P$ scaling of the minimax excess risk after a suitable burn-in, exploiting both statistical efficiency and distributed computation (Jain et al., 2016); a short averaging sketch follows this list.
  • System-Level Optimization: Practical selection of bb must account for hardware throughput; the optimal batch coincides with the batch size that saturates compute resources, balancing algorithmic and system-level bottlenecks (Perrone et al., 2019).
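
The sketch below illustrates the two averaging ideas in this section: tail-averaging (suffix-averaging) the last fraction of iterates from a single run, and one-shot averaging of the final parameters of several independent runs. The `run_fn` argument stands for any SGD routine returning a parameter vector, and the averaging fraction is arbitrary.

```python
import numpy as np

def tail_average(iterates, frac=0.5):
    """Suffix (tail) averaging: average the last `frac` of the recorded iterates."""
    iterates = np.asarray(iterates)
    start = int((1 - frac) * len(iterates))
    return iterates[start:].mean(axis=0)

def model_average(run_fn, n_workers=4, seed=0):
    """One-shot parameter mixing: average the outputs of P independent SGD runs."""
    thetas = [run_fn(seed=seed + p) for p in range(n_workers)]   # run_fn: any SGD routine
    return np.mean(thetas, axis=0)
```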

6. Extensions and Domain-Specific Applications

The mini-batch SGD paradigm extends to specialized settings and complex models:

  • Distance Metric Learning: Mini-batch SGD amortizes expensive PSD projections by updating on a batch of triplet constraints, with smooth surrogate losses providing $O(1/N)$ convergence and 5–10× wall-clock speedups (Qian et al., 2013).
  • Gaussian Process Hyperparameter Learning: Mini-batch SGD provably fits Gaussian process hyperparameters with convergence rate $O(1/K) + O(m^{-1/2+\epsilon})$ under mild kernel eigendecay assumptions, enabling scalable GP inference (Chen et al., 2021).
  • Quantum Measurement Tomography: Mini-batch SGD, with advanced parameterizations (e.g., Stiefel manifold, Hermitian normalization), achieves superior scalability and convergence for reconstructing quantum measurement operators compared to constrained convex optimization (Gaikwad et al., 19 Nov 2025).
  • Constrained and Composite Objectives: For problems where only part of the objective is amenable to stochastic approximation, mini-batch SGD achieves linear convergence to a shrinking neighborhood characterized by batch size and properties of the deterministic term (Li et al., 3 Sep 2025).

7. Statistical Properties and Generalization Dynamics

The stochastic dynamics of mini-batch SGD, including its diffusion in parameter space, play a central role in implicit regularization and generalization performance.

  • Noise as Effective Temperature: The algorithmic noise introduced by mini-batch sampling can be quantified as an effective temperature $T_{\rm eff}$, scaling like $\eta(1-b)/b$ with $b$ the mini-batch fraction (Mignacco et al., 2021). Higher "noise temperatures" (smaller $b$ or larger $\eta$) facilitate exploration of broad, flat minima, correlating with improved robustness.
  • Fokker-Planck Analysis and Minima Sharpness: Continuum approximations via SDEs and Fokker-Planck equations show that small-to-moderate mini-batch sizes facilitate escape from sharp minima, while very large batches (small noise) cause long trapping in sharp minima with poor generalization (a worked sketch of this continuum argument follows this list). In the stationary regime, the solution distribution $p_\infty(\theta) \propto \exp\!\left(-2B\,L(\theta)/(\eta\sigma^2)\right)$ concentrates on flat minima, but the convergence rate to stationarity scales as $1/B$ (Dai et al., 2021).
  • Quadratic Locality of Loss and SGD Averaging: Fixed mini-batch losses along SGD trajectories are locally convex quadratic, enabling surprisingly accurate one-step minimization per batch with sufficiently large step size. Stationarity and averaging (e.g., SWA, EMA) are theoretically linked by implicit learning rate reductions, further biasing SGD toward flat regions in the objective (Sandler et al., 2023).
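
A worked sketch of the continuum argument, under the common simplifying assumption of isotropic, state-independent gradient noise with per-example variance $\sigma^2$: mini-batch SGD with batch size $B$ and learning rate $\eta$ is approximated by the SDE

$$d\theta_t = -\nabla L(\theta_t)\,dt + \sqrt{\frac{\eta\,\sigma^2}{B}}\,dW_t,$$

whose Fokker-Planck equation has the Gibbs stationary density

$$p_\infty(\theta) \propto \exp\!\left(-\frac{2B\,L(\theta)}{\eta\,\sigma^2}\right),$$

so a larger ratio $\eta/B$ (a hotter effective temperature) spreads probability mass away from narrow, sharp minima toward broad, flat ones, consistent with the bullet above.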

References

  • (Qian et al., 2013) Efficient Distance Metric Learning by Adaptive Sampling and Mini-Batch Stochastic Gradient Descent (SGD)
  • (Zhao et al., 2014) Accelerating Minibatch Stochastic Gradient Descent using Stratified Sampling
  • (Jain et al., 2016) Parallelizing Stochastic Gradient Descent for Least Squares Regression: mini-batching, averaging, and model misspecification
  • (Liu et al., 2016) Projected Semi-Stochastic Gradient Descent Method with Mini-Batch Scheme under Weak Strong Convexity Assumption
  • (Zhang et al., 2017) Determinantal Point Processes for Mini-Batch Diversification
  • (Liu et al., 2018) Accelerating Stochastic Gradient Descent Using Antithetic Sampling
  • (Peng et al., 2019) Accelerating Minibatch Stochastic Gradient Descent using Typicality Sampling
  • (Sievert et al., 2019) Improving the convergence of SGD through adaptive batch sizes
  • (Perrone et al., 2019) Optimal Mini-Batch Size Selection for Fast Gradient Descent
  • (Qian et al., 2020) The Impact of the Mini-batch Size on the Variance of Gradients in Stochastic Gradient Descent
  • (Chen et al., 2021) Gaussian Process Inference Using Mini-batch Stochastic Gradient Descent: Convergence Guarantees and Empirical Benefits
  • (Dai et al., 2021) On Large Batch Training and Sharp Minima: A Fokker-Planck Perspective
  • (Mignacco et al., 2021) The effective noise of Stochastic Gradient Descent
  • (Sandler et al., 2023) Training trajectories, mini-batch losses and the curious role of the learning rate
  • (Umeda et al., 13 Sep 2024) Increasing Both Batch Size and Learning Rate Accelerates Stochastic Gradient Descent
  • (Umeda et al., 7 Aug 2025) Adaptive Batch Size and Learning Rate Scheduler for Stochastic Gradient Descent Based on Minimization of Stochastic First-order Oracle Complexity
  • (Li et al., 3 Sep 2025) Stochastic versus Deterministic in Stochastic Gradient Descent
  • (Gaikwad et al., 19 Nov 2025) Quantum measurement tomography with mini-batch stochastic gradient descent