Online Stochastic Batch Learning (OSBL)

Updated 16 May 2026

OSBL is a framework where learners process data in mini-batches within an online streaming environment, balancing adaptivity and computational cost.
It unifies settings such as stochastic optimization, bandits, neural training, and mixture modeling through delayed, batched feedback updates.
Selecting the optimal batch size in OSBL is crucial to trade off between sample adaptivity, computational efficiency, and statistical accuracy.

Online Stochastic Batch Learning (OSBL) refers to a broad collection of schemes where learners interact with data in mini-batches, but operate within an online or streaming regime. OSBL unifies a range of settings, including stochastic convex optimization, non-convex deep learning, stochastic bandits, mixture modeling, and online resource allocation. In all cases, the learner updates model or policy parameters only intermittently (at batch boundaries or using batched feedback), trading off sample adaptivity, computational efficiency, and statistical accuracy.

1. Formal Definitions and Modeling Foundations

The OSBL paradigm typically assumes an environment where data arrives sequentially, but the learner receives feedback or updates parameters only after accumulating a batch of $b$ instances. The canonical OSBL structures are:

Online Stochastic Bandit Setting: The agent selects actions over $T$ rounds, partitioned into $B = T/b$ batches of size $b$, and receives cumulative feedback only at batch boundaries. The agent’s policy $\pi^b$ must select actions in each batch using only information from the previous batches, so feedback is delayed and within-batch decisions are non-adaptive (Provodin et al., 2021, Provodin et al., 2022).
Stochastic Optimization / SGD: In streaming optimization, at each time $t$ the learner observes a mini-batch of $n_t$ samples and performs a stochastic gradient step using only that batch. Batch size $n_t$ may be constant or time-varying (Godichon-Baggioni et al., 2022).
Non-convex Neural Training: Batches are formed dynamically by online selection: data points with higher loss may be sampled preferentially when constructing each mini-batch, which makes the training dynamics non-uniform but adaptive to the current state of the model (Loshchilov et al., 2015).
Batch Online Learning for Click Prediction: Data streams are subdivided into temporal batches (e.g., daily), and the learner processes each batch via early stopping or proximally regularized updates to balance historical knowledge against recent data (Iyer et al., 2018).
Online EM and Mixture Models: The online EM update for mixture models is extended by replacing singleton data arrivals with randomly subsampled mini-batches, yielding Robbins–Monro stochastic approximations with batch feedback (Nguyen et al., 2019).

All models must specify: batch formation (static/dynamic, fixed/variable sizes), feedback structure (delayed, aggregated), and policy update constraints (history dependence restricted to batch boundaries).
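The batched bandit structure described above can be sketched in a few lines. The example below is a minimal illustration, not the implementation of any cited work: the class `BernoulliThompson`, the function `run_batched`, the arm means, and the horizon are all assumptions chosen for the demonstration. It shows the one property that defines batching here: the policy's statistics are frozen inside a batch and refreshed only at the batch boundary.

```python
import random


class BernoulliThompson:
    """Base policy: Beta-Bernoulli Thompson Sampling over a fixed set of arms."""

    def __init__(self, n_arms, rng):
        self.alpha = [1.0] * n_arms  # 1 + observed successes per arm
        self.beta = [1.0] * n_arms   # 1 + observed failures per arm
        self.rng = rng

    def select(self):
        # Draw one posterior sample per arm and play the arm with the largest draw.
        draws = [self.rng.betavariate(a, b) for a, b in zip(self.alpha, self.beta)]
        return max(range(len(draws)), key=draws.__getitem__)

    def update(self, arm, reward):
        if reward:
            self.alpha[arm] += 1.0
        else:
            self.beta[arm] += 1.0


def run_batched(policy, arm_means, horizon, batch_size, rng):
    """Play `horizon` rounds, releasing rewards to `policy` only at batch boundaries.

    Returns (pseudo-regret, number of policy updates).
    """
    best = max(arm_means)
    regret, n_updates, buffer = 0.0, 0, []
    for t in range(1, horizon + 1):
        arm = policy.select()  # relies on statistics from earlier batches only
        reward = rng.random() < arm_means[arm]
        buffer.append((arm, reward))
        regret += best - arm_means[arm]
        if t % batch_size == 0 or t == horizon:
            for a, r in buffer:  # delayed feedback arrives all at once
                policy.update(a, r)
            buffer.clear()
            n_updates += 1
    return regret, n_updates


rng = random.Random(0)
means = [0.3, 0.5, 0.7]  # hypothetical arm success probabilities
policy = BernoulliThompson(len(means), rng)
regret, n_updates = run_batched(policy, means, horizon=2000, batch_size=50, rng=rng)
```

Setting `batch_size=1` recovers the fully online policy, so the same function can be used to compare regret across batch sizes on a simulated problem.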

2. Regret, Convergence, and Statistical Guarantees

OSBL induces modified statistical and computational properties compared to the purely online (per-sample) or offline (full batch) settings.

Regret in Batched Bandits: Fix $K$ arms, horizon $T$, and batch size $b$. For any base policy $\pi$ (UCB, Thompson Sampling, etc.), batching induces regret at most $b$ times the $T/b$-step regret of $\pi$:

$$R_T(\pi^b) \le b \, R_{T/b}(\pi)$$

For policies with $R_n(\pi) = O(\sqrt{n})$, this implies $R_T(\pi^b) = O(\sqrt{bT})$ (Provodin et al., 2021, Provodin et al., 2022).

Stochastic Optimization: In streaming OSBL-SGD, convergence bounds for the mean squared error $T$ 8 depend on batch scaling, data dependence, and Polyak-Ruppert averaging. With batch size $T$ 9, convergence is:

$B$ 0

For i.i.d. (unbiased) gradients and appropriate averaging, the statistical rate approaches $O(1/N_t)$, where $N_t = \sum_{i \le t} n_i$ is the number of samples seen so far; this is the offline-optimal rate (Godichon-Baggioni et al., 2022).
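As a rough illustration of streaming mini-batch SGD with Polyak–Ruppert averaging, the sketch below estimates the mean of a stream by minimizing $f(\theta) = \mathbb{E}[(\theta - x)^2]/2$. The toy objective, the function name `streaming_sgd`, the step-size form $\gamma_t = \texttt{lr0} \cdot t^{-\texttt{decay}}$, and the choice to weight the average by batch size are assumptions made for this example, not details taken from the cited analysis.

```python
import random


def streaming_sgd(stream_batches, lr0=1.0, decay=0.66):
    """Mini-batch SGD on f(theta) = E[(theta - x)^2] / 2 with averaged iterates.

    Returns (last iterate, running average of iterates weighted by batch size).
    """
    theta, theta_bar, n_seen = 0.0, 0.0, 0
    for t, batch in enumerate(stream_batches, start=1):
        n_t = len(batch)
        grad = theta - sum(batch) / n_t       # gradient estimated from this batch only
        theta -= lr0 * t ** (-decay) * grad   # step size gamma_t = lr0 * t^(-decay)
        n_seen += n_t
        # Incremental form of (sum_i n_i * theta_i) / N_t.
        theta_bar += (n_t / n_seen) * (theta - theta_bar)
    return theta, theta_bar


rng = random.Random(0)
# Hypothetical stream: 500 batches of 8 Gaussian samples with true mean 3.0.
batches = [[rng.gauss(3.0, 1.0) for _ in range(8)] for _ in range(500)]
last, averaged = streaming_sgd(batches)
```

Passing batches of different lengths exercises the time-varying case without any change to the function, since each batch contributes to the average in proportion to its size.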

Non-convex Neural Network Training: Online batch selection (by loss rank) has no formal convergence proof under non-uniform sampling, but empirical evidence shows similar or improved generalization and roughly $5\times$ faster loss reduction compared to uniform random batching (Loshchilov et al., 2015).
Online EM with Mini-batch: Truncated Robbins–Monro with batch EM converges almost surely to stationary points of the empirical log-likelihood, under standard regularity and truncation to bounded domains (Nguyen et al., 2019).
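The mini-batch online EM idea can be illustrated on a deliberately small case. The sketch below assumes a two-component, one-dimensional Gaussian mixture with known unit variance, tracks running sufficient statistics with a Robbins–Monro step $\gamma_t = (t+1)^{-0.6}$, and uses a simple clamp as a crude stand-in for the truncation step. All names, constants, and the clamping rule are illustrative assumptions, not the procedure of the cited paper.

```python
import math
import random


def minibatch_online_em(batches, mu_init, step_power=0.6, sigma=1.0):
    """Mini-batch online EM for a 2-component 1-D Gaussian mixture (known sigma).

    Tracks running sufficient statistics s0[k] ~ E[z_k] and s1[k] ~ E[z_k * x].
    Returns (mixture weights, component means).
    """
    weights, mu = [0.5, 0.5], list(mu_init)
    s0 = weights[:]
    s1 = [w * m for w, m in zip(weights, mu)]
    for t, batch in enumerate(batches, start=1):
        # E-step on the mini-batch: average responsibilities under current parameters.
        b0, b1 = [0.0, 0.0], [0.0, 0.0]
        for x in batch:
            dens = [w * math.exp(-0.5 * ((x - m) / sigma) ** 2)
                    for w, m in zip(weights, mu)]
            total = sum(dens)
            for k in range(2):
                r = dens[k] / total
                b0[k] += r / len(batch)
                b1[k] += r * x / len(batch)
        # Robbins-Monro update of the sufficient statistics.
        gamma = (t + 1) ** (-step_power)
        for k in range(2):
            s0[k] += gamma * (b0[k] - s0[k])
            s1[k] += gamma * (b1[k] - s1[k])
        # M-step; clamping s0 keeps the weights away from 0 and 1 (assumed truncation).
        s0 = [min(max(v, 1e-3), 1.0 - 1e-3) for v in s0]
        weights = [v / sum(s0) for v in s0]
        mu = [s1[k] / s0[k] for k in range(2)]
    return weights, mu


rng = random.Random(1)


def draw():
    # Hypothetical data: 70% from N(2, 1), 30% from N(-2, 1).
    return rng.gauss(2.0, 1.0) if rng.random() < 0.7 else rng.gauss(-2.0, 1.0)


batches = [[draw() for _ in range(20)] for _ in range(400)]
weights, mu = minibatch_online_em(batches, mu_init=(-1.0, 1.0))
```

The parameters are recomputed from the sufficient statistics once per batch rather than once per sample, which is the only change relative to a singleton-arrival online EM loop.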

3. Algorithmic Design Patterns

Table 1 summarizes representative OSBL algorithm templates from the literature.

Setting	OSBL Instantiation (Algorithmic Pattern)	Regret/Convergence Bound
Stochastic Bandit	Batchify any online policy; update at batch end	$R_T(\pi^b) \le b \, R_{T/b}(\pi)$
Convex/Stochastic Optimization	Batched/variable-size SGD with averaging	$O(1/N)$ in total samples $N$ (with averaging)
Neural Training	Dynamic batch selection via loss-rank sampling	Empirical $\approx 5\times$ speedup
Online EM	Mini-batch Robbins–Monro, truncation stabilization	a.s. convergence to stationary points of the empirical log-likelihood
Online-to-Batch Conversion	Black-box conversion of online to anytime batch output	$B$ 6 up to $B$ 7

Batched feedback implies action distributions (bandits) or parameter states (SGD, EM) remain fixed within a batch, with updates only upon feedback arrival. Dynamic batch selection (neural) injects additional non-uniformity, requiring explicit sampling schedules.
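One way to make the sampling schedule for dynamic batch selection explicit is to rank examples by their most recently recorded loss and let the selection probability decay geometrically with rank. The sketch below follows that idea in simplified form; the function names, the with-replacement sampling, and the exponential annealing of the pressure parameter are assumptions for illustration, and a practical implementation would refresh the ranking only periodically rather than on every call.

```python
import math
import random


def rank_selection_probs(n, pressure):
    """Selection probability for ranks 0..n-1 (rank 0 = highest loss).

    Probabilities decay geometrically so that p[0] / p[n-1] = pressure ** ((n-1)/n);
    pressure = 1 gives uniform sampling.
    """
    raw = [math.exp(-math.log(pressure) * i / n) for i in range(n)]
    z = sum(raw)
    return [r / z for r in raw]


def select_batch(losses, batch_size, pressure, rng):
    """Sample indices (with replacement), favouring examples with larger recorded loss."""
    order = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    probs = rank_selection_probs(len(losses), pressure)
    return rng.choices(order, weights=probs, k=batch_size)


def annealed_pressure(epoch, n_epochs, start=100.0, end=1.0):
    """Exponentially decay the selection pressure from `start` to `end` over training."""
    frac = epoch / max(n_epochs - 1, 1)
    return start * (end / start) ** frac


rng = random.Random(0)
losses = [0.1, 5.0, 0.2, 0.3]  # hypothetical per-example losses
batch = select_batch(losses, batch_size=2000, pressure=100.0, rng=rng)
```

With `pressure=100.0` the example at index 1 dominates the sampled batch; as the pressure is annealed toward 1 the sampler approaches uniform random batching.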

4. Batch Size Selection and Trade-Offs

OSBL induces a fundamental trade-off between computational overhead (update frequency) and statistical adaptivity (batch size):

Small batches ($b \approx 1$): Maximum adaptivity, per-sample updates; regret/convergence aligns with the fully online regime, but at maximal computational cost and feedback volume.
Large batches ( $B$ 9): Fewer policy/model updates amortize per-batch engineering, communication, or computation cost. Regret increases only as $b$ 0, so for moderate $b$ 1 (e.g., $b$ 2– $b$ 3), performance remains close to online optimal, as predicted and empirically validated (Provodin et al., 2021, Provodin et al., 2022).

Closed-form optimization of batch size $b$ 4 arises in cost-regularized objectives such as

$b$ 5

with $b$ 6 (Provodin et al., 2021).

Dynamic or growing batches (SGD): Time-varying batch sizes $n_t$ can be employed for non-i.i.d. or dependent data to suppress gradient bias and break long-range correlations (Godichon-Baggioni et al., 2022).
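To give a feel for how a growing schedule changes the update count, the sketch below compares a constant batch size with a polynomially growing one over the same number of samples. The schedule form $n_t = \lceil \texttt{base} \cdot t^{\texttt{growth}} \rceil$ and the constants are hypothetical choices for illustration, not a schedule prescribed by the cited work.

```python
import math


def batch_schedule(total_samples, base=8, growth=0.5):
    """Yield batch sizes n_t = ceil(base * t**growth) until the stream is exhausted.

    growth = 0 gives constant batches; growth > 0 gives polynomially growing ones.
    The final batch is shortened so the sizes sum exactly to `total_samples`.
    """
    t, used = 1, 0
    while used < total_samples:
        n_t = min(math.ceil(base * t ** growth), total_samples - used)
        yield n_t
        used += n_t
        t += 1


constant = list(batch_schedule(10_000, base=8, growth=0.0))
growing = list(batch_schedule(10_000, base=8, growth=0.5))
print(len(constant), len(growing))  # number of parameter updates under each schedule
```

Both schedules consume the same data, but the growing one performs far fewer parameter updates and averages over more samples per step as the stream progresses.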

Table 2: Empirical impact of batch size (bandit and supervised settings).

Task	Batch Size Growth	Empirical Finding
Bandit regret	$b$ 8 from $b$ 9 to $\pi^b$ 0	Regret $\pi^b$ 1; TS more robust
Neural train.	Dynamic $\pi^b$ 2 schedule	up to $\pi^b$ 3 speedup; same validation

5. Policy and Model Selection in OSBL

Certain algorithmic principles are broadly supported by empirical and theoretical analyses:

Randomized policies (bandits): Algorithms such as Thompson Sampling or its contextual variants (LinTS) exhibit greater robustness to batch-induced adaptivity loss compared to deterministic rules (UCB, LinUCB), especially for moderate-to-large batch sizes (Provodin et al., 2021, Provodin et al., 2022).
Importance of Averaging: Polyak–Ruppert averaging applied to batched/streaming OSBL-SGD yields the optimal $O(1/N)$ error rate in the total number of samples $N$, uniformly across batch sizing strategies (Godichon-Baggioni et al., 2022, Cutkosky, 2019).
Greedy sampling pressure: Aggressive loss-driven data selection in neural OSBL can speed convergence but relies on loss values that may be stale; annealing the selection pressure prevents divergence and preserves generalization (Loshchilov et al., 2015).
Proximal/early-stopping updates: Balancing adaptation to new data and retention of past knowledge in batch online learning is achieved via either a proximal penalty or early stopping criteria, with matching theoretical guarantees when parameters are tuned so that $\pi^b$ 5 (Iyer et al., 2018).
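The proximal variant of batch online learning can be illustrated with a scalar least-squares model, where the update has a closed form. The sketch below is an assumption-laden toy, not the estimator of the cited work: the function name `proximal_batch_update`, the squared-error loss, and the penalty weight `lam` are chosen only to show how a proximal term pulls each batch's solution toward the previous model.

```python
def proximal_batch_update(xs, ys, w_prev, lam):
    """Closed-form minimizer over scalar w of

        sum_i (y_i - w * x_i)^2 + lam * (w - w_prev)^2.

    lam = 0 refits on the new batch alone; large lam keeps w close to w_prev.
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return (sxy + lam * w_prev) / (sxx + lam)


# Two hypothetical "daily" batches whose best-fit slopes are 2.0 and 3.0.
w = 0.0
for xs, ys in [([1.0, 2.0], [2.0, 4.0]), ([1.0, 3.0], [3.0, 9.0])]:
    w = proximal_batch_update(xs, ys, w, lam=5.0)
```

After the second batch the weight lies between the previous model and the new batch's own least-squares slope, which is the balance between retention and adaptation that the penalty controls.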

6. Application-Specific OSBL Extensions

OSBL has been adapted to particular domains and extended in several dimensions:

Constrained Online Optimization: Offline-aided-online SAGA combines empirical risk minimization with queue-driven adaptation, yielding cost-delay tradeoffs superior to classical stochastic dual gradient/backpressure approaches (Chen et al., 2016).
Distributed and Parallel Implementation: Multi-block and distributed Douglas–Rachford splitting variants enable scalable OSBL for composite convex and regularized objectives, with $\pi^b$ 6 convergence in the stochastic and batch settings (Shi et al., 2013).
Mixture Modeling and EM: Truncated mini-batch online EM for exponential-family mixtures, with carefully scheduled step sizes and domain truncation, ensures stable and consistent maximum-likelihood estimation, outperforming standard EM in high-dimensional, big-data settings (Nguyen et al., 2019).

7. Practical Recommendations and Limitations

Across methods and domains, the following recommendations and limitations arise:

Batch size: Use the smallest $\pi^b$ 7 compatible with system constraints; in typical tasks, $\pi^b$ 8 in the range $\pi^b$ 9– $t$ 0 achieves $t$ 1 of fully-online performance at drastically reduced update cost. For highly non-stationary environments or dependent data, consider dynamic or growing batch sizes (Provodin et al., 2021, Godichon-Baggioni et al., 2022).
Averaging: Always employ parameter averaging in stochastic convex tasks to mitigate noise and bias, especially with large or variable batches (Godichon-Baggioni et al., 2022).
Algorithm selection: In bandits, randomized policies (TS, LinTS) are empirically superior for batched feedback. In non-convex or highly non-uniform data regimes, anneal sampling pressure and batch size to maintain stability (Loshchilov et al., 2015).
Tuning and overhead: OSBL schemes often introduce hyperparameters (batch size, selection pressure, sort/recompute frequency) requiring empirical tuning for optimum performance in each domain (Loshchilov et al., 2015).
Limitation: For highly non-uniform batch schedules, or in the presence of strong dependence or bias, closed-form convergence rates may be unavailable or may not match empirical outcomes; in such cases, rely on empirical validation and robust parameter scheduling.
Domain-specific stabilization: In mixture modeling or generalized EM, explicit truncation or reset strategies prevent divergence from empirical log-likelihood optima (Nguyen et al., 2019).

In summary, OSBL provides a modular, robust, and well-characterized framework for learning under batch-structured data and feedback constraints. Its theory and practice are underpinned by precise regret and convergence decompositions, generalizable design patterns, and empirically validated guidelines for batch size, policy/model selection, and adaptation to non-stationarity (Provodin et al., 2021, Loshchilov et al., 2015, Godichon-Baggioni et al., 2022, Provodin et al., 2022, Shi et al., 2013, Chen et al., 2016, Nguyen et al., 2019, Iyer et al., 2018, Cutkosky, 2019).