
Dynamic Batch Sizing & Automatic Sharding

Updated 20 December 2025
  • Dynamic Batch Sizing and Automatic Sharding are methods that adjust mini-batch sizes and partition datasets to minimize idle time in heterogeneous DNN training.
  • They measure per-epoch throughput to proportionally allocate workloads and assign data shards, ensuring synchronized completion across all workers.
  • Empirical results show up to a 35% reduction in training time and enhanced resilience against straggler effects in mixed GPU environments.

Dynamic Batch Sizing (DBS) and Automatic Sharding are techniques that address inefficiencies in distributed deep neural network (DNN) training, particularly in heterogeneous environments where computational and network performance varies among worker nodes. The core aim is to mitigate idle time and straggler effects that arise with conventional Synchronous Stochastic Gradient Descent (S-SGD), where fixed per-worker batch sizes and static dataset divisions cause high-performance workers to wait for the slowest at synchronization barriers. DBS dynamically re-balances each worker's mini-batch size and allocates a proportionate shard of the global dataset at every training epoch, ensuring all workers complete their tasks within comparable wall-clock durations. This procedure maximizes hardware utilization, accommodates variability in worker throughput, and maintains convergence guarantees for the training process (Ye et al., 2020).

1. Framework Structure and Objectives

DBS consists of two tightly coupled components: (1) Dynamic Batch Sizing and (2) Automatic Data Sharding.

  • Dynamic Batch Sizing: At the conclusion of each epoch, each worker reports its processing throughput, measured as the number of samples processed per unit of wall-clock time. This throughput metric guides the computation of mini-batch sizes for the next epoch, proportionally assigning larger workloads to more capable workers.
  • Automatic Data Sharding: Once the updated mini-batch sizes are determined, the global dataset is partitioned into disjoint contiguous shards, each sized to the corresponding batch. This proportional allocation ensures each worker processes fresh data that matches its assigned compute capacity. Both steps are performed at the end of every epoch, enabling the system to adapt automatically to evolving cluster performance.

The central goal is to synchronize worker completion times within each epoch, thereby eliminating cluster underutilization due to straggler effects.
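The throughput signal that drives this adaptation is simply samples processed per second of wall-clock time. A minimal sketch of that measurement is shown below; process_batch is a placeholder for the worker's actual forward/backward pass and is not part of the paper.

import time

def measure_throughput(process_batch, batch):
    """Time one mini-batch and return samples per second (illustrative sketch)."""
    start = time.perf_counter()
    process_batch(batch)                      # forward/backward over the mini-batch
    elapsed = time.perf_counter() - start
    return len(batch) / elapsed               # samples processed per unit wall-clock time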

2. Algorithmic Methodology

The DBS distributed training loop and the dataset partition adjustment are specified as follows:

Main Training Loop (Synchronous SGD)

initialize each worker i with data[0..1]  # full dataset
for epoch t = 0 to T−1:
    for each worker i in parallel:
        start timer
        compute local gradient g_i^(t) over its mini-batch of size b_i^(t)
        wall_clock_time_i = elapsed time   # execution time for b_i^(t)
        throughput_i = b_i^(t) / wall_clock_time_i
    # All-reduce aggregated gradient
    g^(t) = (1/n) * sum_i g_i^(t)
    # Parameter update
    x^(t+1) = x^(t) − γ * g^(t)
    # Dynamically adjust batch sizes and data shards for the next epoch
    [L^(t+1), K^(t+1)] = DynamicDatasetAdjust(throughput^(t), B_total)
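
A self-contained Python sketch of this loop, simulated in a single process on a toy least-squares problem, may help make the control flow concrete. The heterogeneous worker speeds, the synthetic dataset, and the simplified proportional rounding at the end are illustrative assumptions; the exact integer allocation is handled by the DynamicDatasetAdjust routine specified next.

import numpy as np

rng = np.random.default_rng(0)
n_workers, B_total, n_epochs, gamma = 4, 256, 20, 0.1
speeds = np.array([4.0, 2.0, 1.0, 0.5])           # samples per unit time (heterogeneous)

# Toy problem: minimize the mean squared error of a linear model on synthetic data.
N, d = 4096, 16
A = rng.normal(size=(N, d))
x_true = rng.normal(size=d)
y = A @ x_true + 0.01 * rng.normal(size=N)
x = np.zeros(d)

batch = np.full(n_workers, B_total // n_workers)  # start with equal per-worker batches
for epoch in range(n_epochs):
    grads, throughput = [], np.empty(n_workers)
    for i in range(n_workers):                    # runs in parallel on a real cluster
        idx = rng.choice(N, size=batch[i], replace=False)              # worker i's mini-batch
        grads.append(2 * A[idx].T @ (A[idx] @ x - y[idx]) / batch[i])  # local gradient
        wall_clock = batch[i] / speeds[i]                              # simulated execution time
        throughput[i] = batch[i] / wall_clock
    x = x - gamma * np.mean(grads, axis=0)        # all-reduce average + parameter update
    # Re-balance next epoch's batches in proportion to measured throughput.
    batch = np.maximum(1, np.floor(B_total * throughput / throughput.sum()).astype(int))

print("slowest worker's simulated epoch time:", float(max(batch / speeds)))
print("parameter error:", float(np.linalg.norm(x - x_true)))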

Batch Size and Shard Calculation (DynamicDatasetAdjust)

function DynamicDatasetAdjust(throughput[1..n], B_total)
    S = sum_j throughput[j]
    for i in 1..n:
        B_real[i] = B_total * (throughput[i] / S)
    for i in 1..n:
        B̆[i] = floor(B_real[i])
    deficit = B_total - sum_i B̆[i]
    # Distribute leftover samples to largest fractions
    create list frac_pairs = [(i, B_real[i] - B̆[i]) for i=1..n]
    sort frac_pairs descending by second
    for k = 1..deficit:
        idx = frac_pairs[k].i
        B̆[idx] += 1
    total = sum_i B̆[i]
    prefix = 0.0
    for i in 1..n:
        L[i] = prefix / total
        K[i] = (prefix + B̆[i]) / total
        prefix += B̆[i]
    return (L[1..n], K[1..n])
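
A direct Python rendering of this routine (a sketch, not the authors' code) follows; it mirrors the listing's floor-then-fractional rounding and prefix-sum shard construction, and additionally returns the integer batch sizes for convenience.

from math import floor

def dynamic_dataset_adjust(throughput, B_total):
    """Proportional integer batch sizes plus normalized shard boundaries [L_i, K_i)."""
    n = len(throughput)
    S = sum(throughput)
    B_real = [B_total * w / S for w in throughput]
    B_int = [floor(b) for b in B_real]                              # floor step
    deficit = B_total - sum(B_int)
    # Hand the leftover samples to the workers with the largest fractional parts.
    order = sorted(range(n), key=lambda i: B_real[i] - B_int[i], reverse=True)
    for i in order[:deficit]:
        B_int[i] += 1
    # Prefix sums yield contiguous, non-overlapping shards on the unit interval.
    L, K, prefix = [], [], 0
    for b in B_int:
        L.append(prefix / B_total)
        K.append((prefix + b) / B_total)
        prefix += b
    return B_int, L, K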

In this workflow, the exact integer allocation of batch sizes is achieved via a "floor-then-fractional-assignment" approach to ensure that the sum equals $B_\mathrm{total}$. Data shards are represented as non-overlapping contiguous ranges of normalized dataset indices $[L_i^{(t+1)}, K_i^{(t+1)})$.
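
For concreteness, calling the sketch above with throughputs of 300, 200, and 100 samples per second and B_total = 256 (hypothetical numbers) gives:

B, L, K = dynamic_dataset_adjust([300, 200, 100], 256)
# B == [128, 85, 43]   (the single leftover sample goes to the largest fraction)
# L == [0.0, 0.5, 0.83203125]
# K == [0.5, 0.83203125, 1.0]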

3. Key Mathematical Formulation

Let worker $i$ at epoch $t$ process $b_i^{(t)}$ samples in wall-clock time $T_i^{(t)}$. The throughput is $w_i^{(t)} = b_i^{(t)} / T_i^{(t)}$. The next epoch's batch size assignment is

$$b_i^{(t+1)} = B_\mathrm{total} \cdot \frac{w_i^{(t)}}{\sum_{j=1}^{n} w_j^{(t)}}$$

Integer batch sizes $\breve{B}_i^{(t+1)}$ are assigned by rounding down $b_i^{(t+1)}$ and distributing any leftover samples based on the largest fractional parts. Each worker $i$'s assigned dataset shard is indexed by

$$L_i^{(t+1)} = \frac{\sum_{j=1}^{i-1} \breve{B}_j^{(t+1)}}{B_\mathrm{total}}, \qquad K_i^{(t+1)} = \frac{\sum_{j=1}^{i} \breve{B}_j^{(t+1)}}{B_\mathrm{total}}$$

Subsequently, the worker processes the mini-batch samples whose normalized indices fall in $[L_i^{(t+1)}, K_i^{(t+1)})$.
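
One detail implicit in these definitions: the shard width itself encodes the batch size, since

$$\breve{B}_i^{(t+1)} = \left(K_i^{(t+1)} - L_i^{(t+1)}\right) B_\mathrm{total},$$

so a worker can recover its next batch size directly from its shard boundaries.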

4. Theoretical Convergence Properties

The paper establishes that the variable batch sizes and shard allocations of DBS do not impede the convergence of S-SGD, provided the underlying loss function $f$ is $\mu$-strongly convex and $L$-smooth and the stochastic gradients have bounded variance $\sigma^2$. For a step size $\gamma \leq 1/L$, the expected squared distance to the optimum satisfies

$$\mathbb{E}\left[\|x^{j} - x^{*}\|^{2}\right] \leq (1 - \gamma\mu)^{j}\,\|x^{0} - x^{*}\|^{2} + \frac{\gamma \sigma^{2}}{\mu}$$

This result is achieved by bounding the expected squared distance to the optimum using the variance reduction from increased batch sizes, ensuring that the noise term scales as $O(1/b)$, which is inherited from the standard S-SGD variance analysis. Consequently, DBS attains the same convergence rate as classical synchronous mini-batch SGD.
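
The bound follows from a standard unrolling argument; the per-step recursion below is the usual one for SGD under these assumptions (stated here as an assumption rather than quoted from the paper):

$$\mathbb{E}\|x^{t+1}-x^{*}\|^{2} \leq (1-\gamma\mu)\,\mathbb{E}\|x^{t}-x^{*}\|^{2} + \gamma^{2}\sigma^{2},$$

which, unrolled over $j$ steps and combined with the geometric-series bound $\sum_{k=0}^{j-1}(1-\gamma\mu)^{k} \leq 1/(\gamma\mu)$, gives exactly the bound above with noise term $\gamma^{2}\sigma^{2}/(\gamma\mu) = \gamma\sigma^{2}/\mu$.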

5. System Implementation and Optimizations

Throughput measurement relies on wall-clock timing, which can be implemented using host or device timers. Batch sizes and dataset shards are recomputed at the end of each epoch. Synchronous communication is handled via all-reduce, employing either NCCL (for CUDA environments) or MPI collectives, with overlapping of gradient computation and network communication where feasible.

Key system-level optimizations include:

  • Packing variable-size batches into fixed-size communication buffers to avoid network efficiency penalties.
  • Caching data shard boundaries to minimize data movement and support efficient, range-based data loading.
  • Adapting to stragglers or failed workers, as the next epoch's batch size naturally shrinks for underperforming nodes, yielding robust training even in the presence of node failures or transient slowdowns.

These measures collectively ensure low-overhead redistribution of dataset partitions and maintain lockstep operation across heterogeneous clusters.
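
As one concrete illustration of range-based loading, the normalized shard boundaries can be mapped to integer sample indices with a few lines of Python; the helper below and the dataset size are illustrative assumptions, not from the paper.

def shard_indices(L_i, K_i, dataset_size):
    """Map a worker's normalized shard [L_i, K_i) to integer sample indices."""
    start = int(round(L_i * dataset_size))
    stop = int(round(K_i * dataset_size))
    return range(start, stop)                 # contiguous slice: cache- and prefetch-friendly

# Example: the second worker from the earlier usage example, on a 4096-sample dataset.
idx = shard_indices(0.5, 0.83203125, 4096)    # -> range(2048, 3408), i.e. 1360 samples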

6. Empirical Results and Practical Impact

On an 8-node GPU cluster with mixed hardware (V100, P100, and K80), the DBS protocol was evaluated on ResNet-50 training over ImageNet for 90 epochs. The following table summarizes the principal comparative results:

Method                      | Total Time (min) | Relative Speedup | Max Straggler Delay
S-SGD (static B = 256)      | 200              | 1.00×            | N/A
Model Averaging             | 185              | 1.08×            | up to 100 ms
One-Shot Rebalance          | 175              | 1.14×            | up to 200 ms
DBS (Dynamic Batch Sizing)  | 150              | 1.33×            | up to 500 ms

DBS outperforms both traditional S-SGD and model averaging in total training time and is especially robust under engineered adverse conditions (e.g., random network delays of 0–500 ms or CPU throttling). In such scenarios, S-SGD experiences significant stalls and the other strategies degrade markedly, whereas DBS stays within 10% of its baseline speed and delivers a sustained 25% speedup under persistent worker slowdown. Overall, DBS achieves a 30–35% reduction in end-to-end training time and exhibits strong straggler resilience.

7. Context, Limitations, and Applicability

DBS generalizes to any synchronous data-parallel training scenario with per-epoch measurement and dynamic adjustment intervals. Its core strength lies in non-intrusive, epoch-wise adaptation: no changes to underlying SGD, no assumptions about specific model architectures, and immediate fault tolerance due to continuous throughput rebalancing. While effective in heterogeneously resourced environments, the proportional shard assignment assumes that partitioning of the dataset can be achieved efficiently and that data shuffling or stratification needs are either not present or are managed suitably at higher-level frameworks. A plausible implication is that for tasks heavily reliant on strict data order or complex sampling, additional modifications may be required. Nevertheless, for general DNN training workloads, DBS provides significant efficiency benefits with theoretically principled behavior (Ye et al., 2020).

References

  • Ye et al., 2020.
