Token-Balanced Batching

Updated 10 April 2026

Token-balanced batching is a strategy that allocates computational work based on token counts rather than sample counts, ensuring balanced workloads across devices.
It employs algorithmic approaches like elastic and memory-centric batching to optimize GPU utilization and minimize latency in LLM inference and vision model training.
Empirical results show significant throughput improvements and reduced waiting times, though it introduces scheduling overhead and latency trade-offs in certain scenarios.

Token-balanced batching is a family of batching regimes that aim to allocate computational work according to the number of tokens—rather than merely the number of requests or samples—in each batch. This approach is motivated by the need to maximize hardware utilization, stabilize learning dynamics, and control latency in both training and inference of large-scale models, particularly when serving requests or data with highly variable token counts. Token-balanced batching has found applications in distributed LLM inference (e.g., pipeline and tensor parallel settings), large-scale vision model training with variable input resolutions, and in queueing-theoretic analyses of inference workloads.

1. Core Principles and Motivations

Conventional batching strategies typically impose limits based on the number of requests per batch or the total token count. In settings where requests or samples have highly variable lengths—such as autoregressive LLM inference or training vision transformers on mixed-resolution images—this can result in suboptimal hardware utilization, large gradient imbalances, or unacceptably long tail latencies. Token-balanced batching explicitly targets the total token count per batch as the primary workload determinant. The central goals are:

To maintain balanced computational loads across devices or pipeline stages.
To smooth per-iteration or per-batch token counts, reducing "valleys" of under-utilization and eliminating peaks that risk memory exhaustion.
To decouple batch sizing from the number of items, focusing instead on cumulative token workload.
To guarantee fairness and optimization stability when items contribute vastly different token counts, e.g., mixing high- and low-resolution images or handling LLM requests with widely varying output lengths (Chaybouti et al., 23 Dec 2025, Yang et al., 2024, Guo et al., 21 Apr 2025).

2. Algorithmic Approaches and Formulations

Algorithmic realizations of token-balanced batching vary by domain but share a computational bin-packing ethos: batches are constructed to greedily (or optimally) approach, but not exceed, a specified token budget, sometimes under constraints on the maximum number of items or specific resource limits.

2.1. LLM Inference: Memory-Centric and Elastic Batching

Memory-centric batching replaces the conventional cap on requests with a cap on the incremental key-value (KV) memory that additional prefill tokens would require, ensuring efficient use of GPU memory and maximizing the size of each token batch. The batch size in tokens $B_{\text{tok}}$ is thus

$\text{maximize } B_{\text{tok}} \quad \text{subject to } \sum_{i} \text{KVsize}(i) \leq M_\text{threshold},\quad \text{chunk size} \leq S_\text{chunk}$

where $M_\text{threshold}$ is set based on available KV-cache at each iteration (Zheng et al., 2024).
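The admission rule above can be sketched as follows. This is a minimal illustration, not BatchLLM's implementation: it assumes a constant KV cost per token and a FIFO queue, and the names `form_token_batch`, `kv_budget_bytes`, and `kv_bytes_per_token` are hypothetical.

```python
from collections import deque


def form_token_batch(pending, kv_budget_bytes, kv_bytes_per_token, chunk_size):
    """Greedily admit prefill tokens until the KV budget or chunk size is reached.

    pending: deque of [request_id, remaining_prefill_tokens], mutated in place.
    Returns a list of (request_id, tokens_scheduled) for this iteration.
    Assumption: every token costs the same number of KV bytes.
    """
    batch = []
    # The token budget is the tighter of the chunk-size cap and the KV-memory cap.
    tokens_left = min(chunk_size, kv_budget_bytes // kv_bytes_per_token)
    while pending and tokens_left > 0:
        req_id, remaining = pending[0]
        take = min(remaining, tokens_left)
        batch.append((req_id, take))
        tokens_left -= take
        if take == remaining:
            pending.popleft()                 # request fully prefilled
        else:
            pending[0][1] = remaining - take  # resume in a later iteration
    return batch


queue = deque([["a", 300], ["b", 500], ["c", 200]])
batch = form_token_batch(queue, kv_budget_bytes=1_000_000,
                         kv_bytes_per_token=2_000, chunk_size=512)
print(batch, list(queue))
```

Here the KV budget (500 tokens) is tighter than the chunk size (512 tokens), so request `b` is split across iterations rather than being held back whole.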

Elastic batching (from a queueing-theoretic perspective) adapts batch formation so requests with shorter output lengths can exit earlier, rather than being forced to wait for the longest in the batch. This reduces mean waiting time and latency for the majority of requests (Yang et al., 2024). The elastic-batch service time is modeled as:

$H^{\text{elastic}} = k_1 b + k_2 + k_3 b E[N] + k_4 n_b$

where $b$ is the batch size, $E[N]$ is the average output length, $n_b$ is the maximum output length in the batch, and $k_1, \dots, k_4$ are the coefficients of the service-time model.
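To make the terms of the service-time model concrete, the sketch below simply evaluates the expression for given inputs. The function name and the coefficient values are illustrative assumptions; real coefficients would have to be fitted to measurements on a specific model and GPU.

```python
def elastic_service_time(b, mean_output_len, max_output_len, k1, k2, k3, k4):
    """Evaluate H^elastic = k1*b + k2 + k3*b*E[N] + k4*n_b.

    b: batch size, mean_output_len: E[N], max_output_len: n_b.
    k1..k4 are hypothetical coefficients chosen only for illustration.
    """
    return k1 * b + k2 + k3 * b * mean_output_len + k4 * max_output_len


h = elastic_service_time(b=4, mean_output_len=100, max_output_len=250,
                         k1=0.5, k2=2.0, k3=0.0625, k4=0.125)
print(h)  # 2.0 + 2.0 + 25.0 + 31.25 = 60.25
```

With these values the $k_3 b E[N]$ and $k_4 n_b$ terms dominate. The average output length is scaled by the batch size, while the longest request contributes only once.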

2.2. Distributed Training: Input Packing for Variable-Length Data

In multi-resolution vision training, token-balanced batching relies on packing as many images as possible into a batch, provided the aggregate token count does not exceed a fixed maximum sequence length $C_{\text{max}}$ (Chaybouti et al., 23 Dec 2025). The bin-packing process per device is:

Order/shuffle candidate images.
Fill each sequence by greedily aggregating images until adding the next would exceed $C_{\text{max}}$.
Apply block-diagonal attention masks (e.g., FlexAttention) to maintain independence across images.
Normalize losses by per-image token count, stabilizing gradients across resolutions.
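The packing and masking steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' pipeline: `pack_greedy` and `block_diagonal_mask` are hypothetical helpers, images are represented only by their token counts, and the dense boolean mask stands in for whatever sparse mask a kernel such as FlexAttention would use.

```python
def pack_greedy(token_counts, c_max):
    """Pack items, in the given order, into sequences of at most c_max tokens.

    Returns a list of sequences, each a list of item indices.
    """
    sequences, current, used = [], [], 0
    for idx, n in enumerate(token_counts):
        if n > c_max:
            raise ValueError(f"item {idx} has {n} tokens, above c_max={c_max}")
        if used + n > c_max:   # the next image would overflow: close this sequence
            sequences.append(current)
            current, used = [], 0
        current.append(idx)
        used += n
    if current:
        sequences.append(current)
    return sequences


def block_diagonal_mask(lengths):
    """mask[q][k] is True only when tokens q and k belong to the same image."""
    owner = [img for img, n in enumerate(lengths) for _ in range(n)]
    return [[a == b for b in owner] for a in owner]


counts = [2304, 576, 1024, 256, 2304]    # tokens per image (illustrative values)
print(pack_greedy(counts, c_max=4096))   # [[0, 1, 2], [3, 4]]
print(block_diagonal_mask([2, 1]))
```

Shuffling the candidates before calling `pack_greedy` corresponds to the first step of the list. The greedy rule never reorders items, so it trades some packing density for simplicity.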

3. Scheduling, Grouping, and Runtime Coordination

In complex parallel or distributed systems, token-balanced batching is integrated with advanced scheduling and grouping policies.

Global Prefix Sharing: BatchLLM constructs a radix prefix-tree over all prompts to identify prefix-sharing groups, maximizing KV-cache reuse and minimizing redundant computation (Zheng et al., 2024).
Reordering by Decode Ratio: Requests/groups with higher decoding-to-prefill ratios are prioritized so that decoding tokens can be interleaved with prefill workloads of later (longer) groups, which sustains GPU occupancy throughout batch iterations.
Independent Prefill and Decode Control Loops: gLLM decouples prefill and decode token budgets using distinct "rate controllers" driven by live system state, such as pending token queues and observed KV-cache headroom. The budgets are:
- Prefill $\#P$: Adjusted by tokens-waiting and KV-cache utilization controllers, throttled or suspended under memory pressure.
- Decode $\#D$: Kept near-uniform across pipeline stages; assigned via
$\#D = \frac{\#RD}{\#PP}$

where $\#RD$ is the number of decode tokens and $\#PP$ is the pipeline depth (Guo et al., 21 Apr 2025).

Non-blocking Scheduling and Preemptive Metadata Delivery: High-performance runtimes (e.g., gLLM) use asynchronous message passing and overlap I/O with computation to eliminate pipeline bubbles due to imbalanced batch composition.
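Two of the policies above lend themselves to a short sketch: ordering groups by decode-to-prefill ratio, and spreading decode tokens evenly over pipeline stages. The code below is an assumption-laden illustration rather than the BatchLLM or gLLM implementation. The dictionary fields, the function names, and the choice to round the per-stage budget up are all hypothetical.

```python
import math


def order_by_decode_ratio(groups):
    """Schedule groups with higher decode-to-prefill ratios first.

    Each group is a dict with positive 'prefill_tokens' and 'decode_tokens'.
    """
    return sorted(groups,
                  key=lambda g: g["decode_tokens"] / g["prefill_tokens"],
                  reverse=True)


def decode_budget_per_stage(running_decode_tokens, pipeline_depth):
    """Split in-flight decode tokens evenly across pipeline stages.

    Rounding up is an assumption made here so that no token is left unscheduled.
    """
    return math.ceil(running_decode_tokens / pipeline_depth)


groups = [
    {"name": "long_prompt", "prefill_tokens": 4000, "decode_tokens": 200},
    {"name": "chatty", "prefill_tokens": 100, "decode_tokens": 800},
    {"name": "balanced", "prefill_tokens": 1000, "decode_tokens": 1000},
]
print([g["name"] for g in order_by_decode_ratio(groups)])
print(decode_budget_per_stage(1000, 4))  # 250
```

Placing decode-heavy groups first means their decoding steps are still in flight when the prefill-heavy groups arrive, which is what allows the two kinds of work to be interleaved.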

4. Empirical Results and Quantitative Impact

Empirical studies across LLM inference and vision model training have consistently demonstrated the advantages of token-balanced batching:

LLM Inference:
- gLLM achieves throughput increases from 11% up to 398% over strong baselines such as vLLM and SGLang, with marked reductions in end-to-end latency and improved deadline attainment for real-time workloads (Guo et al., 21 Apr 2025).
- BatchLLM raises steady-state GPU occupancy from ~50–60% to >90%, dramatically increasing per-iteration token utilization and throughput by 1.3×–2.0× in practical settings (Zheng et al., 2024).
- Queuing-theoretic elastic batching yields mean waiting time reductions of 30–50% compared to static batch regimes; optimally chosen batch-size and token limits cut waiting times by an order of magnitude under realistic arrival rates (Yang et al., 2024).
Vision Model Training:
- Packing images up to a token budget (with loss normalization) prevents catastrophic forgetting of low-resolution data once high-resolution images are introduced, maintaining or improving representation accuracy for all resolutions (Chaybouti et al., 23 Dec 2025).
- Hardware throughput increases from ~7.5k tokens/s to ~20k tokens/s per GPU with token-balanced batching.

5. Analysis and Theoretical Underpinnings

Theoretical analyses of token-balanced batching draw from both queueing theory and optimization:

M/G/1 Bulk-Service Models: Token-balanced batching is analyzed using M/G/1 (and M/Dⁿ/1) bulk-service queues, where batch processing time depends jointly on batch size and the longest token sequence. Mean waiting times admit closed-form upper bounds in terms of batch parameters, guiding the optimal choice of batch size and token limits for latency-sensitive applications (Yang et al., 2024).
Worst-Case Bounds: Imposing a maximum token-length per request (clipping at the 95th percentile, for example) provides strict upper bounds on per-batch latency, shrinking tail waiting times even if a small proportion of requests are curtailed (Yang et al., 2024).
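The percentile-based cap described above can be sketched as follows. This assumes a nearest-rank percentile over an observed sample of output lengths; `percentile_cap` and `clip_lengths` are hypothetical helpers, and the sample values are made up for illustration.

```python
def percentile_cap(lengths, pct=95):
    """Nearest-rank percentile of observed output lengths (pct is an integer)."""
    ordered = sorted(lengths)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def clip_lengths(lengths, cap):
    """Curtail every request at the cap, bounding the longest output in any batch."""
    return [min(n, cap) for n in lengths]


observed = list(range(10, 210, 10))   # 10, 20, ..., 200 (illustrative sample)
cap = percentile_cap(observed, pct=95)
clipped = clip_lengths(observed, cap)
curtailed = sum(1 for a, b in zip(observed, clipped) if a != b)
print(cap, max(clipped), curtailed)   # 190 190 1
```

In this sample only one request in twenty is curtailed, yet the longest output any batch can contain drops from 200 to 190 tokens. With heavier-tailed length distributions the gap between the cap and the true maximum is correspondingly larger.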

A summary of queueing regimes and their impact:

Batching Policy	Queue Model	Service Time Dependency	Mean Waiting Time Improvement
Dynamic	M/G/1 bulk	Batch size and longest sequence in the batch	Baseline
Fixed-size	M/Dⁿ/1	Fixed batch size and longest sequence in the batch	Cuts mean waiting time at the optimal batch size
Elastic	--	Immediate completion per request	Reduces mean waiting time further by 30–50%

6. Implementation Considerations and Practicalities

Practical deployment of token-balanced batching requires attention to several factors:

Token Budget Selection: Empirical tuning of $C_{\text{max}}$ (vision) or $M_\text{threshold}$ (LLM) is necessary to balance throughput, memory usage, and latency. In vision, $C_{\text{max}}$ is typically set to just fit a 768×768 image (≈2,500–4,096 tokens).
Per-item Loss Normalization: When batch items contribute variable tokens, losses (e.g., MSE) are normalized by per-item token count to prevent gradient bias towards longer samples (Chaybouti et al., 23 Dec 2025).
Attention Masking: Batching together distinct sequences or images in vision models requires attention masking (e.g., block-diagonal FlexAttention) to prevent cross-sample gradient contamination.
Workload Skew Management: Aggressive batch packing can risk starvation or over-representation of certain item types (e.g., low-res images or short LLM completions); scheduling heuristics and group sorting by decode-to-prefill ratio alleviate this.
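The effect of per-item loss normalization can be seen in a small numeric sketch. It compares averaging within each item first against pooling all tokens. The function names and error values are hypothetical, and a real training loop would operate on tensors rather than Python lists.

```python
def per_item_normalized_loss(token_errors_per_item):
    """Mean squared error within each item first, then averaged across items."""
    per_item = [sum(e * e for e in errs) / len(errs)
                for errs in token_errors_per_item]
    return sum(per_item) / len(per_item)


def token_pooled_loss(token_errors_per_item):
    """Comparison baseline: pool all tokens, so longer items dominate."""
    flat = [e * e for errs in token_errors_per_item for e in errs]
    return sum(flat) / len(flat)


low_res = [2.0, 2.0]     # 2 tokens, per-item MSE 4.0
high_res = [1.0] * 8     # 8 tokens, per-item MSE 1.0
print(per_item_normalized_loss([low_res, high_res]))  # 2.5
print(token_pooled_loss([low_res, high_res]))         # 1.6
```

Under token pooling the poorly fitted low-resolution item is diluted by the eight high-resolution tokens. With per-item normalization both items carry equal weight regardless of length.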

7. Limitations and Trade-offs

Token-balanced batching, while highly effective, introduces several trade-offs:

Latency vs. Throughput: Maximizing token budget may delay short requests in low-load conditions; elastic/early-exit departures partially mitigate this.
Complex Scheduling Overhead: Construction of global prefix groups, per-group reordering, and memory tracking add scheduling overhead that is modest, generally sub-second even for thousands of requests (Zheng et al., 2024).
Fairness under Token Clipping: Setting token-length caps can sacrifice the experience of the small fraction of requests seeking very long outputs; selection of percentile thresholds balances user utility against system efficiency.

A plausible implication is that in multi-tenant production environments, token-balanced batching must be configured in conjunction with application-level SLOs to realize full system benefit.

References:

gLLM: (Guo et al., 21 Apr 2025) BatchLLM: (Zheng et al., 2024) AMoE: (Chaybouti et al., 23 Dec 2025) Queueing-Theoretic Analysis: (Yang et al., 2024)


References (4)

AMoE: Agglomerative Mixture-of-Experts Vision Foundation Model (2025)

A Queueing Theoretic Perspective on Low-Latency LLM Inference with Variable Token Length (2024)

gLLM: Global Balanced Pipeline Parallelism System for Distributed LLM Serving with Token Throttling (2025)

BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching (2024)
