Continuous Depth-wise Batching

Updated 6 May 2026

Continuous depth-wise batching is a dynamic inference paradigm that leverages weight-shared, recursive Transformers to batch processing across both token time and layer depth.
It employs early-exit mechanisms and FIFO scheduling to maximize accelerator utilization, yielding a 2–3× speedup in autoregressive token generation.
Empirical results on models like Gemma 2B demonstrate practical throughput gains, validating the efficiency of batching across both sequence and depth dimensions.

Continuous Depth-wise Batching (CDB) is a dynamic inference paradigm that uses the parameter-tied (recursive) structure of Transformers to pipeline sequence processing not only along the conventional time (token) dimension but also across model depth (layer iterations). Because weight sharing makes every iteration of a recursive Transformer apply the same block, CDB can batch samples that sit at different depths, which raises hardware utilization and reduces idle compute slots. Throughput for autoregressive generation improves substantially, especially when CDB is combined with early-exit mechanisms. In practice, CDB is reported to increase token-generation throughput by 2–3× relative to static, layer-distinct Transformers, subject to model configuration and early-exit dynamics (Bae et al., 2024).

1. Formal Definition and Mechanism

Continuous Depth-wise Batching operates in recursive Transformers with total depth $L$, reorganized as $B$ loop iterations of a shared block of $K=L/B$ layers, with $f_k(h;\,\Phi'_k)$ denoting the $k$-th layer of the shared block. For token $t$ and layer index $\ell$, the forward pass is:

$h_{t}^{(\ell)} = f_{((\ell-1)\bmod K)+1}\left(h_{t}^{(\ell-1)};\,\Phi'_{((\ell-1)\bmod K)+1}\right)$
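The index arithmetic in this forward pass can be illustrated with a minimal sketch. The names `shared_layer_index` and `recursive_forward` are hypothetical, and plain Python functions stand in for Transformer layers; the only thing taken from the text is the mapping from absolute layer $\ell$ to shared layer $((\ell-1)\bmod K)+1$.

```python
def shared_layer_index(layer: int, block_size: int) -> int:
    """Map a 1-based absolute layer index l to the shared layer ((l - 1) mod K) + 1."""
    if layer < 1 or block_size < 1:
        raise ValueError("layer and block_size must be >= 1")
    return (layer - 1) % block_size + 1


def recursive_forward(h, shared_layers, total_depth):
    """Apply the K shared layers in a loop until total_depth layers have run."""
    block_size = len(shared_layers)  # K
    if total_depth % block_size != 0:
        raise ValueError("total_depth must be a multiple of the block size")
    for layer in range(1, total_depth + 1):
        k = shared_layer_index(layer, block_size)
        h = shared_layers[k - 1](h)  # reuse the same parameters on every loop
    return h


# Toy stand-ins for a shared block of K = 3 layers (illustrative only).
toy_layers = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]

# With L = 6 and K = 3 (so B = 2), layers 4..6 reuse shared layers 1..3.
index_trace = [shared_layer_index(l, 3) for l in range(1, 7)]
output = recursive_forward(0, toy_layers, total_depth=6)
print(index_trace)  # [1, 2, 3, 1, 2, 3]
print(output)       # -3
```

Because every loop iteration calls the same functions, samples at different loop iterations can share one batched call, which is the property CDB relies on.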

CDB maintains, during runtime, up to $N_{\max}$ active samples at each block-iteration stage $j=1,\ldots,B$. At each compute step, the up to $N_{\max}$ samples waiting at stage $j$ are processed in parallel by the shared block. Freed slots are immediately filled either with samples advancing from the preceding block iteration or with new requests entering at the initial block stage. This batched scheduling occurs in both the time (token) and depth (block iteration) dimensions and maximizes accelerator utilization for each block function (Bae et al., 2024).

2. Scheduling: Pseudocode and Operational Overview

A high-level scheduling procedure for CDB paired with early exiting uses $B$ separate FIFO queues, one per block iteration. At each scheduler tick, a batch of up to $N_{\max}$ samples is formed for each queue, the shared block is applied, and each sample either exits (if the early-exit criterion is satisfied) or proceeds to the next block iteration. New requests enter at the first block iteration whenever there is available capacity.

Key features include up to $B$ simultaneous batches (one per block iteration), flexible early exiting, and backfilling to maintain throughput. Batching is conducted across samples that invoke the same block parameters, which is feasible only in weight-sharing architectures (Bae et al., 2024).
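The scheduling loop described in this section can be sketched as a small simulation. This is an illustrative toy under stated assumptions, not the authors' implementation: `cdb_schedule`, `apply_block`, and `should_exit` are hypothetical names, each request stands for a single token step, and KV caches and re-admission of the next token are omitted.

```python
from collections import deque


def cdb_schedule(requests, num_blocks, n_max, apply_block, should_exit):
    """Toy CDB loop. Returns ({request_id: (state, blocks_used)}, tick_count)."""
    pending = deque(requests)                      # (request_id, state) awaiting admission
    queues = [deque() for _ in range(num_blocks)]  # one FIFO queue per block iteration
    finished = {}
    ticks = 0
    while pending or any(queues):
        ticks += 1
        # Backfill: admit new requests at the first stage while it has capacity.
        while pending and len(queues[0]) < n_max:
            queues[0].append(pending.popleft())
        # Visit the deepest stage first so a sample advances one stage per tick.
        for j in reversed(range(num_blocks)):
            batch = [queues[j].popleft() for _ in range(min(n_max, len(queues[j])))]
            for request_id, state in batch:
                state = apply_block(state)         # same shared weights at every stage
                is_last = j == num_blocks - 1
                if is_last or should_exit(state):
                    finished[request_id] = (state, j + 1)
                else:
                    queues[j + 1].append((request_id, state))
    return finished, ticks


# Five single-step requests, B = 2 block iterations, N_max = 2, no early exit.
requests = [(i, 0) for i in range(5)]
finished, ticks = cdb_schedule(
    requests,
    num_blocks=2,
    n_max=2,
    apply_block=lambda s: s + 1,    # stand-in for the shared block
    should_exit=lambda s: False,    # stand-in for the early-exit criterion
)
print(ticks)  # 4
```

In this toy run, both stages are busy during ticks 2 and 3: while two samples are in their second block iteration, two newer ones occupy the first. If `should_exit` always returns `True`, every sample leaves after one block iteration and the same workload finishes in 3 ticks.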

3. Theoretical and Empirical Throughput Gains

CDB’s effectiveness is quantified against two baselines: static synchronous batching and continuous sequence-wise batching (CSB):

Static batching: All $N_{\max}$ slots process in lock-step with throughput […].
CSB: Batching across token sequences at the same block depth, empirically yielding a speedup of about $1.41\times$ (e.g., Gemma 2B: 1080 tok/s → 1528 tok/s).
CDB: Enables a $B$-fold depth-wise pipeline; for […] and […], this yields a theoretical […] speedup.

With early exit, if the mean exit depth is […], the depth-wise pipeline is shortened, and the effective speedup approaches […]. For example, with Gemma 2B and […], the observed throughput was […] tok/s, corresponding to a […] speedup (Bae et al., 2024).
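The speedup figures in this section are ratios of measured throughputs. As a small worked check, using only the Gemma 2B numbers quoted for CSB (the helper name `speedup` is hypothetical):

```python
def speedup(baseline_tok_per_s: float, new_tok_per_s: float) -> float:
    """Return the throughput ratio new / baseline."""
    if baseline_tok_per_s <= 0:
        raise ValueError("baseline throughput must be positive")
    return new_tok_per_s / baseline_tok_per_s


# Gemma 2B figures quoted above: 1080 tok/s rising to 1528 tok/s under CSB.
csb_speedup = speedup(1080.0, 1528.0)
print(f"CSB speedup: {csb_speedup:.2f}x")  # CSB speedup: 1.41x
```

The same ratio applies to any CDB measurement once a throughput and its baseline are known.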

4. Experimental Configurations and Results

Key configuration details include:

Models evaluated: Gemma 2B (18 layers, 2 blocks of 9), TinyLlama 1.1B (22 layers, 2 blocks of 11), Pythia 1B (16 layers, 2 blocks of 8).
Batch size: […].
Early-exit criterion: Confidence score (e.g., max-token log-probability) checked after each block iteration; oracle simulations provided idealized throughput.
Hardware profiling: V100/A100 GPU, per-block timing denoted by […].
Token-generation throughput (Gemma 2B, SlimPajama/RedPajama/PG19):
- Stat
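The confidence-based early-exit criterion listed above can be sketched as follows. This is a minimal illustration that assumes the check compares the maximum token log-probability against a fixed threshold; the function names and the 0.9 probability threshold are hypothetical and are not taken from the source.

```python
import math


def max_token_log_prob(logits):
    """Return the largest log-softmax value over the vocabulary logits."""
    peak = max(logits)
    # Log-sum-exp with the peak subtracted for numerical stability.
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return peak - log_norm


def should_exit(logits, log_prob_threshold):
    """Exit after this block iteration if the top token is confident enough."""
    return max_token_log_prob(logits) >= log_prob_threshold


threshold = math.log(0.9)  # hypothetical: exit when the top token has p >= 0.9
confident = should_exit([8.0, 0.5, 0.1], threshold)  # sharply peaked distribution
uncertain = should_exit([1.0, 0.9, 0.8], threshold)  # nearly flat distribution
print(confident, uncertain)  # True False
```

A check of this kind would run on the intermediate prediction after each block iteration; samples that pass leave the pipeline and free a slot for backfilling.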


References

Bae et al. (2024). Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA.
