FILO Micro Batch Schedule in Distributed Training
- FILO micro batch scheduling executes forward passes in arrival (FIFO) order and backward passes in the reverse order, improving memory efficiency and throughput.
- The two-fold FILO strategy groups micro batches to balance activation memory and overlap communication with computation, effectively reducing pipeline idle time.
- Integration with attention parallel partition and chunked MLP drives significant throughput improvements and scalability for long-sequence models in distributed training.
A First-In-Last-Out (FILO) micro batch schedule, in the context of large-scale distributed neural network training (especially of long-sequence Transformers), is a policy in which the backward passes over micro batches are executed in the reverse of their forward order. HelixPipe ("HelixPipe: Efficient Distributed Training of Long Sequence Transformers with Attention Parallel Pipeline Parallelism" (2507.00394)) uses this policy to optimize memory usage, overlap communication with computation, and reduce pipeline bubbles for long-sequence models. FILO ordering also has precedent in the literature on batching online arrivals, where latency must be traded against economies of scale ("Dynamic Batching of Online Arrivals to Leverage Economies of Scale" (2309.16911)), although batch processing in those settings is typically agnostic to intra-batch order.
1. Fundamentals of the FILO Micro Batch Schedule
The core of the FILO micro batch schedule is that micro batches enter the pipeline in arrival order (the "first-in" phase) and their forward passes are executed in that order. Backward passes are then performed in the exact reverse order (the "last-out" phase), giving a LIFO, stack-like arrangement at the micro batch level.
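For intuition, the ordering itself can be stated in a few lines of plain Python. This is a minimal sketch of the ordering only, not of any pipeline machinery, and the function name is chosen here for illustration:

```python
# Minimal sketch of FILO ordering at the micro batch level (illustrative only).
# Forward passes run in arrival (FIFO) order; backward passes pop the same
# micro batches in reverse (LIFO) order.

def filo_schedule(num_micro_batches: int):
    """Yield (phase, micro_batch_id) pairs in FILO order."""
    stack = []
    for mb in range(num_micro_batches):   # "first in": forwards in arrival order
        stack.append(mb)
        yield ("forward", mb)
    while stack:                          # "last out": backwards in reverse order
        yield ("backward", stack.pop())

if __name__ == "__main__":
    for phase, mb in filo_schedule(4):
        print(phase, mb)   # forward 0..3, then backward 3..0
```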
HelixPipe implements not only a FILO discipline but a two-fold FILO schedule, wherein at each scheduling step, two micro batches are handled together (“fold”), rather than one at a time. This design serves two critical purposes:
- Balanced Activation Memory: Each pipeline stage only needs to retain activations for the same number of outstanding micro batches at any moment, ensuring consistent, minimized memory requirements across stages.
- Overlapping Communication and Computation: The absence of data dependencies between micro batches allows communication for one batch to be overlapped with computation for another, thus hiding network delays and elevating throughput.
In conventional pipeline parallelism (e.g., 1F1B, "one forward, one backward"), micro batches make sequential progress through forward and backward phases, resulting in memory spikes and pipeline "bubbles." The FILO micro batch schedule, particularly in its two-fold variant, ameliorates both effects by inverting the backward order and interleaving micro batch computation and communication, as sketched below.
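A toy scheduling sketch may help make the two-fold idea concrete. The pairing of micro batches into folds and the within-fold ordering below are illustrative assumptions, not the actual HelixPipe scheduler; the point is only that the two micro batches in a fold have no data dependency, so one can communicate while the other computes.

```python
# Toy two-fold FILO schedule (illustrative; not the HelixPipe scheduler).
# Each scheduling step handles a fold of two independent micro batches, so the
# communication of one can be overlapped with the computation of the other.

def two_fold_filo(num_micro_batches: int):
    """Return (phase, fold) entries; forwards in order, backwards reversed."""
    assert num_micro_batches % 2 == 0, "two-fold schedule assumes an even count"
    folds = [(i, i + 1) for i in range(0, num_micro_batches, 2)]
    plan = [("forward", fold) for fold in folds]              # first in: arrival order
    plan += [("backward", fold) for fold in reversed(folds)]  # last out: reverse order
    return plan

if __name__ == "__main__":
    for phase, (a, b) in two_fold_filo(8):
        # e.g. compute micro batch `a` while micro batch `b`'s data is in flight
        print(f"{phase}: fold ({a}, {b})")
```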
2. Attention Parallel Partition and Its Interplay with FILO Scheduling
Pipeline parallelism assigns model layers, or segments, to each computational node. For long-sequence Transformers, the quadratic complexity of self-attention makes simple sequential pipelining inefficient, as attention computation disproportionately inflates pipeline bubbles.
HelixPipe introduces attention parallel partition, splitting each transformer layer into three parts (“pre-attention,” “attention,” and “post-attention”). The scheduling ensures the attention computations for different micro batches are parallelized across pipeline stages. Specifically:
- Each micro batch’s pre/post-attention computation is assigned in sequence, but its attention computation is routed such that at any time, different pipeline stages are concurrently processing the attention of different micro batches.
By combining this with the FILO micro batch schedule, HelixPipe removes attention computation from the critical path of pipeline bubbles. Since backward passes in FILO are strictly in reverse, and attention for distinct micro batches is decoupled across stages, communication and expensive attention computation are overlapped as much as possible.
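A structural sketch of such a split is given below in PyTorch. The class names, the pre-norm layout, and the exact split boundaries are assumptions chosen for illustration; HelixPipe's own implementation is not reproduced here.

```python
# Illustrative split of one transformer layer into pre-attention / attention /
# post-attention segments (names and layout are assumptions, not HelixPipe code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class PreAttention(nn.Module):
    """LayerNorm + QKV projection: runs on the micro batch's 'home' stage."""
    def __init__(self, hidden: int):
        super().__init__()
        self.norm = nn.LayerNorm(hidden)
        self.qkv = nn.Linear(hidden, 3 * hidden)

    def forward(self, x):
        q, k, v = self.qkv(self.norm(x)).chunk(3, dim=-1)
        return q, k, v

class AttentionSegment(nn.Module):
    """The quadratic-cost part; different micro batches' attention can be
    routed to different pipeline stages so it runs concurrently."""
    def forward(self, q, k, v):
        return F.scaled_dot_product_attention(q, k, v)  # memory-efficient kernel

class PostAttention(nn.Module):
    """Output projection + MLP: again on the 'home' stage."""
    def __init__(self, hidden: int):
        super().__init__()
        self.proj = nn.Linear(hidden, hidden)
        self.norm = nn.LayerNorm(hidden)
        self.mlp = nn.Sequential(nn.Linear(hidden, 4 * hidden), nn.GELU(),
                                 nn.Linear(4 * hidden, hidden))

    def forward(self, x, attn_out):
        x = x + self.proj(attn_out)
        return x + self.mlp(self.norm(x))
```

In this sketch the attention segment carries no parameters, which is one plausible reason a split at these boundaries makes it easy to route the attention of different micro batches to different stages.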
3. Memory Optimization via Recomputation and Chunked MLP
Training with long sequences and large batch sizes frequently runs into GPU memory bottlenecks. FILO scheduling, when combined with selective recomputation and chunked processing, provides a powerful mitigation.
- Recomputation Without Attention: For the backward phase, only pre-attention and post-attention activations are recomputed (attention itself is handled via persistent storage or memory-efficient attention algorithms such as FlashAttention). This reduces per-layer activation memory from a typical $16bsh$ to $4bsh$.
- Chunked MLP: The backward path for MLP layers is subdivided ("chunked") into many small segments of configurable size, processed serially. Buffer reuse and careful preallocation prevent fragmentation, even at very long sequence lengths or under the two-fold FILO schedule; a sketch of both techniques follows this list.
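The sketch below illustrates both ideas with stock PyTorch utilities: pre-attention and post-attention work is recomputed via `torch.utils.checkpoint`, attention uses a memory-efficient kernel and is not recomputed, and the MLP is processed in sequence chunks. The module layout, chunking along the sequence dimension, and all names are assumptions; HelixPipe's buffer preallocation logic is not reproduced.

```python
# Sketch of selective recomputation + chunked MLP using stock PyTorch
# (illustrative assumptions only; not HelixPipe's kernels or buffer manager).
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.checkpoint import checkpoint

class MemoryLeanLayer(nn.Module):
    def __init__(self, hidden: int, mlp_chunks: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(hidden)
        self.qkv = nn.Linear(hidden, 3 * hidden)
        self.proj = nn.Linear(hidden, hidden)
        self.norm2 = nn.LayerNorm(hidden)
        self.fc1 = nn.Linear(hidden, 4 * hidden)
        self.fc2 = nn.Linear(4 * hidden, hidden)
        self.mlp_chunks = mlp_chunks

    def _pre_attention(self, x):
        # Recomputed in backward (checkpointed): cheap norm/linear work only.
        return self.qkv(self.norm1(x)).chunk(3, dim=-1)

    def _mlp_chunk(self, x_chunk):
        # Recomputed per chunk; only one chunk's 4*hidden intermediate is alive
        # at a time, which bounds MLP activation memory.
        return self.fc2(F.gelu(self.fc1(x_chunk)))

    def forward(self, x):
        q, k, v = checkpoint(self._pre_attention, x, use_reentrant=False)
        # Attention itself is NOT recomputed: a memory-efficient kernel is used.
        attn = F.scaled_dot_product_attention(q, k, v)
        x = x + self.proj(attn)
        h = self.norm2(x)
        # Chunk along the sequence dimension and checkpoint each chunk.
        out = [checkpoint(self._mlp_chunk, c, use_reentrant=False)
               for c in h.chunk(self.mlp_chunks, dim=1)]
        return x + torch.cat(out, dim=1)

if __name__ == "__main__":
    layer = MemoryLeanLayer(hidden=256)
    y = layer(torch.randn(1, 1024, 256, requires_grad=True))
    y.sum().backward()
```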
Memory requirements are formally analyzed in the paper, with evidence that peak usage and fragmentation are minimized:

$$
\begin{array}{lcc}
\text{Pipeline} & \text{Pipeline bubble time} & \text{Activation memory} \\
\hline
\text{1F1B} & 3(p-1)(t_{pre}+t_{attn}+t_{post})L/p & 16(p-i)bshL/p \\
\text{ZB1P} & (p-1)(t_{pre}+3t_{attn}+t_{post})L/p & 16bshL \\
\text{HelixPipe} & 8(p-1)(t_{pre}+t_{post}) & 4bshmL/p \\
\end{array}
$$

where $p$ is the number of pipeline stages, $i$ the stage index, $L$ the number of layers, $b$ the batch size, $s$ the sequence length, $h$ the hidden size, $m$ the micro batch fold count, and $t_{pre}$, $t_{attn}$, $t_{post}$ the per-segment computation times.
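For a rough sense of scale, the activation-memory expressions above can be evaluated directly. The parameter values below are hypothetical (they are not measurements from the paper) and the results count activation elements rather than bytes:

```python
# Illustrative evaluation of the activation-memory expressions above
# (hypothetical values; counts are in activation elements, ignoring dtype size).
p, L, b, s, h, m = 8, 32, 1, 128_000, 4096, 2  # stages, layers, batch, seq, hidden, fold

def mem_1f1b(stage_i: int) -> float:
    # 16 * (p - i) * b * s * h * L / p : stage-dependent, largest at stage 0
    return 16 * (p - stage_i) * b * s * h * L / p

mem_helix = 4 * b * s * h * m * L / p          # 4 * b * s * h * m * L / p : balanced

print("1F1B stage 0  :", mem_1f1b(0) / 1e9, "G elements")
print("1F1B stage 7  :", mem_1f1b(7) / 1e9, "G elements")
print("HelixPipe (all):", mem_helix / 1e9, "G elements")
```

The imbalance across 1F1B stages (heaviest at stage 0) and the uniform, smaller HelixPipe figure follow directly from the two expressions.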
4. Performance and Scalability Outcomes
The FILO micro batch schedule, especially in its two-fold realization in HelixPipe, demonstrates substantial empirical improvements:
- Throughput and Bubble Reduction: By parallelizing attention and adopting two-fold FILO scheduling, HelixPipe removes attention from the pipeline bubble, significantly improving throughput for long-sequence models. For example, training a 7B parameter model with 128k sequence length on 64 H20 GPUs resulted in a 26% speedup over the best baseline.
- Memory Footprint: Peak activation memory is both minimized and balanced across pipeline stages, enabling successful training of previously infeasible long-sequence configurations (e.g., 128k tokens).
- Generalization Across Models: Gains were consistent for varying model sizes (1.3B, 3B, 7B) and across clusters (H20, A800). The approach remains effective in combination with other parallelisms (e.g., Megatron-LM’s sequence parallelism).
Supporting figures (Figure 6 and Figure 9 in the paper) show normalized throughput and memory consumption as a function of sequence length and pipeline depth.
5. Implementation and Software Integration
HelixPipe’s FILO micro batch scheduling is implemented as 3,400 lines of additional code atop Megatron-LM. The two-fold FILO strategy is tightly integrated with attention parallel partition, custom buffer management, and chunked MLP logic. Public code is released at https://github.com/code-tunnel/Megatron-LM/tree/dev, facilitating reproduction and benchmarking.
Deployment involves:
- Configuring pipelines with a micro batch size of 1 and a global batch size equal to twice the number of pipeline stages (see the sanity-check sketch after this list).
- Custom buffer preallocation and management to prevent fragmentation at large scale.
- Combining with various parallelism approaches, with tested compatibility for long sequences.
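As a small sanity check of the batching rule above (illustrative Python only; this is not HelixPipe's launcher or CLI):

```python
# Illustrative check of the deployment rule above (not HelixPipe's CLI):
# micro batch size of 1, global batch size equal to twice the pipeline depth.
def check_helixpipe_batching(micro_batch_size: int,
                             global_batch_size: int,
                             pipeline_stages: int) -> None:
    assert micro_batch_size == 1, "two-fold FILO schedule assumes micro batch size 1"
    assert global_batch_size == 2 * pipeline_stages, \
        "global batch size should be twice the number of pipeline stages"

check_helixpipe_batching(micro_batch_size=1, global_batch_size=16, pipeline_stages=8)
```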
A plausible implication is that FILO micro batch scheduling, as operationalized in HelixPipe, can be retrofitted to contemporary frameworks focused on long-sequence model training, provided attention computation is separable and communication/computation overlapping is possible.
6. Context and Divergence from General Batching Strategies
The focus of the FILO micro batch schedule in distributed DNN training differs notably from classic online batching problems as found in “Dynamic Batching of Online Arrivals to Leverage Economies of Scale” (2309.16911). In traditional settings, FILO refers to stack-like policies at the item service level, but in batch-oriented large-scale model training:
- Micro batches are treated in FILO order for the backward phase, but each micro batch is processed atomically, with cost and latency dominated by batch-level decisions.
- The economic rationale in classic work derives from the concavity of the batch processing cost function and the need to minimize a linear combination of waiting time and processing cost. In distributed deep learning, the motivation is primarily memory balancing and pipeline utilization, not cost per item or latency minimization.
- There is no intra-batch service ordering (FILO or FIFO within a batch) in either setting. The innovation resides entirely in macro scheduling.
In summary, the First-In-Last-Out micro batch schedule is a modern pipelining and memory optimization strategy for distributed neural network training. Its principal significance lies in balancing activation memory requirements, reducing pipeline bubbles, and maximizing overlap of network communication with computation—not in controlling service order of individual tasks. The empirical improvements documented in large-scale deployments for long-sequence transformer models substantiate its effectiveness as a core scheduling component in current distributed training frameworks.