360-LLaMA-Factory: Plug & Play Sequence Parallelism for Long Post-Training (2505.22296v1)

Published 28 May 2025 in cs.CL and cs.LG

Abstract: Adding sequence parallelism into LLaMA-Factory, we open-sourced 360-LLaMA-Factory at https://github.com/Qihoo360/360-LLaMA-Factory. 360-LLaMA-Factory has received wide recognition and has been used in models such as Light-R1 (arXiv:2503.10460), TinyR1 (arXiv:2503.04872), and Kaggle AIMO math models, as well as in large companies' training frameworks. This technical report delves deeper into the different sequence parallel modes behind 360-LLaMA-Factory and discusses our implementation insights.

This technical report describes the integration of sequence parallelism techniques, specifically DeepSpeed-Ulysses [jacobs2023deepspeed] and Ring-Attention [liu2023ring], into the LLaMA-Factory framework [zheng2024llamafactory] to enable training LLMs on extremely long sequences. The authors open-sourced their implementation as 360-LLaMA-Factory (Zou et al., 28 May 2025), aiming to provide a plug-and-play solution for long-context post-training such as Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO).

Core Problem: As LLMs process increasingly long contexts (up to millions of tokens), existing fine-tuning frameworks often lack efficient mechanisms to handle such lengths due to GPU memory constraints. Sequence parallelism partitions the input sequence across multiple devices, allowing models to train on longer sequences than would fit on a single GPU or with other parallelism methods alone. While sequence parallelism methods like Ulysses and Ring-Attention exist, their practical implementation in user-friendly frameworks and detailed comparison remain underexplored.

Proposed Solution: The authors integrated DeepSpeed-Ulysses and Ring-Attention into LLaMA-Factory, making it compatible with existing features like LoRA [hu2022lora] and neat-packing [kundu2024enhancing]. A key contribution is the introduction of Dummy-Head Ulysses, a simple and efficient extension to DeepSpeed-Ulysses that addresses its limitation requiring the number of attention heads to be divisible by the sequence parallel size.

Implementation Details:

  1. Sequence Parallel Initialization:
    • GPUs are grouped based on the specified sequence parallel size (sp).
    • The default attention function in the model is replaced (monkey-patched) with the Ring-Attention or DeepSpeed-Ulysses implementation before the model is loaded.
  2. Sequentially Parallel Data Processing:
    • Input sequences are padded to a length divisible by 8 × sp (and also close to the maximum cutoff_len).
    • Padded data is partitioned across GPUs in a sequence parallel group.
    • DeepSpeed-Ulysses uses a simple sequential split.
    • Ring-Attention employs a zigzag split for better load balancing (a sketch of this split appears after this list).
    • For compatibility with neat-packing in DeepSpeed-Ulysses, the attention_mask is not split but copied across all GPUs in the group.
  3. Correct Loss Calculation:
    • In sequence parallelism, each GPU only computes a partial loss based on its segment of the sequence.
    • The final loss requires aggregating these partial losses using an all_reduce operation across the sequence parallel group.
    • Crucially, the authors highlight the need to use torch.distributed.nn.all_reduce instead of torch.distributed.all_reduce. The former correctly handles gradient backpropagation through the communication operation, while the latter does not, leading to incorrect gradient scaling (a factor equal to the sequence parallel size) as demonstrated in their experiments.
    • For the DPO loss, which applies a sigmoid to log-probability ratios, a direct all_reduce on the final loss is incorrect. Instead, they perform all_reduce on the individual log-probability terms ($\log \frac{\pi_\theta(y_w|x)}{\pi_{ref}(y_w|x)}$ and $\log \frac{\pi_\theta(y_l|x)}{\pi_{ref}(y_l|x)}$) across GPUs before computing the final DPO loss from the aggregated terms (see the loss-aggregation sketch after this list).
  4. Handling Position IDs:
    • For models using Rotary Position Embedding (RoPE) [su2024roformer], simply splitting the sequence and letting the model generate default position IDs ([0, 1, ..., seq_len/sp - 1]) on each GPU is wrong.
    • The position_ids must reflect the original positions in the full sequence.
    • Therefore, position_ids need to be pre-calculated for the full sequence, partitioned according to the sequence split, and explicitly passed to the model's forward pass on each GPU (a sketch appears after this list).
  5. Dummy-Head Ulysses Implementation:
    • Problem: Standard DeepSpeed-Ulysses requires the number of attention heads (hs) to be divisible by the sequence parallel size (sp).
    • Solution: If hs % sp != 0, they pad the head dimension with zero-valued "dummy heads" to make the total number of heads (hs_new) divisible by sp.
    • The input tensor (query, key, value) is padded along the head dimension before the all-to-all operation.
    • The extra padded heads are removed during the backward pass to ensure correctness.
    • Pseudocode is provided for pad_heads and unpad_heads (a sketch in the same spirit appears after this list).
    • This method is shown to be more memory-efficient and computationally less expensive than Xtuner's approach [2023xtuner], which achieves divisibility by partitioning the hidden dimension and using internal all-gather operations, effectively recalculating the hidden dimension multiple times.
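
The zigzag split mentioned in step 2 can be illustrated with a short sketch. This is a minimal illustration of the general zigzag idea (each rank keeps one chunk from the front of the sequence and the mirrored chunk from the back, balancing causal-attention work across ranks) under our own naming; the actual helper in 360-LLaMA-Factory may look different.

```python
import torch

def zigzag_split(input_ids: torch.Tensor, sp_size: int, rank: int) -> torch.Tensor:
    """Zigzag-partition a (batch, seq_len) tensor along the sequence dimension.

    The sequence is cut into 2 * sp_size chunks; rank i keeps chunk i and
    chunk (2 * sp_size - 1 - i), so cheap (early) and expensive (late)
    positions under causal attention are balanced across ranks.
    Assumes seq_len is divisible by 2 * sp_size (guaranteed by the padding step).
    """
    chunks = input_ids.chunk(2 * sp_size, dim=1)
    return torch.cat([chunks[rank], chunks[2 * sp_size - 1 - rank]], dim=1)

# Example: sequence of 16 tokens, sp_size = 4 -> rank 0 gets chunks 0 and 7.
ids = torch.arange(16).unsqueeze(0)
print(zigzag_split(ids, sp_size=4, rank=0))  # tensor([[ 0,  1, 14, 15]])
```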
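
The loss handling in step 3 can be sketched as follows. Function and variable names here are ours and the code is a simplified illustration (it assumes labels are already aligned with the local logits slice), but it shows the key point: torch.distributed.nn.all_reduce is autograd-aware, so gradients flow back through the communication, whereas plain torch.distributed.all_reduce would leave them incorrectly scaled by the sequence parallel size.

```python
import torch
import torch.distributed as dist
import torch.distributed.nn  # autograd-aware collectives

def sp_sft_loss(logits, labels, sp_group):
    """Aggregate token-level cross-entropy across a sequence parallel group.

    Each rank holds only its slice of the sequence, so it sums its local
    token losses and counts its local (non-ignored) tokens; both quantities
    are all-reduced with the autograd-aware collective before dividing.
    """
    local_loss = torch.nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), labels.reshape(-1),
        ignore_index=-100, reduction="sum",
    )
    local_count = (labels != -100).sum().float()
    # torch.distributed.nn.all_reduce keeps the op in the autograd graph;
    # plain dist.all_reduce would not backpropagate through the communication.
    total_loss = torch.distributed.nn.all_reduce(local_loss, op=dist.ReduceOp.SUM, group=sp_group)
    total_count = torch.distributed.nn.all_reduce(local_count, op=dist.ReduceOp.SUM, group=sp_group)
    return total_loss / total_count

def sp_dpo_loss(chosen_logratio, rejected_logratio, sp_group, beta=0.1):
    """DPO loss under sequence parallelism: aggregate the per-rank partial
    log-probability ratios *before* the sigmoid, not the final loss."""
    chosen = torch.distributed.nn.all_reduce(chosen_logratio, op=dist.ReduceOp.SUM, group=sp_group)
    rejected = torch.distributed.nn.all_reduce(rejected_logratio, op=dist.ReduceOp.SUM, group=sp_group)
    return -torch.nn.functional.logsigmoid(beta * (chosen - rejected)).mean()
```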
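
The position-ID handling in step 4 amounts to slicing globally computed positions rather than letting each rank start from zero. A minimal sketch for the plain sequential split follows (names are illustrative; for the zigzag split, the position IDs would be sliced with the same zigzag pattern as the tokens).

```python
import torch

def shard_position_ids(seq_len: int, sp_size: int, rank: int) -> torch.Tensor:
    """Position IDs for a plain sequential split: rank r of sp_size ranks gets
    positions [r * shard, ..., (r + 1) * shard - 1] of the full sequence,
    rather than the default [0, ..., shard - 1] the model would generate."""
    shard = seq_len // sp_size
    full_position_ids = torch.arange(seq_len)  # positions in the *full* sequence
    return full_position_ids[rank * shard:(rank + 1) * shard].unsqueeze(0)

# e.g. seq_len=16, sp_size=4, rank=1 -> tensor([[4, 5, 6, 7]])
# local_out = model(input_ids=local_ids, position_ids=shard_position_ids(16, 4, 1))
```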
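
For step 5, the pad_heads / unpad_heads pseudocode can be approximated by the sketch below. It assumes a (batch, seq_len, num_heads, head_dim) layout and is our simplified reading, not the repository's exact code: zero-valued heads are appended so the head count divides the sequence parallel size, then sliced off again after the attention computation, so the dummy heads receive no gradient.

```python
import torch
import torch.nn.functional as F

def pad_heads(x: torch.Tensor, sp_size: int) -> torch.Tensor:
    """x: (batch, seq_len, num_heads, head_dim). Append zero-valued dummy
    heads so that the head dimension is divisible by sp_size."""
    num_heads = x.size(2)
    remainder = num_heads % sp_size
    if remainder == 0:
        return x
    num_dummy = sp_size - remainder
    # F.pad pads trailing dims first: (head_dim_lo, head_dim_hi, heads_lo, heads_hi)
    return F.pad(x, (0, 0, 0, num_dummy))

def unpad_heads(x: torch.Tensor, original_num_heads: int) -> torch.Tensor:
    """Drop the dummy heads after the (reverse) all-to-all; because the output
    only uses the original heads, no gradient reaches the zero padding."""
    return x[:, :, :original_num_heads, :]

# e.g. 14 heads with sp_size = 4 -> padded to 16, two zero heads appended.
q = torch.randn(1, 128, 14, 64)
q_padded = pad_heads(q, sp_size=4)   # shape (1, 128, 16, 64)
q_back = unpad_heads(q_padded, 14)   # shape (1, 128, 14, 64)
```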

Communication and Time Complexity:

The paper provides a theoretical comparison (Table 1) of communication and time complexities for different sequence parallel methods:

  • DeepSpeed-Ulysses: $O(\frac{8}{N} \times bs \times seq\_len \times d)$ communication; $O(bs \times seq\_len^2 \times \frac{d}{N})$ time.
  • Ring-Attention: $O(4 \times bs \times seq\_len \times d)$ communication; $O(bs \times seq\_len^2 \times \frac{d}{N})$ time. Communication is generally higher than Ulysses for $N > 2$.
  • Dummy-Head-Ulysses: $O(\frac{hs_{new}}{hs} \times \frac{8}{N} \times bs \times seq\_len \times d)$ communication; $O(\frac{hs_{new}}{hs} \times bs \times seq\_len^2 \times \frac{d}{N})$ time. Overhead is minimal when only a few dummy heads are added.
  • Xtuner-Ulysses: Higher communication and time complexity due to additional all-gather and repeated hidden dimension computation.
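
As a concrete reading of these formulas (our own arithmetic, not from the paper), the sketch below computes per-GPU communication volumes in units of bs × seq_len × d for a hypothetical 14-head model at sequence parallel size N = 8.

```python
# Per-GPU communication volume, in units of (bs * seq_len * d), read off Table 1.
# Illustrative arithmetic only; hs = 14 heads is a hypothetical model.
def comm_units(method: str, N: int, hs: int = 14) -> float:
    hs_new = ((hs + N - 1) // N) * N  # head count after dummy-head padding
    return {
        "ulysses": 8 / N,
        "ring_attention": 4.0,
        "dummy_head_ulysses": (hs_new / hs) * 8 / N,
    }[method]

for m in ("ulysses", "ring_attention", "dummy_head_ulysses"):
    print(f"{m}: {comm_units(m, N=8):.2f}")
# ulysses: 1.00, ring_attention: 4.00, dummy_head_ulysses: 1.14
```

At N = 8, Ring-Attention moves roughly 4× more data per GPU than Ulysses, while padding 14 heads to 16 costs Dummy-Head Ulysses only about a 1.14× overhead over plain Ulysses.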

Experimental Validation:

  1. Correctness Verification:
    • Trained Qwen2.5-0.5B-Instruct on SFT and DPO tasks with and without sequence parallelism (using 2x A100 for SP, 1x A100 without).
    • Showed that loss curves for DeepSpeed-Ulysses and Ring-Attention implementations closely match the non-sequence-parallel baseline, validating the implementation correctness, including the gradient aggregation using nn.all_reduce.
  2. Maximum Sequence Length Comparison:
    • Evaluated max sequence length for SFT and DPO on Qwen2.5 models (7B, 14B, 72B) using 8x A100 (32x A100 for 72B) with sequence parallel sizes 4 and 8.
    • Sequence parallelism significantly increases the maximum trainable sequence length.
    • DeepSpeed-Ulysses generally supports longer sequences for DPO, while Ring-Attention supports longer sequences for SFT.
    • Results on the 72B model suggest that inter-node communication overhead becomes an issue for Ring-Attention.
  3. Throughput Comparison:
    • Compared throughput (Tokens/s) of Ulysses, Dummy-Head Ulysses (DHU), Xtuner-Ulysses (XU), USP, and Ring-Attention (RA) on various Qwen2.5 models with sequence parallel size 8 (on 8x A100).
    • DHU consistently shows higher throughput than XU, confirming its efficiency advantage.
    • Ulysses often achieves the highest throughput, but USP (specifically USP-u4) can outperform it in some cases. The authors note that Ulysses throughput is affected by the use of flash_attn_varlen_func required for neat-packing compatibility, which adds overhead in the padding-free setting used for these throughput tests.

Practical Implications & Applications:

  • The 360-LLaMA-Factory provides a practical, open-source toolkit for practitioners to fine-tune LLMs on long sequences using readily available sequence parallelism methods.
  • The integration into the popular LLaMA-Factory framework makes these advanced techniques accessible with minimal code changes (the authors claim only one extra line of configuration is needed).
  • The Dummy-Head Ulysses specifically addresses a common constraint of DeepSpeed-Ulysses, making it more broadly applicable across different models regardless of the head count's divisibility by the parallel size, without incurring significant performance penalties like alternative methods.
  • The detailed analysis of practical issues like correct distributed communication (nn.all_reduce vs. all_reduce) and position ID handling provides crucial guidance for anyone implementing sequence parallelism.
  • The performance comparisons help practitioners choose the most suitable sequence parallelism method (Ulysses, Ring-Attention) depending on the task (SFT vs. DPO) and model size, considering factors like maximum sequence length and throughput on their specific hardware.

Limitations and Future Work:

The authors acknowledge that Dummy-Head Ulysses introduces some overhead they would like to reduce, and that a small observed discrepancy in the DPO loss remains to be minimized. Future work includes expanding model support (e.g., multimodal models), exploring more efficient sequence parallelism strategies, and optimizing training workflows (e.g., precomputing reference model outputs for DPO).

Authors (4)
  1. Haosheng Zou (6 papers)
  2. Xiaowei Lv (5 papers)
  3. Shousheng Jia (2 papers)
  4. Xiangzheng Zhang (10 papers)