LLaMA-Factory Framework
- LLaMA-Factory is a unified framework enabling efficient fine-tuning and post-training of LLMs by integrating state-of-the-art PEFT techniques and plug-and-play sequence parallelism.
- It employs a modular architecture with distinct Model Loader, Data Worker, and Trainer components, orchestrated via a Gradio-based LlamaBoard for no-code configuration and real-time monitoring.
- Its extension, 360-LLaMA-Factory, supports long-context processing through advanced attention methods like DeepSpeed-Ulysses and Ring-Attention, validated by empirical benchmarks for memory and throughput improvements.
LLaMA-Factory is a unified, extensible framework for efficient fine-tuning and post-training of LLMs, integrating state-of-the-art parameter-efficient fine-tuning (PEFT) techniques and distributed sequence parallelism. Initially open-sourced at https://github.com/hiyouga/LLaMA-Factory, it provides a high-level interface (LlamaBoard) for no-code configuration and monitoring. Its extension, 360-LLaMA-Factory, introduces plug-and-play sequence parallelism targeted at long contexts, supporting multiple attention distribution strategies and adaptive integration into popular LLMs (Zheng et al., 20 Mar 2024, Zou et al., 28 May 2025).
1. Architectural Overview and Workflow
LLaMA-Factory’s architecture comprises three core, loosely-coupled modules: Model Loader, Data Worker, and Trainer, orchestrated via the Gradio-based web UI LlamaBoard. The canonical data flow is:

$$\underbrace{\text{Model Loader}}_{\substack{\text{init / patch / quantize /}\\ \text{attach adapters}}} \;\longrightarrow\; \underbrace{\text{Data Worker}}_{\substack{\text{load / align /}\\ \text{merge / preprocess}}} \;\longrightarrow\; \underbrace{\text{Trainer}}_{\substack{\text{pre-training / SFT /}\\ \text{RLHF / DPO + PEFT}}} \;\longrightarrow\; \bigl(\text{Fine-tuned Model},\ \text{Metrics},\ \text{Logs}\bigr)$$

LlamaBoard enables form-based configuration, real-time metrics, and multilingual access (English, Chinese, Russian).
360-LLaMA-Factory extends this architecture by inserting a sequence-parallel “attention patch” applicable via a single API call. Sequence-parallel groups split the token dimension (not batch/model) and route attention computation via either all-to-all or ring communication (Zou et al., 28 May 2025).
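A minimal, library-agnostic sketch of the splitting step is shown below (assumptions: PyTorch distributed is initialized and a sequence-parallel process group has been built; the function and variable names are illustrative, not 360-LLaMA-Factory's actual API):

```python
# Illustrative only: how a sequence-parallel group partitions the token dimension
# of a batch across ranks before attention is computed.
import torch
import torch.distributed as dist

def split_sequence(hidden: torch.Tensor, sp_group) -> torch.Tensor:
    """Keep only this rank's contiguous chunk of the token (sequence) dimension.

    hidden: [batch, seq_len, hidden_dim]; seq_len is assumed to be divisible by
    the sequence-parallel world size P, so each rank keeps N / P tokens.
    """
    P = dist.get_world_size(group=sp_group)
    rank = dist.get_rank(group=sp_group)
    chunk = hidden.shape[1] // P
    return hidden[:, rank * chunk : (rank + 1) * chunk, :].contiguous()
```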
2. Integrated Fine-Tuning and PEFT Methods
LLaMA-Factory standardizes several PEFT and memory/computation optimizations:
- LoRA (Low-Rank Adaptation): Let $W_0 \in \mathbb{R}^{d \times k}$ be the weight of a linear layer; LoRA freezes $W_0$ and computes $W = W_0 + BA$ with $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, $r \ll \min(d, k)$, fine-tuning only $A$ and $B$ (see the sketch below).
- QLoRA: Quantizes $W_0$ to 4 bits (with double quantization), applying LoRA adapters on top; reduces memory from roughly 16 B/param to ~0.6 B/param.
- DoRA: Decomposes $W_0 = m \cdot \dfrac{V}{\lVert V \rVert_c}$ into magnitude $m$ and direction $V$, applying LoRA only to the directional component $V$.
- GaLore: Projects the gradient $G \in \mathbb{R}^{m \times n}$ into a low-rank subspace $R = P^{\top} G$ with projector $P \in \mathbb{R}^{m \times r}$, $r \ll m$, enabling full-parameter updates with reduced optimizer memory.
- Freeze-tuning: Most layers are frozen; only a subset of decoder blocks are fine-tuned.
- LoRA+: Applies a larger learning rate to the $B$ matrix than to $A$, improving feature learning and convergence over vanilla LoRA.
In addition, activation checkpointing, mixed precision, Flash Attention, S² attention (for extended context), and Unsloth’s Triton-based backward pass facilitate further acceleration and memory savings.
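As a concrete illustration of the LoRA update $W = W_0 + BA$ listed above, the following minimal PyTorch sketch wraps a frozen linear layer with trainable low-rank factors (a simplification: LLaMA-Factory itself attaches adapters through the Hugging Face PEFT library rather than hand-rolled modules):

```python
# Minimal LoRA sketch: y = W0 x + (alpha / r) * B A x, with W0 frozen and only
# the low-rank factors A and B receiving gradients.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)          # freeze W0
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)  # r x k
        self.B = nn.Parameter(torch.zeros(base.out_features, r))        # d x r, zero init
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)
```

Because $B$ is zero-initialized, the wrapped layer reproduces the frozen model exactly at the start of training.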
3. Sequence Parallelism in 360-LLaMA-Factory
Sequence parallelism enables training with very long sequences by distributing the token dimension across GPUs. 360-LLaMA-Factory implements two principal sequence-parallel attention modes:
DeepSpeed-Ulysses
- Splits the input sequence of length $N$ into chunks of length $N/P$ across $P$ GPUs.
- Performs all-to-all of Q/K/V, reassigning entire attention heads per GPU.
- Forward and backward passes each require all-to-all; outputs concatenated along head dimension.
- Loss gradients summed across GPUs via torch.distributed.nn.all_reduce.
- Communication cost: $O(Nd/P)$ per GPU per layer for the all-to-all exchanges of Q/K/V and the attention output ($d$ = hidden size).
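A simplified sketch of the head/token re-sharding step is given below (assumptions: the head count is divisible by the group size and a sequence-parallel process group exists; DeepSpeed's real implementation fuses and optimizes this exchange):

```python
# Illustrative Ulysses-style all-to-all: before the call each rank holds all H heads
# for its N/P local tokens; afterwards it holds H/P heads for all N tokens.
import torch
import torch.distributed as dist

def ulysses_all_to_all(x: torch.Tensor, sp_group) -> torch.Tensor:
    P = dist.get_world_size(group=sp_group)
    B, n_local, H, Dh = x.shape                          # n_local = N / P, H % P == 0
    send = [c.contiguous() for c in x.chunk(P, dim=2)]   # P chunks of [B, N/P, H/P, Dh]
    recv = [torch.empty_like(send[0]) for _ in range(P)]
    dist.all_to_all(recv, send, group=sp_group)          # head-group i goes to rank i
    return torch.cat(recv, dim=1)                        # [B, N, H/P, Dh]
```

The inverse all-to-all restores the tokens-sharded, all-heads layout after attention so the rest of the transformer block runs unchanged.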
Ring-Attention
- Also splits sequences into chunks, but keeps all heads local.
- Each GPU sends its K/V chunk to its neighbor in a ring, attending Q over received chunks.
- Pure peer-to-peer communication, overlapping send/recv.
- Communication cost: $O(Nd)$ per GPU per layer (independent of $P$), largely hidden by overlapping the ring send/recv with block-wise attention compute.
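The communication pattern can be sketched as follows (illustrative only: the online-softmax/log-sum-exp accumulation of partial outputs and causal masking are omitted, and the transfers here are not overlapped with compute as a production kernel would do):

```python
# Illustrative ring pass: each rank repeatedly attends its local Q chunk against the
# K/V chunk it currently holds, then forwards that K/V chunk to its ring neighbor.
import torch
import torch.distributed as dist
import torch.nn.functional as F

def ring_attention_pass(q, k, v, sp_group):
    # q, k, v: [batch, heads, local_tokens, head_dim] (all heads stay local).
    P = dist.get_world_size(group=sp_group)
    rank = dist.get_rank(group=sp_group)
    send_to, recv_from = (rank + 1) % P, (rank - 1) % P
    partials = []
    for step in range(P):
        partials.append(F.scaled_dot_product_attention(q, k, v))  # partial output
        if step < P - 1:
            k_next, v_next = torch.empty_like(k), torch.empty_like(v)
            reqs = dist.batch_isend_irecv([
                dist.P2POp(dist.isend, k, send_to, group=sp_group),
                dist.P2POp(dist.irecv, k_next, recv_from, group=sp_group),
                dist.P2POp(dist.isend, v, send_to, group=sp_group),
                dist.P2POp(dist.irecv, v_next, recv_from, group=sp_group),
            ])
            for r in reqs:
                r.wait()
            k, v = k_next, v_next
    return partials  # merged with per-chunk softmax statistics in a real kernel
```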
Hybrid approaches (USP, Xtuner-Ulysses) and Dummy-Head Ulysses (which pads attention heads to meet divisibility constraints) allow further flexibility. Correctness benchmarks confirm overlap of loss curves with non-parallel baselines for both SFT and DPO (see Figures 6–7 (Zou et al., 28 May 2025)).
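A sketch of how such head padding might work is shown below (this is an assumption about the mechanism as described, not the project's actual code): zero-valued dummy heads are appended so the head count divides the sequence-parallel degree, and their outputs are discarded after attention.

```python
# Hypothetical Dummy-Head padding helpers: pad the head dimension of Q/K/V with
# zero heads so that H is divisible by the sequence-parallel degree P.
import torch

def pad_dummy_heads(x: torch.Tensor, P: int):
    B, n, H, Dh = x.shape              # [batch, local_tokens, heads, head_dim]
    pad = (-H) % P                     # dummy heads needed for divisibility
    if pad:
        x = torch.cat([x, x.new_zeros(B, n, pad, Dh)], dim=2)
    return x, pad

def drop_dummy_heads(out: torch.Tensor, pad: int) -> torch.Tensor:
    # Remove the dummy heads' outputs after the Ulysses all-to-all attention step.
    return out[:, :, : out.shape[2] - pad, :] if pad else out
```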
4. Model Coverage and Generalization
Utilizing a "model registry" for automated layer detection and patching, LLaMA-Factory supports over 100 open-source LLMs, including:
| Model Family | Sizes Supported | Notable Architectures |
|---|---|---|
| LLaMA | 7B/13B/33B/65B | Decoder-only |
| LLaMA2 | 7B/13B/70B | Decoder-only |
| Baichuan | 7B/13B | |
| BLOOM/BLOOMZ | 560M–7.1B | Multilingual |
| ChatGLM2/ChatGLM3 | 6B | Prefix decoder (GLM) |
| Falcon | 7B/40B/180B | |
| Mistral | 7B | Mixtral 8×7B MoE variant |
| Gemma | 2B/7B | |
| Qwen | 1.8B–72B | |
| Yi | 6B–34B | |
Adapters and attention-patching logic are generalized to decoders, encoder–decoder hybrids, and MoE architectures via consistent identification of Linear modules (Zheng et al., 20 Mar 2024).
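A minimal sketch of this Linear-module discovery (a simplification of the model-registry idea; LLaMA-Factory resolves per-architecture target modules and delegates the actual wrapping to the PEFT library) could look like:

```python
# Enumerate nn.Linear submodules as candidate adapter targets, skipping the output head.
import torch.nn as nn

def find_linear_module_names(model: nn.Module, skip=("lm_head",)) -> list[str]:
    names = []
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear) and not any(s in name for s in skip):
            names.append(name)
    return names
```

The resulting names can then be supplied as adapter target modules (e.g., PEFT's `target_modules`) regardless of whether the backbone is a plain decoder, an encoder-decoder hybrid, or an MoE model.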
5. Empirical Benchmarks and Performance Analysis
Comprehensive benchmarks validate the memory, throughput, and quality improvements provided by PEFT and sequence-parallel modules.
Fine-Tuning Efficiency (PubMed corpus, single A100 40 GB; each cell reports memory (GB) / throughput (tokens/s) / perplexity):
| Method | Gemma-2B Mem/Speed/PPL | Llama2-7B Mem/Speed/PPL | Llama2-13B Mem/Speed/PPL |
|---|---|---|---|
| Full-tuning | 17.06 GB / 3090 / 10.34 | 38.72 GB / 1335 / 5.56 | — |
| Freeze-tuning | 8.10 GB / 5608 / 11.33 | 15.69 GB / 2905 / 6.59 | 29.02 GB / 1841 / 6.56 |
| GaLore | 10.16 GB / 2483 / 10.38 | 15.43 GB / 1584 / 5.88 | 28.91 GB / 956 / 5.72 |
| LoRA | 7.91 GB / 3521 / 10.19 | 16.32 GB / 1954 / 5.81 | 30.09 GB / 1468 / 5.75 |
| QLoRA | 5.21 GB / 3159 / 10.46 | 7.52 GB / 1579 / 5.91 | 12.61 GB / 974 / 5.81 |
LoRA and QLoRA typically achieve quality comparable to full-tuning while cutting memory use by roughly 2–5× and matching or exceeding its throughput (Zheng et al., 20 Mar 2024).
Sequence-parallel scalability (8×A100, bfloat16):
Maximum context length and throughput increase nearly linearly with the sequence-parallel degree $sp$:
| Model (sp degree) | SFT max length (DS-Ulysses / Ring-Attn) | DPO max length (DS-Ulysses / Ring-Attn) |
|---|---|---|
| Qwen2.5-7B, sp=4 | 86k / 96k | 38k / 34k |
| Qwen2.5-7B, sp=8 | 166k / 182k | 76k / 60k |
Dummy-Head Ulysses matches DS-Ulysses when divisibility constraints hold; outperforms other head-splitting methods when they do not (see Table 4 (Zou et al., 28 May 2025)).
6. Usage Examples and Implementation Details
CLI (flag names follow the `llamafactory-cli` interface; dataset and template names are illustrative):

```bash
llamafactory-cli train \
    --stage sft \
    --do_train \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --dataset alpaca_en_demo \
    --template llama2 \
    --finetuning_type lora \
    --lora_rank 128 \
    --lora_alpha 256 \
    --per_device_train_batch_size 4 \
    --learning_rate 1e-5 \
    --cutoff_len 2048 \
    --output_dir ./output/llama2-lora
```
Python API (a minimal sketch using `run_exp`, the programmatic entry point behind `llamafactory-cli train`; argument keys mirror the CLI flags above):

```python
from llamafactory.train.tuner import run_exp

run_exp({
    "stage": "sft",
    "do_train": True,
    "model_name_or_path": "meta-llama/Llama-2-7b-hf",
    "dataset": "alpaca_en_demo",
    "template": "llama2",
    "finetuning_type": "lora",
    "lora_rank": 128,
    "lora_alpha": 256,
    "per_device_train_batch_size": 4,
    "learning_rate": 1e-5,
    "cutoff_len": 2048,
    "output_dir": "./output/llama2-lora",
})
```
7. Limitations, Trade-Offs, and Prospective Developments
Notable limitations include the inability of adapter-only PEFT methods (LoRA, QLoRA, DoRA) to update layernorm or embedding parameters unless combined with freeze-tuning; quantized fine-tuning is currently restricted to adapter-based methods. In 360-LLaMA-Factory, Dummy-Head padding introduces minor memory overhead, and the DPO loss under sequence parallelism can show small numerical deviations from the non-parallel baseline. The "model-sharing RLHF" architecture yields GPU savings at the expense of modular clarity in the policy/reward/reference separation.
Recommended hardware includes NVSwitch-connected A100/H100 nodes to minimize all-to-all communication latency. Future work targets continuous model-registry expansion, support for new LLMs and PEFT methods, advanced parallel sharding schemes, extension to multimodal PEFT (vision+language), and further optimizations for distributed training, including possible 2D/3D sequence-tensor parallelism and precomputation of DPO reference outputs (Zheng et al., 20 Mar 2024, Zou et al., 28 May 2025).
A plausible implication is that LLaMA-Factory’s approach to standardized adapters and plug-and-play distributed attention lowers the entry barrier for efficient, large-context LLM adaptation across a wide model zoo, serving both academic and industrial practitioners.