
LLaMA-Factory Framework

Updated 9 December 2025
  • LLaMA-Factory is a unified framework enabling efficient fine-tuning and post-training of LLMs by integrating state-of-the-art PEFT techniques and plug-and-play sequence parallelism.
  • It employs a modular architecture with distinct Model Loader, Data Worker, and Trainer components, orchestrated via a Gradio-based LlamaBoard for no-code configuration and real-time monitoring.
  • Its extension, 360-LLaMA-Factory, supports long-context processing through advanced attention methods like DeepSpeed-Ulysses and Ring-Attention, validated by empirical benchmarks for memory and throughput improvements.

LLaMA-Factory is a unified, extensible framework for efficient fine-tuning and post-training of LLMs, integrating state-of-the-art parameter-efficient fine-tuning (PEFT) techniques and advanced distributed sequence parallelism. Initially open-sourced at https://github.com/hiyouga/LLaMA-Factory, it provides a high-level interface (LlamaBoard) for no-code configuration and monitoring. Its extension, 360-LLaMA-Factory, introduces plug-and-play sequence parallelism targeted at long contexts, supporting multiple attention distribution strategies and adaptive integration into popular LLMs (Zheng et al., 20 Mar 2024, Zou et al., 28 May 2025).

1. Architectural Overview and Workflow

LLaMA-Factory’s architecture comprises three core, loosely coupled modules: Model Loader, Data Worker, and Trainer, orchestrated via the Gradio-based web UI LlamaBoard. The canonical data flow is

$$\underbrace{\text{Model Loader}}_{\substack{\text{init / patch / quantize /}\\ \text{attach adapters}}} \;\longrightarrow\; \underbrace{\text{Data Worker}}_{\substack{\text{load / align /}\\ \text{merge / preprocess}}} \;\longrightarrow\; \underbrace{\text{Trainer}}_{\substack{\text{pre-training / SFT /}\\ \text{RLHF / DPO + PEFT}}} \;\longrightarrow\; \bigl(\text{Fine-tuned Model},\;\text{Metrics},\;\text{Logs}\bigr)$$

LlamaBoard enables form-based configuration, real-time metrics, and multilingual access (English, Chinese, Russian).

360-LLaMA-Factory extends this architecture by inserting a sequence-parallel “attention patch” applicable via a single API call. Sequence-parallel groups split the token dimension (not batch/model) and route attention computation via either all-to-all or ring communication (Zou et al., 28 May 2025).
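
The patch follows a standard monkey-patching pattern: locate every attention module and wrap its forward pass with sequence-parallel communication. The sketch below illustrates the pattern only; the function and argument names are hypothetical, not 360-LLaMA-Factory's actual API.

import torch.nn as nn

def patch_attention_modules(model: nn.Module, sp_group, wrap_forward):
    # Hypothetical illustration of the plug-and-play patch pattern described above.
    # `wrap_forward(orig_forward, sp_group)` is where the Ulysses all-to-all or the
    # ring K/V exchange would be inserted (see the sketches in Section 3).
    for module in model.modules():
        if module.__class__.__name__.endswith("Attention"):
            module.forward = wrap_forward(module.forward, sp_group)
    return model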

2. Integrated Fine-Tuning and PEFT Methods

LLaMA-Factory standardizes several PEFT and memory/computation optimizations:

  • LoRA (Low-Rank Adaptation): Let $W\in\mathbb{R}^{d_\text{out}\times d_\text{in}}$ be a linear layer's weight. LoRA freezes $W$ and learns a low-rank update $\Delta W = \frac{\alpha}{r}AB$ with $A\in\mathbb{R}^{d_\text{out}\times r}$ and $B\in\mathbb{R}^{r\times d_\text{in}}$, fine-tuning only $\{A,B\}$ (a minimal sketch follows this list).
  • QLoRA: Quantizes $W$ to 4 bits with double quantization and applies LoRA adapters on top, reducing memory from roughly 16 bytes/parameter to about 0.6 bytes/parameter.
  • DoRA: Decomposes $W$ into a magnitude component and a normalized direction $\bar{W}$, applying LoRA only to the directional component $\bar{W}$.
  • GaLore: Projects gradients into a low-rank subspace, $\nabla_W\ell\approx UV$ with $U\in\mathbb{R}^{d_\text{out}\times k}$ and $V\in\mathbb{R}^{k\times d_\text{in}}$, enabling full-parameter updates with reduced optimizer memory.
  • Freeze-tuning: Most layers are frozen; only a subset of decoder blocks are fine-tuned.
  • LoRA+: Improves LoRA optimization by assigning the two adapter matrices separate, appropriately scaled learning rates, which stabilizes feature learning at higher ranks.
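
As a concrete illustration of the LoRA bullet above, the following minimal PyTorch sketch wraps a frozen nn.Linear with trainable low-rank factors in this section's convention ($\Delta W = \frac{\alpha}{r}AB$); the class name and initialization choices are illustrative rather than LLaMA-Factory internals.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                          # freeze W (and bias)
        d_out, d_in = base.out_features, base.in_features
        self.A = nn.Parameter(torch.zeros(d_out, r))         # A ∈ R^{d_out×r}, zero-init so ΔW starts at 0
        self.B = nn.Parameter(torch.randn(r, d_in) * 0.01)   # B ∈ R^{r×d_in}
        self.scale = alpha / r

    def forward(self, x):
        # y = W x + (α/r) · (A B) x, with only A and B receiving gradients
        return self.base(x) + self.scale * F.linear(F.linear(x, self.B), self.A)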

In addition, activation checkpointing, mixed precision, Flash Attention, S² attention (for extended context), and Unsloth’s Triton-based backward pass facilitate further acceleration and memory savings.

3. Sequence Parallelism in 360-LLaMA-Factory

Sequence parallelism enables training with very long sequences by distributing the token dimension across GPUs. 360-LLaMA-Factory implements two principal sequence-parallel attention modes:

DeepSpeed-Ulysses

  • Splits the input sequence into $sp$ chunks of length $L/sp$, one per GPU in the sequence-parallel group.
  • Performs an all-to-all on Q/K/V that reassigns entire attention heads to GPUs, so each GPU sees the full sequence for its subset of heads (see the sketch after this list).
  • Forward and backward passes each require one all-to-all; outputs are concatenated along the head dimension.
  • Loss gradients are summed across GPUs via torch.distributed.nn.all_reduce.
  • Communication cost: $O\bigl(\tfrac{8}{sp}\,B L H\bigr)$ per layer.
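
A minimal sketch of the Ulysses-style all-to-all (not DeepSpeed's actual implementation): it trades sequence shards for head shards so each GPU attends over the full sequence with a subset of heads.

import torch
import torch.distributed as dist

def ulysses_scatter_heads(x: torch.Tensor, sp_group) -> torch.Tensor:
    # x: [B, L/sp, H, d] (local sequence chunk, all heads)
    # returns: [B, L, H/sp, d] (full sequence, local head group)
    sp = dist.get_world_size(sp_group)
    B, L_local, H, d = x.shape
    assert H % sp == 0, "head count must divide sp (otherwise pad with dummy heads)"
    x = x.reshape(B, L_local, sp, H // sp, d).permute(2, 0, 1, 3, 4).contiguous()
    out = torch.empty_like(x)
    dist.all_to_all_single(out, x, group=sp_group)  # head-group g goes to rank g
    return out.permute(1, 0, 2, 3, 4).reshape(B, sp * L_local, H // sp, d)

Applied to Q, K, and V before attention (and inverted on the attention output), this reproduces the head-reassignment step described above.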

Ring-Attention

  • Also splits the sequence into $sp$ chunks, but keeps all attention heads local to each GPU.
  • Each GPU passes its K/V chunk to its neighbor in a ring, attending its local Q over every received chunk (a communication skeleton follows this list).
  • Uses pure peer-to-peer communication, overlapping sends and receives with computation.
  • Communication cost: $O(4 B L H)$ per layer.
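
Communication skeleton of one ring step, sketched under the description above rather than taken from the library's kernel: each rank forwards its K/V block to its ring neighbor while receiving the previous rank's block. A correct implementation must also merge the partial attention outputs with online-softmax (log-sum-exp) rescaling, which is omitted here.

import torch
import torch.distributed as dist

def ring_pass_kv(kv: torch.Tensor, sp_group) -> torch.Tensor:
    # Send our K/V block to the next rank in the ring; receive from the previous rank.
    rank = dist.get_rank(sp_group)
    world = dist.get_world_size(sp_group)
    send_peer = dist.get_global_rank(sp_group, (rank + 1) % world)
    recv_peer = dist.get_global_rank(sp_group, (rank - 1) % world)
    recv_buf = torch.empty_like(kv)
    ops = [dist.P2POp(dist.isend, kv.contiguous(), send_peer),
           dist.P2POp(dist.irecv, recv_buf, recv_peer)]
    for req in dist.batch_isend_irecv(ops):
        req.wait()
    return recv_buf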

Hybrid approaches (USP, Xtuner-Ulysses) and Dummy-Head Ulysses (which pads attention heads to meet divisibility constraints) allow further flexibility. Correctness benchmarks confirm overlap of loss curves with non-parallel baselines for both SFT and DPO (see Figures 6–7 (Zou et al., 28 May 2025)).
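
A sketch of the dummy-head idea under the divisibility constraint just mentioned: if the head count $H$ is not divisible by $sp$, zero-valued heads are appended before the all-to-all and sliced off the attention output afterwards. The helper below is illustrative, not the shipped implementation.

import torch
import torch.nn.functional as F

def pad_dummy_heads(x: torch.Tensor, sp: int):
    # x: [B, L, H, d]; append zero heads so the head count divides sp
    n_heads = x.shape[2]
    n_dummy = (-n_heads) % sp
    if n_dummy:
        x = F.pad(x, (0, 0, 0, n_dummy))  # pads dim -2 (heads) on the right
    return x, n_dummy  # slice the attention output back to n_heads afterwards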

4. Model Coverage and Generalization

Utilizing a "model registry" for automated layer detection and patching, LLaMA-Factory supports over 100 open-source LLMs, including:

| Model Family | Sizes Supported | Notable Architectures |
|---|---|---|
| LLaMA | 7B/13B/33B/65B | Decoder-only |
| LLaMA2 | 7B/13B/70B | Decoder-only |
| Baichuan | 7B/13B | |
| BLOOM | 560M–7.1B | MoE variants |
| ChatGLM 2/3 | 6B | Encoder–decoder hybrids |
| Falcon | 7B/40B/180B | |
| Mistral | 7B | |
| Gemma | 2B/7B | |
| Qwen | 1.8B–72B | |
| Yi | 6B–34B | |

Adapters and attention-patching logic are generalized to decoders, encoder–decoder hybrids, and MoE architectures via consistent identification of Linear modules (Zheng et al., 20 Mar 2024).
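
A minimal sketch (not LLaMA-Factory's internal registry code) of how adapter targets can be discovered generically: collecting the leaf names of every nn.Linear module works uniformly across decoder-only, encoder–decoder, and MoE checkpoints, and the resulting names can be passed to, e.g., peft.LoraConfig(target_modules=...).

import torch.nn as nn

def find_linear_module_names(model: nn.Module) -> set:
    # Collect leaf names such as "q_proj" or "down_proj" for every nn.Linear in the model.
    names = set()
    for full_name, module in model.named_modules():
        if isinstance(module, nn.Linear):
            names.add(full_name.split(".")[-1])
    return names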

5. Empirical Benchmarks and Performance Analysis

Comprehensive benchmarks validate the memory, throughput, and quality improvements provided by PEFT and sequence-parallel modules.

Fine-Tuning Efficiency (PubMed, one A100-40 GB):

| Method | Gemma-2B (Mem / Speed / PPL) | Llama2-7B (Mem / Speed / PPL) | Llama2-13B (Mem / Speed / PPL) |
|---|---|---|---|
| Full-tuning | 17.06 GB / 3090 / 10.34 | 38.72 GB / 1335 / 5.56 | – |
| Freeze-tuning | 8.10 GB / 5608 / 11.33 | 15.69 GB / 2905 / 6.59 | 29.02 GB / 1841 / 6.56 |
| GaLore | 10.16 GB / 2483 / 10.38 | 15.43 GB / 1584 / 5.88 | 28.91 GB / 956 / 5.72 |
| LoRA | 7.91 GB / 3521 / 10.19 | 16.32 GB / 1954 / 5.81 | 30.09 GB / 1468 / 5.75 |
| QLoRA | 5.21 GB / 3159 / 10.46 | 7.52 GB / 1579 / 5.91 | 12.61 GB / 974 / 5.81 |

Mem = peak GPU memory; Speed = training throughput (tokens/s); PPL = perplexity; – = not reported.

LoRA/QLoRA typically achieve comparable quality to full-tuning, while yielding 2–4× memory savings and 1.5–3× speedups (Zheng et al., 20 Mar 2024).

Sequence-parallel scalability (A100×8, bfloat16):

Maximum context length and throughput increase nearly linearly with $sp$:

| Model / SP degree | SFT max length (DS-Ulysses / Ring-Attn) | DPO max length (DS-Ulysses / Ring-Attn) |
|---|---|---|
| Qwen2.5-7B, sp=4 | 86k / 96k | 38k / 34k |
| Qwen2.5-7B, sp=8 | 166k / 182k | 76k / 60k |

Dummy-Head Ulysses matches DS-Ulysses when divisibility constraints hold; outperforms other head-splitting methods when they do not (see Table 4 (Zou et al., 28 May 2025)).

6. Usage Examples and Implementation Details

CLI (flag names follow recent LLaMA-Factory releases; run `llamafactory-cli train --help` to confirm against your installed version):

llamafactory-cli train \
  --stage sft \
  --do_train \
  --model_name_or_path meta-llama/Llama-2-7b-hf \
  --dataset alpaca \
  --finetuning_type lora \
  --lora_rank 128 --lora_alpha 256 \
  --per_device_train_batch_size 4 \
  --learning_rate 1e-5 \
  --cutoff_len 2048 \
  --output_dir ./output/llama2-lora

Python API (the programmatic entry point mirrors the CLI arguments; the officially documented route is `llamafactory-cli train` with a YAML config, so treat the import path below as version-dependent):

# Programmatic equivalent of the CLI call above; the `run_exp` entry point and
# argument names reflect recent releases and may differ across versions.
from llamafactory.train.tuner import run_exp

run_exp({
    "stage": "sft",
    "do_train": True,
    "model_name_or_path": "meta-llama/Llama-2-7b-hf",
    "dataset": "alpaca",
    "finetuning_type": "lora",
    "lora_rank": 128,
    "lora_alpha": 256,
    "per_device_train_batch_size": 4,
    "learning_rate": 1e-5,
    "cutoff_len": 2048,
    "output_dir": "./output/llama2-lora",
})

Integration of sequence parallelism requires minimal modification: typically a single function replaces the attention mechanism and handles token splitting, padding, and distributed loss reduction (Zou et al., 28 May 2025). Attention masks and position IDs must be broadcast correctly, and sequences should be padded to a multiple of $8 \times sp$ for optimal FlashAttention performance (see the helper sketch below).
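
An illustrative padding-and-splitting helper consistent with the recommendation above (an assumption about typical preprocessing, not the shipped API): it pads each batch to a multiple of $8 \times sp$ and extracts the local sequence chunk for one sequence-parallel rank.

import torch
import torch.nn.functional as F

def pad_and_split(input_ids, labels, sp_rank, sp_size, pad_id, ignore_index=-100):
    # input_ids, labels: [B, L]; pad L to a multiple of 8 * sp_size, then take this rank's chunk.
    seq_len = input_ids.shape[1]
    multiple = 8 * sp_size
    pad_len = (-seq_len) % multiple
    if pad_len:
        input_ids = F.pad(input_ids, (0, pad_len), value=pad_id)
        labels = F.pad(labels, (0, pad_len), value=ignore_index)
    chunk = input_ids.shape[1] // sp_size
    local = slice(sp_rank * chunk, (sp_rank + 1) * chunk)
    return input_ids[:, local], labels[:, local]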

7. Limitations, Trade-Offs, and Prospective Developments

Notable limitations include the inability of adapter-only PEFT methods (LoRA, QLoRA, DoRA) to update layernorm or embedding biases unless combined with freeze-tuning; quantized fine-tuning is currently restricted to adapter-based methods. In 360-LLaMA-Factory, Dummy-Head padding introduces minor memory overhead, and DPO loss can exhibit small variance. The “model-sharing RLHF” architecture yields GPU savings at the expense of modular clarity in policy/reward/reference separation.

Recommended hardware includes NVSwitch-connected A100/H100 nodes to minimize all-to-all communication latency. Future work targets continuous model registry expansion, support for novel LLM and PEFT methods, advanced parallel sharding schemes, extension to multimodal PEFT (vision+language), and further optimizations for distributed training – including possible 2D/3D sequence-tensor parallelism and precomputation of DPO reference outputs (Zheng et al., 20 Mar 2024, Zou et al., 28 May 2025).

A plausible implication is that LLaMA-Factory’s approach to standardized adapters and plug-and-play distributed attention lowers the entry barrier for efficient, large-context LLM adaptation across a wide model zoo, serving both academic and industrial practitioners.

References

1. Zheng, Y. et al. (2024). LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models. arXiv:2403.13372.
2. Zou, H. et al. (2025). 360-LLaMA-Factory: Plug & Play Sequence Parallelism for Long Post-Training.
