
PEFT-Factory: Efficient Fine-Tuning Framework

Updated 29 November 2025
  • PEFT-Factory is a unified framework for parameter-efficient fine-tuning that combines lightweight adapters and routing methods for large pre-trained models.
  • It reduces parameter and memory costs by dynamically integrating low-rank modules in both dense transformers and Mixture-of-Experts architectures.
  • The framework supports diverse integration schemas—serial, parallel, and routed—while delivering notable performance gains on tasks like commonsense and arithmetic reasoning.

A Parameter-Efficient Fine-Tuning (PEFT) Factory refers to a unified, modular framework encompassing design principles and scalable implementations for rapidly instantiating and combining parameter-efficient fine-tuning methods—primarily adapters and routing mechanisms—in large-scale, frozen pre-trained models. The "PEFT-Factory" paradigm, as formalized in recent work, provides a systematic approach to integrating, routing, and managing low-rank or adapter-based fine-tuning strategies both for Mixture-of-Experts (MoE) architectures and dense pretrained models (Liu et al., 12 Nov 2024, Kwak et al., 29 Jan 2024).

1. Core Concepts and Motivation

PEFT-Factory systems address the challenge of adapting massive pre-trained models to diverse tasks or profiles without incurring prohibitive parameter and memory costs. Instead of full-model fine-tuning, PEFT-Factory orchestrates the insertion of lightweight "adapters"—low-rank modules or small learned projections—either as independent modules or via learned routing. The framework formalizes adapter integration for both standard transformer models and MoE variants, supporting flexible routing and fine-grained activation, and enables efficient multi-profile deployments using tiny profile-specific "mask" tensors or router weights. This approach reduces per-task learnable parameters and storage overhead by several orders of magnitude compared to conventional fine-tuning.

2. Design Dimensions in PEFT-Factory

Fundamental PEFT-Factory design dimensions capture both functional aspects of adapters and their composition within MoE or dense transformer layers:

  • Functional Dimensions:
  1. Adapter Architecture: Adapters are typically two-layer bottleneck modules (down-projection, activation, up-projection), with low-rank LoRA-style forms as a central instantiation: $\Delta(h) = h V U^{\top}$ (see the sketch after Table 1).
  2. Multiplicity: Factories may construct $M > 1$ adapters per location, supporting multiple expert modules in parallel.
  3. Routing: Optional learnable routers $\tilde G$ assign dynamic weights to adapter activations, enabling token-wise mixtures of adapters or selective gating.
  • Compositional Dimensions:
  1. Shared PEFT Experts: Adapters are added in parallel to all experts or layers.
  2. Embedded PEFT Experts: Each MoE expert is coupled with an individual PEFT adapter and possibly shares the expert router.
  3. Agnostic PEFT: Adapters disregard expert structure, mirroring standard LoRA-attention placement.

Table 1: Adapter Integration Types

| Placement          | Adapter Routing        | MoE Coupling |
|--------------------|------------------------|--------------|
| Shared (Parallel)  | Optional ($\tilde G$)  | All Experts  |
| Embedded           | Pretrained ($G$)       | Per-Expert   |
| Agnostic           | None                   | Dense Layers |
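The low-rank adapter form above can be made concrete with a minimal PyTorch sketch; the module and parameter names are illustrative rather than taken from any specific PEFT-Factory implementation. It computes the additive update $\Delta(h) = (h V) U^{\top}$ while the frozen backbone weights are left untouched.

```python
import torch
import torch.nn as nn

class LoRAAdapter(nn.Module):
    """Minimal low-rank adapter Delta(h) = (h V) U^T (names are illustrative)."""

    def __init__(self, d_model: int, rank: int, alpha: float = 1.0):
        super().__init__()
        # Down- and up-projection factors; only these are trained.
        self.V = nn.Parameter(torch.randn(d_model, rank) * 0.01)  # down-projection
        self.U = nn.Parameter(torch.zeros(d_model, rank))         # up-projection, zero-init so Delta starts at 0
        self.scale = alpha / rank

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (..., d_model) -> additive low-rank update with the same shape
        return (h @ self.V) @ self.U.T * self.scale
```

Zero-initializing $U$ keeps the adapted model identical to the frozen backbone at the start of fine-tuning, a common LoRA convention.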

3. Composition Strategies and Forward Mechanisms

The PEFT-Factory enables several canonical adapter integration schemas:

  • Serial Composition: Adapters follow the core expert computation: $h \mapsto E_i(h) \mapsto \Delta_i(E_i(h))$.
  • Parallel Composition: Adapters and experts process the same input independently, and their outputs are summed together with the residual: $h \mapsto E_i(h) + \Delta_j(h) + h$.
  • Routed Mixtures: Both experts and adapters are selected via token-wise routing: $x = \sum_{i=1}^{N} G(h)_i\, E_i(h) + \sum_{j=1}^{M} \tilde G(h)_j\, \Delta_j(h) + h$.

In dense models, selection reduces to a learned weighting over adapters; in MoE models, it involves separate learned routers for the experts and the adapters.
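A minimal sketch of the routed-mixture forward pass above, assuming softmax routers and dense (non-top-$k$) gating for brevity; all function and argument names are illustrative.

```python
import torch
import torch.nn.functional as F

def routed_mixture_forward(h, experts, adapters, expert_router, adapter_router):
    """x = sum_i G(h)_i E_i(h) + sum_j G~(h)_j Delta_j(h) + h  (Section 3).

    `experts` and `adapters` are lists of callables; the routers are linear layers
    producing one logit per expert/adapter. Top-k sparsification is omitted here.
    """
    g = F.softmax(expert_router(h), dim=-1)         # G(h): (..., N)
    g_tilde = F.softmax(adapter_router(h), dim=-1)  # G~(h): (..., M)
    expert_out = sum(g[..., i:i + 1] * E(h) for i, E in enumerate(experts))
    adapter_out = sum(g_tilde[..., j:j + 1] * A(h) for j, A in enumerate(adapters))
    return expert_out + adapter_out + h             # residual connection
```

In the dense (non-MoE) case, `experts` collapses to the frozen feed-forward block and only the adapter weighting remains.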

4. PERFT Framework for MoE

The Parameter-Efficient Routed Fine-Tuning (PERFT) framework is a concrete instantiation of the PEFT-Factory, specializing in MoE settings (Liu et al., 12 Nov 2024). PERFT generalizes adapter-based fine-tuning as follows:

  • Each MoE layer’s output augments the standard expert sum with a second sum over $M$ low-rank adapters, each weighted by a learned router $\tilde G$.
  • PERFT-R ("Routed"): Adapters are LoRA modules, routed by a learned token-wise router.
    • Forward: $x = \sum_{i=1}^{N} G(h)_i\, E_i(h) + \sum_{j=1}^{M} \tilde G(h)_j\, (h V_j) U_j + h$.
  • PERFT-E ("Embedded"): Sets $M = N$ and ties $\tilde G$ to the pretrained expert router $G$.
  • PERFT-D/S ("Dense/Single"): No routing; all adapters are always active.

Adapters $\{U_j, V_j\}$ and the adapter-router weights $W_{g,\mathrm{peft}}$ are the only parameters trained, with the pretrained MoE weights held fixed. The training objective augments the cross-entropy loss with MoE load balancing and a nuclear-norm regularizer to encourage sparsity in the low-rank adapters.
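A sketch of the composite objective described above. The coefficient values, the simplified load-balancing proxy, and the tensor shapes are assumptions for illustration; PERFT's exact regularizer formulation and weights are not reproduced here.

```python
import torch
import torch.nn.functional as F

def perft_style_loss(logits, labels, adapter_router_probs, adapter_factors,
                     lambda_balance=0.01, lambda_nuc=1e-4):
    """Task cross-entropy + adapter load balancing + nuclear-norm penalty (Section 4).

    adapter_router_probs: (num_tokens, M) routing probabilities G~(h).
    adapter_factors: iterable of (V_j, U_j) low-rank factor pairs.
    The coefficients are hypothetical placeholders.
    """
    ce = F.cross_entropy(logits, labels)
    # Simplified load-balancing proxy: penalize uneven average usage of the M adapters.
    usage = adapter_router_probs.mean(dim=0)                     # (M,)
    balance = adapter_router_probs.size(1) * (usage ** 2).sum()
    # Nuclear norm of each effective update V_j U_j^T encourages low effective rank.
    nuc = sum(torch.linalg.matrix_norm(V @ U.T, ord='nuc') for V, U in adapter_factors)
    return ce + lambda_balance * balance + lambda_nuc * nuc
```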

5. Multi-Profile PEFT with Adapter Banks (X-PEFT Integration)

For dense PLMs, PEFT-Factory systems such as X-PEFT (Kwak et al., 29 Jan 2024) leverage pre-collected adapter banks and learn extremely compact "mask" tensors per profile:

  • For $L$ transformer blocks and $N$ available adapters, each new profile $p$ involves learning only $M^A, M^B \in \mathbb{R}^{L \times N}$ (soft/real-valued or hard/bit masks) to reweight or select adapters.
  • Forward pass per profile:

    1. $h^{(0)} = \mathrm{Emb}(x)$;
    2. for $\ell = 1 \ldots L$: $h^{(\ell)} = \mathrm{TransformerBlock}_\theta^{(\ell)}(h^{(\ell-1)}) + \mathrm{Adapter}_p^{(\ell)}(h^{(\ell-1)})$, where $\mathrm{Adapter}_p^{(\ell)}(\cdot) = \tilde B^{(\ell)}(\tilde A^{(\ell)}(\cdot))$ and $\tilde A^{(\ell)}, \tilde B^{(\ell)}$ are weighted adapter mixtures defined by $M^A[\ell], M^B[\ell]$ (see the sketch below).
  • In the hard-mask case, only bit masks are required at inference (no floating-point mask values), reducing per-profile storage by a factor of roughly $10^4$ compared to classic adapters.
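A sketch of the mask-weighted adapter mixture for one transformer block, assuming the adapter bank is stored as stacked projection tensors, that soft masks are normalized with a softmax, and that the bottleneck uses a ReLU; these choices and all tensor names are assumptions for illustration.

```python
import torch

def masked_adapter_output(h, down_bank, up_bank, mask_A, mask_B, hard=False):
    """Adapter_p^(l)(h) = B~( A~(h) ) with A~, B~ mixed from a frozen bank (Section 5).

    down_bank: (N, d, b) stacked down-projections A_1..A_N.
    up_bank:   (N, b, d) stacked up-projections   B_1..B_N.
    mask_A, mask_B: (N,) per-layer mask rows M^A[l], M^B[l]; real-valued (soft) or 0/1 (hard).
    """
    if hard:
        # Hard masks: uniform average over the selected adapters (bit-mask storage only).
        w_A = mask_A.float() / mask_A.float().sum().clamp(min=1.0)
        w_B = mask_B.float() / mask_B.float().sum().clamp(min=1.0)
    else:
        # Soft masks: softmax normalization is an assumption, not prescribed by the text.
        w_A = torch.softmax(mask_A, dim=0)
        w_B = torch.softmax(mask_B, dim=0)
    A_tilde = torch.einsum('n,ndb->db', w_A, down_bank)  # mixed down-projection (d, b)
    B_tilde = torch.einsum('n,nbd->bd', w_B, up_bank)    # mixed up-projection   (b, d)
    return torch.relu(h @ A_tilde) @ B_tilde             # bottleneck adapter output
```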

6. Training Objectives, Memory Analysis, and Practical Workflow

PEFT-Factory training freezes the core model and (optionally) the base adapters, optimizing only lightweight adapter parameters (e.g., LoRA factors $U_j, V_j$ or mask tensors). Typical losses include task cross-entropy, MoE load balancing, and an adapter-specific regularizer (e.g., a nuclear norm). For multi-profile deployment, as in X-PEFT, only per-profile mask tensors and a task head require training and storage.
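A minimal sketch of the freezing step described above, assuming the trainable PEFT parameters can be identified by name substrings; the keyword names are hypothetical.

```python
import torch.nn as nn

def select_peft_parameters(model: nn.Module,
                           trainable_keywords=("lora_", "adapter_router", "profile_mask")):
    """Freeze all parameters except those whose names contain a PEFT keyword,
    and return the trainable ones for the optimizer. Keywords are illustrative."""
    trainable = []
    for name, param in model.named_parameters():
        param.requires_grad = any(key in name for key in trainable_keywords)
        if param.requires_grad:
            trainable.append(param)
    return trainable

# Usage sketch: optimizer = torch.optim.AdamW(select_peft_parameters(model), lr=1e-4)
```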

Memory efficiency is illustrated as follows (Kwak et al., 29 Jan 2024):

  • Full fine-tuning: ~110M parameters
  • Adapter tuning: $2 \cdot d \cdot b \cdot L$ parameters (e.g., 1.18M)
  • Soft-mask X-PEFT: $2 \cdot N \cdot L$ floats (e.g., 2.4K)
  • Hard-mask X-PEFT: $2 \cdot N \cdot L$ bits (e.g., 300 bytes)
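The arithmetic behind these figures, assuming a BERT-base-scale backbone with hidden size $d = 768$, $L = 12$ blocks, bottleneck width $b = 64$, and an adapter bank of $N = 100$; these concrete values are assumptions chosen so that the counts reproduce the quoted numbers.

```python
# Assumed dimensions (BERT-base scale); chosen to reproduce the figures quoted above.
d, L, b, N = 768, 12, 64, 100

adapter_tuning_params = 2 * d * b * L   # 1,179,648 ≈ 1.18M trained floats per task
soft_mask_floats      = 2 * N * L       # 2,400 floats per profile
hard_mask_bits        = 2 * N * L       # 2,400 bits ≈ 300 bytes per profile

print(adapter_tuning_params, soft_mask_floats, hard_mask_bits // 8)
```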

7. Empirical Results and Comparative Insights

Experiments on OLMoE-1B-7B and Mixtral-8×7B (MoE LLMs) across commonsense and arithmetic reasoning tasks show that PEFT-Factory’s routed adapters (PERFT-R) offer up to +17.2 percentage points in commonsense and +12.3 in arithmetic task accuracy over LoRA-attention baselines at matched active parameter counts (Liu et al., 12 Nov 2024). PERFT-E benefits from pretrained routers with many adapters. Non-routed (dense) adapters exhibit performance collapse at high bottleneck rank, underscoring the criticality of sparse, routed activation. Task-optimal adapter count and rank differ: commonsense tasks prefer fewer, overparameterized adapters, while arithmetic tasks benefit from more, smaller adapters.

X-PEFT demonstrates that hard-mask memory reductions of $10^4\times$ incur only minor accuracy drops (typically 2–5 points vs. adapter tuning) on LaMP, GLUE, and SuperGLUE (Kwak et al., 29 Jan 2024). This establishes PEFT-Factory as a practical architecture for scalable multi-task or multi-profile model serving at low computational and memory cost.

