RevFFN: Memory-Efficient MoE Fine-Tuning
- The paper introduces a reversible architecture for MoE models, reducing peak activation memory by 49% during full-parameter fine-tuning.
- RevFFN employs drop-in reversible blocks with projection adapters and cross-branch attention to integrate seamlessly with pre-trained MoE decoders.
- It reconstructs input activations via a fixed-point iteration during backpropagation, ensuring precise lossless recovery with minimal extra computation.
RevFFN is a memory-efficient paradigm for full-parameter fine-tuning of large Mixture-of-Experts (MoE) Transformer-based LLMs, designed to mitigate peak activation memory bottlenecks inherent in standard backpropagation. The technique introduces reversible blocks wrapping pre-trained MoE decoder layers and leverages bijective mappings to reconstruct input activations from outputs during backward passes. This approach essentially halves peak activation memory requirements while preserving the expressive capacity and downstream task performance of MoE LLMs, thereby enabling full fine-tuning on a single consumer or server-grade GPU (Liu et al., 24 Dec 2025).
1. Architectural Design and Core Principles
RevFFN is implemented as a drop-in replacement for standard Transformer decoder layers within MoE LLM architectures. Each wrapped layer inherits pre-trained attention and MoE/MLP submodules, with the addition of a reversible scaffold and two lightweight linear "projection adapters." The input hidden state tensor is split along the feature axis into two equal-width streams, $X_1$ and $X_2$.
- The "left" stream, , is processed via a cross-branch multi-head attention mechanism, where queries stem from while keys and values are drawn from .
- The "right" stream, , undergoes an MoE feed-forward transformation conditioned on the updated left stream.
- Projection adapters and interface adapter-modified activations with original pre-trained parameters, ensuring all heavy computation occurs in the full space.
This reversible construction ensures a bijective forward mapping, enabling exact, machine-precision reconstruction of inputs during backpropagation and eliminating the need to store most intermediate activations.
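The following minimal PyTorch-style sketch illustrates this block structure under stated assumptions: the class and attribute names (`RevFFNBlock`, `up_q`, `down_attn`, and so on) are illustrative rather than the paper's API, the choice of separate query/key-value up-projections and per-stream norms is an assumption, and the wrapped `attn` module is assumed to take `(query, key, value)` and return a single full-width tensor.

```python
import torch
import torch.nn as nn

class RevFFNBlock(nn.Module):
    """Illustrative reversible wrapper around a pre-trained decoder layer's
    attention and MoE feed-forward submodules (names and interfaces assumed)."""

    def __init__(self, attn: nn.Module, moe_ffn: nn.Module, d_model: int):
        super().__init__()
        self.attn = attn          # pre-trained attention; assumed attn(q, k, v) -> tensor
        self.moe_ffn = moe_ffn    # pre-trained sparsely-gated MoE FFN (full width d)
        self.norm_q = nn.LayerNorm(d_model // 2)
        self.norm_kv = nn.LayerNorm(d_model // 2)
        self.norm_ffn = nn.LayerNorm(d_model // 2)
        # Lightweight projection adapters: half-width streams <-> full width d,
        # so all heavy computation runs in the pre-trained d-dimensional space.
        self.up_q = nn.Linear(d_model // 2, d_model)
        self.up_kv = nn.Linear(d_model // 2, d_model)
        self.down_attn = nn.Linear(d_model, d_model // 2)
        self.up_ffn = nn.Linear(d_model // 2, d_model)
        self.down_ffn = nn.Linear(d_model, d_model // 2)

    def f_attn(self, x1, x2):
        # Cross-branch attention: queries from the left stream, keys/values from the right.
        q = self.up_q(self.norm_q(x1))
        kv = self.up_kv(self.norm_kv(x2))
        return self.down_attn(self.attn(q, kv, kv))

    def g_ffn(self, y1):
        # MoE feed-forward update conditioned on the updated left stream.
        return self.down_ffn(self.moe_ffn(self.up_ffn(self.norm_ffn(y1))))

    def forward(self, x):
        x1, x2 = torch.chunk(x, 2, dim=-1)   # split along the feature axis
        y1 = x1 + self.f_attn(x1, x2)        # Y1 = X1 + Attn(Norm(X1), Norm(X2))
        y2 = x2 + self.g_ffn(y1)             # Y2 = X2 + MLP(Norm(Y1))
        return torch.cat([y1, y2], dim=-1)
```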
2. Forward and Inverse Computation in RevFFN
Forward Pass
Let $\mathrm{Norm}$ denote layer normalization, and let $\mathrm{Attn}$ and $\mathrm{MLP}$ represent the pre-trained attention and MoE/MLP blocks, respectively. The forward equations are:
$Y_1 = X_1 + \mathrm{Attn}(\mathrm{Norm}(X_1), \mathrm{Norm}(X_2))$
$Y_2 = X_2 + \mathrm{MLP}(\mathrm{Norm}(Y_1))$
where the first argument of $\mathrm{Attn}$ supplies the queries (left stream) and the second supplies the keys and values (right stream), and $\mathrm{MLP}$ stands for the MoE feed-forward mixture in MoE layers.
Inversion and Activation Reconstruction
RevFFN achieves reversibility as follows:
$\hat{X}_2 = Y_2 - \mathrm{MLP}(\mathrm{Norm}(Y_1))$
The left stream must then satisfy the implicit relation $\hat{X}_1 = Y_1 - \mathrm{Attn}(\mathrm{Norm}(\hat{X}_1), \mathrm{Norm}(\hat{X}_2))$, because the cross-branch attention depends on $X_1$ through its queries.
In practical implementations, the inversion for $X_1$ is computed using a single fixed-point iteration:
- Initialize $X_1^{(0)} = Y_1$,
- Compute $X_1^{(1)} = Y_1 - \mathrm{Attn}(\mathrm{Norm}(X_1^{(0)}), \mathrm{Norm}(\hat{X}_2))$,
- Set $\hat{X}_1 = X_1^{(1)}$.
This ensures reconstruction error below machine epsilon with negligible computational overhead in the backward pass.
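Continuing the hypothetical `RevFFNBlock` sketch above, the inversion used in the backward pass could look as follows: the right stream is recovered in closed form and the left stream via the single fixed-point step just described.

```python
import torch

# Reuses the illustrative RevFFNBlock from the earlier sketch.
@torch.no_grad()
def invert(block, y):
    """Reconstruct the block input X = [X1, X2] from its output Y = [Y1, Y2]."""
    y1, y2 = torch.chunk(y, 2, dim=-1)
    # Exact closed-form recovery of the right stream: X2_hat = Y2 - MLP(Norm(Y1)).
    x2_hat = y2 - block.g_ffn(y1)
    # Single fixed-point iteration for the left stream:
    # X1^(0) = Y1, then X1^(1) = Y1 - Attn(Norm(X1^(0)), Norm(X2_hat)).
    x1_0 = y1
    x1_hat = y1 - block.f_attn(x1_0, x2_hat)
    return torch.cat([x1_hat, x2_hat], dim=-1)
```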
3. Integration with Mixture-of-Experts Architectures
For blocks containing MoE feed-forward networks instead of dense MLPs, $\mathrm{MLP}$ is replaced by the original sparsely-gated expert mixture
$\mathrm{MoE}(x) = \sum_{i \in \mathcal{T}_k(x)} g_i(x)\, E_i(x)$
where the gating scores $g_i$ and the expert networks $E_i$ are intact and pre-trained, with routing frozen during fine-tuning. Top-$k$ selection $\mathcal{T}_k(x)$ identifies the $k$ largest gating scores. Projection adapters wrap the MoE sublayer; the expert computation and gating remain unchanged, thus preserving the full expressive capacity of the MoE framework.
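As an illustration of this sparsely-gated mixture, here is a generic top-$k$ MoE feed-forward layer (not the paper's implementation); during RevFFN fine-tuning the `router` parameters would be kept frozen.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Generic top-k sparsely-gated MoE FFN: MoE(x) = sum over selected i of g_i(x) * E_i(x)."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)  # gating scores g_i
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):  # x: (num_tokens, d_model)
        gates = F.softmax(self.router(x), dim=-1)
        topk_gates, topk_idx = gates.topk(self.k, dim=-1)  # top-k expert selection
        out = torch.zeros_like(x)
        for slot in range(self.k):
            idx, gate = topk_idx[:, slot], topk_gates[:, slot].unsqueeze(-1)
            for e, expert in enumerate(self.experts):
                mask = idx == e
                if mask.any():
                    out[mask] += gate[mask] * expert(x[mask])
        return out
```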
4. Memory Efficiency and Quantitative Comparison
RevFFN achieves substantial memory savings during fine-tuning. Conventional full fine-tuning with activation checkpointing on an NVIDIA H800 GPU (80 GB VRAM) peaks at 65.4 GB. RevFFN reduces peak activation usage to 39.5 GB (a 49% reduction). Competing approaches such as DeepSpeed ZeRO-3 and PyTorch FSDP lower per-GPU parameter state but do not reduce total activation memory, necessitating multi-GPU or host offloading.
Comparative peak activation memories are presented in the following table:
| Technique | Peak Activation Memory | Memory Reduction |
|---|---|---|
| Standard checkpointing | 65.4 GB | Baseline |
| GaLore | 45.1 GB | 31% |
| LoMo | 42.2 GB | 35% |
| RevFFN | 39.5 GB | 49% |
Theoretical activation cost for an $L$-layer baseline Transformer with sequence length $n$ and hidden width $d$ is $O(L \cdot n \cdot d)$, whereas RevFFN yields $O(L \cdot n \cdot d / 2)$ (one half-width stream per layer, plus negligible adapter overhead).
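A back-of-the-envelope check of these expressions, with purely illustrative shapes (not the benchmark configuration) and bf16 activations, shows the expected halving:

```python
# Illustrative only: L layers, batch b, sequence n, hidden width d, 2 bytes per value.
L, b, n, d, bytes_per_val = 48, 8, 4096, 4096, 2
baseline_gb = L * b * n * d * bytes_per_val / 1e9         # full-width activation per layer
revffn_gb = L * b * n * (d // 2) * bytes_per_val / 1e9    # one half-width stream per layer
print(f"baseline ~{baseline_gb:.1f} GB, RevFFN ~{revffn_gb:.1f} GB")  # ~12.9 vs ~6.4 GB
```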
5. Computational Trade-Offs and Practical Implementation
Each backward pass with RevFFN requires recomputation of the attention and MoE sublayers plus a single fixed-point iteration for inversion, reducing throughput compared to PEFT (parameter-efficient fine-tuning) methods: RevFFN achieves 24.6 samples/s, versus 75 samples/s for LoRA and 19.7 samples/s for standard full-tuning with checkpointing. Adapter parameter overhead is on the order of $d^2$ parameters per wrapped sublayer (the half-width up- and down-projections), which is negligible relative to the backbone model size.
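A hedged sketch of how this recompute-in-backward pattern can be wired into autograd, reusing the hypothetical `RevFFNBlock` and `invert` helpers from the earlier sketches: only the block output is saved, and the inputs are reconstructed and the sublayers recomputed during the backward pass.

```python
import torch

class ReversibleBlockFn(torch.autograd.Function):
    """Stores only the block output; reconstructs inputs and recomputes in backward."""

    @staticmethod
    def forward(ctx, x, block):
        ctx.block = block
        with torch.no_grad():
            y = block(x)                      # no intermediate activations kept
        ctx.save_for_backward(y.detach())
        return y

    @staticmethod
    def backward(ctx, grad_y):
        y, = ctx.saved_tensors
        block = ctx.block
        x = invert(block, y)                  # closed form + one fixed-point step
        with torch.enable_grad():
            x = x.detach().requires_grad_(True)
            y_recomputed = block(x)           # recompute attention + MoE sublayers
            torch.autograd.backward(y_recomputed, grad_y)
        return x.grad, None                   # gradient w.r.t. x; none for the module arg
```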
Stabilizing convergence typically involves a two-stage curriculum: initial adapter warm-up followed by joint fine-tuning with the MoE router frozen. RevFFN is compatible with any off-the-shelf Transformer/MoE layer without altering internal weights, enabling plug-and-play deployment for memory-constrained full fine-tuning.
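A sketch of that two-stage schedule, under assumptions about parameter naming (the substrings "router"/"gate" for routing weights and "adapter"/"proj"/"norm" for the RevFFN additions are hypothetical and depend on how the wrapped model names its modules):

```python
import torch.nn as nn

def configure_stage(model: nn.Module, stage: str) -> None:
    """Stage 'warmup': train only adapters/norms. Stage 'joint': full fine-tuning.
    The MoE router stays frozen in both stages."""
    for name, param in model.named_parameters():
        if "router" in name or "gate" in name:        # routing frozen throughout
            param.requires_grad = False
        elif stage == "warmup":                       # stage 1: adapter warm-up
            param.requires_grad = any(s in name for s in ("adapter", "proj", "norm"))
        else:                                         # stage 2: joint full-parameter tuning
            param.requires_grad = True

# Usage (train() is a placeholder for the actual training loop):
# configure_stage(model, "warmup"); train(model, steps=warmup_steps)
# configure_stage(model, "joint");  train(model, steps=main_steps)
```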
6. Context and Significance Within Memory-Efficient Fine-Tuning Paradigms
RevFFN's reversible block methodology combines reversible network theory with the high-capacity design of modern MoE Transformers. In contrast to distributed activation offloading techniques, RevFFN locally reconstructs the necessary activations during backpropagation, supporting true single-GPU full-model fine-tuning without reducing model expressiveness. A plausible implication is broader accessibility of full fine-tuning for researchers limited by hardware resources, especially for tasks requiring adaptation of all parameters rather than just adapters or selected layers. The technique's operational simplicity and minimal parameter addition suggest that it may generalize to other reversible architectures in memory-constrained learning scenarios.
For further architectural details, empirical benchmarks, and implementation guidelines see the original work: "RevFFN: Memory-Efficient Full-Parameter Fine-Tuning of Mixture-of-Experts LLMs with Reversible Blocks" (Liu et al., 24 Dec 2025).