
RevFFN: Memory-Efficient Fine-Tuning for MoE LLMs

Updated 16 January 2026
  • The paper presents RevFFN, a memory-efficient paradigm that reduces peak VRAM by ~49% using reversible Transformer blocks.
  • It details a reversible MoE construction that enables exact input reconstruction during backpropagation, eliminating the need to store intermediate activations.
  • Empirical results demonstrate that RevFFN maintains or slightly improves task performance while lowering peak VRAM from 65.4GB to 39.5GB.

RevFFN is a memory-efficient paradigm for full-parameter fine-tuning of Mixture-of-Experts (MoE) LLMs utilizing reversible Transformer block architectures. It addresses the activation memory bottleneck inherent in conventional fine-tuning approaches by enabling input reconstruction from outputs during the backward pass, thereby eliminating the need to store intermediate activations. This mechanism significantly reduces peak VRAM requirements and enables single-GPU training for large-scale MoE LLMs without sacrificing expressive capacity or downstream performance (Liu et al., 24 Dec 2025).

1. Architectural Principles

1.1 Standard MoE Transformer Layer

Traditional Transformer decoder layers consist of two sublayers: multi-head self-attention with residual connections and a feed-forward network (FFN). The residual formulation is:

  • Self-attention: $H' = H + \mathrm{Attn}(\mathrm{LN}(H), \mathrm{LN}(H), \mathrm{LN}(H))$
  • Feed-forward: $H_{out} = H' + \mathrm{FFN}(\mathrm{LN}(H'))$, with $\mathrm{FFN}(x) = W_2\, \sigma(W_1 x)$

The MoE variant replaces the FFN with a sparsely-gated expert layer. A gating network $g(x) = \mathrm{softmax}(W_g x) \in \mathbb{R}^E$ assigns each token to its top-$k$ experts, each an individual two-layer MLP $E_e(x) = W_2^{(e)} \sigma(W_1^{(e)} x)$, aggregated as $F_{\mathrm{MoE}}(x) = \sum_{e=1}^{E} g_e(x)\, E_e(x)$. All computation occurs in the full model dimension $d_{\mathrm{model}}$.
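The routing-and-aggregation pipeline above can be sketched in PyTorch. This is a minimal dense-loop illustration, not the paper's implementation; the function name `moe_ffn`, the expert list, and all dimensions are assumptions:

```python
import torch
import torch.nn.functional as F

def moe_ffn(x, W_g, experts, k=2):
    """Sparsely-gated MoE feed-forward over a batch of token vectors.

    x:       (T, d_model) token activations
    W_g:     (E, d_model) gating weights
    experts: list of E callables, each a two-layer MLP in d_model
    """
    gate = F.softmax(x @ W_g.t(), dim=-1)   # (T, E) routing probabilities
    topv, topi = gate.topk(k, dim=-1)       # top-k expert probs/indices per token
    out = torch.zeros_like(x)
    for e, expert in enumerate(experts):
        for slot in range(k):
            mask = topi[:, slot] == e       # tokens whose slot routes to expert e
            if mask.any():
                out[mask] += topv[mask, slot, None] * expert(x[mask])
    return out
```

Production MoE layers dispatch tokens to experts in parallel rather than looping, but the loop makes the formula $F_{\mathrm{MoE}}(x) = \sum_e g_e(x) E_e(x)$ explicit.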

1.2 Reversibility in Residual Blocks

Conventional residual blocks require caching input activations for backpropagation, incurring $O(L\,B\,S\,d_{\mathrm{model}})$ memory for $L$ layers, batch size $B$, sequence length $S$, and model dimension $d_{\mathrm{model}}$. A reversible block implements a bijective mapping $(H_1, H_2) \mapsto (Y_1, Y_2)$, enabling input reconstruction during the backward pass:

  • Forward: $Y_1 = H_1 + F(H_2)$, $Y_2 = H_2 + G(Y_1)$
  • Inverse: $H_2 = Y_2 - G(Y_1)$, $H_1 = Y_1 - F(H_2)$

RevFFN employs a two-stream coupling method, splitting activations into halves and ensuring exact invertibility.

2. Reversible MoE Block Construction

2.1 Formulation

The hidden tensor $H \in \mathbb{R}^{B \times S \times d_{\mathrm{model}}}$ is partitioned as $H = [H_1; H_2]$ with $H_1, H_2 \in \mathbb{R}^{B \times S \times d_{\mathrm{model}}/2}$. The reversible update equations for a decoder layer are:

  1. $Y_1 = H_1 + \mathrm{Attn}(\mathrm{LN}(H_2))$
  2. $Y_2 = H_2 + F_{\mathrm{MoE}}(\mathrm{LN}(Y_1))$
  3. $H_{out} = [Y_1; Y_2]$

Inverse mapping is defined as:

  1. $H_2 = Y_2 - F_{\mathrm{MoE}}(\mathrm{LN}(Y_1))$
  2. $H_1 = Y_1 - \mathrm{Attn}(\mathrm{LN}(H_2))$

A single fixed-point iteration initialized at the block outputs achieves machine-precision reconstruction.
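Exact reconstruction under this additive coupling can be checked numerically. The sketch below uses arbitrary stand-in functions for the attention and MoE sublayers (assumptions for illustration, not the paper's modules):

```python
import torch

def rev_forward(h1, h2, attn, moe):
    """Forward coupling: y1 = h1 + attn(h2); y2 = h2 + moe(y1)."""
    y1 = h1 + attn(h2)
    y2 = h2 + moe(y1)
    return y1, y2

def rev_inverse(y1, y2, attn, moe):
    """Exact inverse: recover (h1, h2) from (y1, y2) by subtraction,
    in the reverse order of the forward updates."""
    h2 = y2 - moe(y1)
    h1 = y1 - attn(h2)
    return h1, h2
```

Because each update only adds a function of the *other* stream, the inverse needs no iterative solve: the subtractions undo the forward pass exactly, up to floating-point rounding.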

2.2 MoE Feed-Forward Layer Structure

For $F_{\mathrm{MoE}}(x)$:

  • Routing: $g(x) = \mathrm{softmax}(W_g x) \in \mathbb{R}^E$, with each token dispatched to its top-$k$ experts
  • Experts: $E_e(x) = W_2^{(e)} \sigma(W_1^{(e)} x)$
  • Aggregation: $F_{\mathrm{MoE}}(x) = \sum_{e \in \mathrm{TopK}(g(x),\,k)} g_e(x)\, E_e(x)$

To maintain compatibility with pre-trained MoE modules, inputs are projected via adapter matrices $W_{\mathrm{in}}$ and $W_{\mathrm{out}}$, yielding $F(x) = W_{\mathrm{out}}\, F_{\mathrm{MoE}}(W_{\mathrm{in}}\, x)$.
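One way to realize this adapter projection around a frozen pretrained module. The names `W_in`/`W_out`, the stand-in `pretrained_moe`, and all dimensions are assumptions for illustration:

```python
import torch
import torch.nn as nn

d_model, half = 16, 8  # the two reversible streams live in d_model/2

# Hypothetical adapters bridging the half-width stream and the
# pretrained experts' full model dimension.
W_in = nn.Linear(half, d_model, bias=False)
W_out = nn.Linear(d_model, half, bias=False)
pretrained_moe = nn.Linear(d_model, d_model)  # stand-in for the frozen MoE FFN

def adapted_moe(x_half):
    """F(x) = W_out * F_MoE(W_in * x): run the full-dimension MoE on a
    half-width stream and project the result back."""
    return W_out(pretrained_moe(W_in(x_half)))
```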

3. Memory Savings and Activation Reconstruction

3.1 Back-Propagation Strategy

Standard layers require storing $H$, $H'$, attention and expert intermediates, and LayerNorm inputs for gradient calculations. In RevFFN, the backward pass proceeds as:

  1. Reconstruct $(H_1, H_2)$ from $(Y_1, Y_2)$ using the inverse mapping.
  2. Re-execute LayerNorm, Attention, and MoE blocks to recreate required intermediates.
  3. Compute gradients with respect to parameters and inputs using chain-rule.

Each sublayer is executed twice per step (forward and backward), trading memory savings for compute overhead.
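The reconstruct-then-recompute backward can be sketched as a custom `torch.autograd.Function`. This is a simplified single-coupling version under the additive-coupling assumption; a production implementation must additionally handle RNG state, dropout, and mixed precision:

```python
import torch

class ReversibleCoupling(torch.autograd.Function):
    """y1 = x1 + f(x2); y2 = x2 + g(y1). Inputs are not saved: the
    backward reconstructs them from the outputs and re-runs f and g."""

    @staticmethod
    def forward(ctx, x1, x2, f, g):
        ctx.f, ctx.g = f, g
        with torch.no_grad():                 # no graph kept for the forward
            y1 = x1 + f(x2)
            y2 = x2 + g(y1)
        ctx.save_for_backward(y1.detach(), y2.detach())
        return y1, y2

    @staticmethod
    def backward(ctx, dy1, dy2):
        f, g = ctx.f, ctx.g
        y1, y2 = ctx.saved_tensors
        # 1) Reconstruct the inputs via the exact inverse.
        with torch.no_grad():
            x2 = y2 - g(y1)
            x1 = y1 - f(x2)
        # 2) Re-execute the sublayers with grad enabled, then 3) apply
        #    the chain rule through the recomputed graph.
        with torch.enable_grad():
            x1 = x1.requires_grad_()
            x2 = x2.requires_grad_()
            y1_ = x1 + f(x2)
            y2_ = x2 + g(y1_)
            torch.autograd.backward([y1_, y2_], [dy1, dy2])
        return x1.grad, x2.grad, None, None
```

Parameter gradients for `f` and `g` accumulate during the re-execution, so each sublayer indeed runs twice per step, matching the memory-for-compute trade described above.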

3.2 Memory Complexity

Method Memory Complexity
Standard fine-tuning $O(L\,B\,S\,d_{\mathrm{model}})$
RevFFN reversible architecture $O(B\,S\,d_{\mathrm{model}})$

RevFFN eliminates the dependence on layer count $L$: activation memory scales only with batch size, sequence length, and model dimension.
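A back-of-the-envelope comparison of the two complexities, using a hypothetical fp16 configuration (all numbers are illustrative, not the paper's measurements):

```python
# Hypothetical configuration: L layers, batch B, sequence S, width d.
L, B, S, d = 48, 4, 4096, 4096
bytes_per_elem = 2  # fp16

standard = L * B * S * d * bytes_per_elem   # O(L*B*S*d): cache every layer's input
reversible = B * S * d * bytes_per_elem     # O(B*S*d): only the current block's streams

print(f"standard:   {standard / 2**30:.2f} GiB")
print(f"reversible: {reversible / 2**30:.3f} GiB")  # L-fold smaller
```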

4. Training Modifications and Performance

4.1 Backward Hook Implementation

RevFFN requires a custom backward hook:

  • At each reversible block, output activations $(Y_1, Y_2)$ are popped.
  • The inverse mapping reconstructs $(H_1, H_2)$.
  • Forward sublayers are re-executed to materialize intermediates (LayerNorm, Attn, MoE).
  • Autograd applies the chain rule for gradients with respect to model parameters and inputs.

Only expert parameters $\{W_1^{(e)}, W_2^{(e)}\}$ and the adapter matrices are updated; the gating network remains frozen during fine-tuning.
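Freezing the gating network while leaving experts and adapters trainable might look like the following (the `"gate"` name filter is an assumed naming convention, not the paper's code):

```python
import torch.nn as nn

def freeze_gating(model: nn.Module, gate_keyword: str = "gate"):
    """Disable gradients for gating-network parameters; everything else
    (experts, adapters) stays trainable."""
    for name, param in model.named_parameters():
        if gate_keyword in name:
            param.requires_grad_(False)
```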

4.2 Computational Overhead

Each reversible layer incurs roughly $2\times$ the forward FLOPs of a standard layer (due to re-execution during the backward pass). In practice, MoE computation dominates, resulting in roughly 20–30% training overhead:

  • Throughput drops from 31.0 to 24.6 samples/s on an NVIDIA H800.
Method Peak VRAM (GB) Throughput (samples/s)
SFT + Checkpointing 65.4 19.7
GaLore 45.1 35.2
RevFFN 39.5 24.6

5. Empirical Validation

Downstream task performance is evaluated on MMLU, GSM8K, MT-Bench, and a multilingual benchmark:

Method MMLU GSM8K Multilingual MT-Bench
SFT + Checkpointing 66.1% 74.8% 39.5% 7.52
RevFFN 66.7% 75.1% 38.8% 7.65

RevFFN provides a roughly 49% reduction in peak memory (vs. SFT + Checkpointing), with task accuracy matching or slightly exceeding the baselines. Ablation studies confirm that both stages of the two-stage training schedule are essential for stability and optimal performance.

6. Usage Scenarios and Implementation Recommendations

6.1 Application Context

RevFFN is indicated in settings with VRAM budgets under roughly 80 GB and model scales from several billion to tens of billions of parameters. It suits scenarios where full-parameter adaptation is required and multi-GPU training or CPU offloading is infeasible or suboptimal.

6.2 PyTorch Implementation

The reversible block wraps the attention and MoE sublayers in a two-stream additive coupling module that exposes both the forward update and its exact inverse.
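A minimal sketch of such a block, assuming the two-stream formulation of Section 2.1; the attention and MoE sublayers are placeholder `nn.Linear` modules, where a real block would plug in the pretrained sublayers:

```python
import torch
import torch.nn as nn

class ReversibleMoEBlock(nn.Module):
    """Two-stream reversible decoder block (sketch). Each sublayer maps
    d_model/2 -> d_model/2 on its half of the hidden state."""

    def __init__(self, half_dim):
        super().__init__()
        self.ln1 = nn.LayerNorm(half_dim)
        self.ln2 = nn.LayerNorm(half_dim)
        self.attn = nn.Linear(half_dim, half_dim)  # placeholder for self-attention
        self.moe = nn.Linear(half_dim, half_dim)   # placeholder for the MoE FFN

    def forward(self, h1, h2):
        y1 = h1 + self.attn(self.ln1(h2))
        y2 = h2 + self.moe(self.ln2(y1))
        return y1, y2

    @torch.no_grad()
    def inverse(self, y1, y2):
        """Recover (h1, h2) from (y1, y2) by undoing the updates in reverse."""
        h2 = y2 - self.moe(self.ln2(y1))
        h1 = y1 - self.attn(self.ln1(h2))
        return h1, h2
```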

Backward hooks must be registered to:

  1. Pop output activations,
  2. Run inverse mapping,
  3. Re-execute submodules to recover intermediates,
  4. Apply autograd for parameter gradients.

7. Summary and Implications

RevFFN refactors Transformer decoder layers into reversible two-stream blocks, maintaining full MoE routing and expert computation across the model dimension. It achieves downstream task performance comparable to standard full fine-tuning while substantially reducing peak memory usage, enabling practical single-GPU training of billion-parameter MoE LLMs by trading a moderate computational overhead (roughly $2\times$ forward FLOPs per block) for activation memory that no longer grows with depth, i.e. $O(L\,B\,S\,d_{\mathrm{model}})$ reduced to $O(B\,S\,d_{\mathrm{model}})$ (Liu et al., 24 Dec 2025). This suggests broader applicability of reversible computation techniques for efficient adaptation of modern LLM architectures where distributed infrastructure is unavailable.
