LlamaMLP Adapter Layer
- LlamaMLP Adapter Layer is a parameter-efficient module inserted within LLaMA’s MLP blocks, enabling fine-tuning by updating only a fraction of the model parameters.
- It employs a simple bottleneck architecture with a down-projection, ReLU activation, and zero-initialized up-projection to ensure rapid adaptation without disrupting pre-trained weights.
- Each adapter adds about 2.1M extra parameters per MLP sub-layer (roughly 67M in total for LLaMA-7B, around 1% of the base model), balancing computational efficiency with high performance in downstream tasks.
LlamaMLP Adapter Layer refers to the insertion of lightweight, parameter-efficient adapter modules within the MLP (feed-forward) sub-layers of LLaMA, a variant of the Transformer architecture. As introduced and empirically evaluated in "LLM-Adapters: An Adapter Family for Parameter-Efficient Fine-Tuning of LLMs" (Hu et al., 2023), these adapters—implemented as either "Series adapters" or "Parallel adapters" (terminology from the original paper)—enable efficient fine-tuning by introducing a small bottleneck network per layer, allowing task adaptation by updating only a small fraction of the model parameters while keeping the pre-trained LLM weights frozen.
1. Architecture of MLP Adapter Layers
The LlamaMLP Adapter Layer modifies the standard MLP block found in the Transformer architecture. In each layer, the conventional feed-forward block maps an input $x \in \mathbb{R}^d$ (hidden dimension $d$) to an output of the same dimension, $h = \mathrm{FFN}(x)$.
Adapters are added as follows:
- Series Adapter: Inserted after the MLP block's output $h = \mathrm{FFN}(x)$. The process is:
  - Down-project to a lower-dimensional "bottleneck" of size $r$: $z = W_{\text{down}}\,h$, where $W_{\text{down}} \in \mathbb{R}^{r \times d}$ and $r \ll d$.
  - Pointwise nonlinearity: $a = \mathrm{ReLU}(z)$.
  - Up-project to the original dimension: $u = W_{\text{up}}\,a$, with $W_{\text{up}} \in \mathbb{R}^{d \times r}$.
  - Residual addition: $h' = h + u$.
- Parallel Adapter: Operates in parallel with the MLP block, taking the MLP input $x$:
  - Down-project: $z = W_{\text{down}}\,x$.
  - Nonlinearity and up-projection as above: $a = \mathrm{ReLU}(z)$, $u = W_{\text{up}}\,a$.
  - Summed with the output of the MLP block: $h' = \mathrm{FFN}(x) + u$.
No auxiliary gating or sigmoid mechanisms are deployed in these adapters.
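The two dataflows above can be sketched in a few lines of NumPy (a minimal toy illustration, not the paper's implementation; the matrix names and sizes here are hypothetical, and the frozen FFN is modeled as a generic ReLU MLP rather than LLaMA's gated SiLU variant):

```python
import numpy as np

def mlp(x, W1, W2):
    """Stand-in for the frozen feed-forward block (generic ReLU MLP)."""
    return W2 @ np.maximum(W1 @ x, 0.0)

def adapter(v, W_down, W_up):
    """Bottleneck adapter: down-project, ReLU, up-project."""
    return W_up @ np.maximum(W_down @ v, 0.0)

rng = np.random.default_rng(0)
d, d_ff, r = 16, 64, 4                       # toy sizes; the paper uses d = 4096, r = 256
x = rng.standard_normal(d)
W1 = rng.standard_normal((d_ff, d))          # frozen MLP weights
W2 = rng.standard_normal((d, d_ff))
W_down = 0.01 * rng.standard_normal((r, d))  # small random init
W_up = np.zeros((d, r))                      # zero init: adapter starts inert

h = mlp(x, W1, W2)
h_series = h + adapter(h, W_down, W_up)      # series: adapter sees the MLP output
h_parallel = h + adapter(x, W_down, W_up)    # parallel: adapter sees the MLP input

# With W_up = 0, both placements reproduce the plain MLP output exactly.
assert np.allclose(h_series, h) and np.allclose(h_parallel, h)
```

The only difference between the two placements is which vector feeds the bottleneck: the MLP output (series) or the MLP input (parallel).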
2. Placement Strategies in Transformer Blocks
Adapters are positioned within the Transformer encoder as follows:
- The Series Adapter is appended immediately after the MLP sub-layer's output, i.e., after the $\mathrm{FFN}(x)$ operation.
- The Parallel Adapter computes its output in parallel to the MLP, with the result added to the MLP output.
Ablation results demonstrated optimal performance for Series Adapters when placed post-MLP and for Parallel Adapters when run parallel to the MLP sub-layer, rather than after attention or elsewhere.
3. Mathematical Formulation
The transformations governed by each adapter configuration are as follows:
- Series Adapter:
  $$h' = \mathrm{FFN}(x) + W_{\text{up}}\,\mathrm{ReLU}\!\left(W_{\text{down}}\,\mathrm{FFN}(x)\right)$$
- Parallel Adapter:
  $$h' = \mathrm{FFN}(x) + W_{\text{up}}\,\mathrm{ReLU}\!\left(W_{\text{down}}\,x\right)$$
Typically, the bias terms $b_{\text{down}}$ and $b_{\text{up}}$ are either set to zero or learned.
4. Hyperparameters and Initialization
The adapter bottleneck size $r$ is a critical configuration parameter. Grid search over $r \in \{64, 128, 256, 512\}$ identified $r = 256$ as optimal for most tasks.
Hyperparameter choices:
- Nonlinearity: ReLU
- Weight Initialization:
  - $W_{\text{down}}$: small random values (e.g., Gaussian with small standard deviation $\sigma$)
  - $W_{\text{up}}$: initialized to zero, ensuring the adapter initially behaves as an identity function
  - Bias terms: initialized to zero
- No additional learnable scaling or gating is utilized beyond the two projection matrices.
This initialization ensures the adapters leave the model's behavior unchanged at the start of fine-tuning, so adaptation proceeds from the pre-trained solution rather than diverging from it.
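Assuming the initialization scheme above (small $W_{\text{down}}$, zero $W_{\text{up}}$; constants are used here in place of random draws so the example is deterministic), one can check both the identity behavior and the fact that zero-initializing $W_{\text{up}}$ does not block learning:

```python
import numpy as np

d, r = 16, 4
h = np.ones(d)                  # stand-in for a frozen MLP output
W_down = np.full((r, d), 0.01)  # "small" init; constant here instead of Gaussian for determinism
W_up = np.zeros((d, r))         # zero init

a = np.maximum(W_down @ h, 0.0)  # bottleneck activations (all 0.16 in this toy setup)
out = h + W_up @ a               # adapter output with residual

# 1) Identity at initialization: the adapter does not perturb the model.
assert np.allclose(out, h)

# 2) Zero init does not stall learning: for a loss L = sum(out),
#    dL/dW_up = outer(dL/dout, a) = outer(ones, a), nonzero wherever a > 0.
grad_W_up = np.outer(np.ones(d), a)
assert np.any(grad_W_up != 0.0)
```

The gradient with respect to $W_{\text{up}}$ depends on the bottleneck activations, not on $W_{\text{up}}$ itself, so the adapter can move away from the identity on the first optimizer step.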
5. Parameter Efficiency and Model Scale
Each adapter insertion results in $2 \times d \times r$ extra parameters per MLP sub-layer (ignoring biases).
A representative calculation for LLaMA-7B:
- Hidden dimension $d = 4096$
- Bottleneck size $r = 256$
- $2 \times 4096 \times 256 = 2{,}097{,}152 \approx 2.1$ million parameters per layer.
Given 32 layers, the total adapter parameter count is approximately 67 million, which is roughly 1% of the full LLaMA-7B model's size.
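The arithmetic can be reproduced directly (the base model size of roughly 6.74B parameters is an assumption based on the commonly reported LLaMA-7B count):

```python
def adapter_params(d, r):
    """Parameters of one bottleneck adapter, ignoring biases."""
    return d * r + r * d  # W_down is r x d, W_up is d x r

d, r, n_layers = 4096, 256, 32    # LLaMA-7B hidden size, chosen bottleneck, layer count
per_layer = adapter_params(d, r)  # 2 * 4096 * 256 = 2,097,152 (~2.1M)
total = per_layer * n_layers      # ~67M across all 32 MLP sub-layers
base = 6_738_000_000              # assumed LLaMA-7B size (~6.74B parameters)
share = 100 * total / base        # ~1% of the base model
print(per_layer, total, round(share, 2))
```

This recovers the 2.1M-per-layer, 67M-total, and ~1% figures in the table below.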
| Model | Dimension $d$ | Layers | Bottleneck $r$ | Params/Adapter | Total Adapter Params | Adapter % of Base |
|---|---|---|---|---|---|---|
| LLaMA-7B | 4096 | 32 | 256 | 2.1M | 67M | 1% |
This configuration enables most of the large model's capacity to be retained, while only a small fraction of the parameters are fine-tuned per task.
6. Empirical Performance and Ablation Studies
Adapter performance was validated on arithmetic and commonsense reasoning tasks using LLaMA models of different scales and various bottleneck sizes. Key findings:
- Optimal Placement: Series Adapters achieved their best accuracy when inserted after the MLP sub-layer, with placement after the attention sub-layer performing worse; Parallel Adapters likewise performed best when run in parallel with the MLP sub-layer.
- Bottleneck Sweep (LLaMA-7B):
| Bottleneck size $r$ | Series Adapter (%) | Parallel Adapter (%) |
|---|---|---|
| 64 | 57.3 | 59.1 |
| 128 | 59.2 | 60.8 |
| 256 | 59.5 | 61.7 |
| 512 | 56.6 | 58.0 |
- Downstream Accuracy:
  - Arithmetic (LLaMA-7B + series adapter at MLP, $r = 256$): competitive with GPT-3.5 zero-shot CoT.
  - Arithmetic (LLaMA-13B + series adapter): further gains over the 7B configuration.
  - Commonsense (LLaMA-7B + series adapter at MLP): evaluated against ChatGPT zero-shot CoT.
  - Commonsense (LLaMA-13B + series adapter at MLP): exceeds ChatGPT zero-shot CoT.
These results demonstrate that with an adapter bottleneck of $r = 256$, fine-tuning roughly 1% of the model's parameters is sufficient to approach the math-reasoning performance of much larger proprietary models, while slightly exceeding strong baselines on commonsense reasoning (Hu et al., 2023).
7. Context, Significance, and Implications
LlamaMLP Adapter Layers exemplify parameter-efficient fine-tuning (PEFT) for LLMs, reducing compute costs and storage requirements for model specialization. Their simple bottleneck architecture—with no gating and zero-initialized up-projection—facilitates rapid convergence and avoids interference with pre-trained model capabilities.
Empirical evidence suggests that adapter-based PEFT, as instantiated in LLaMA-MLP adapters, is a scalable alternative to full fine-tuning for LLMs, particularly when downstream model adaptation must be performed efficiently for multiple tasks or client settings (Hu et al., 2023).