LlamaMLP Adapter Layer

  • LlamaMLP Adapter Layer is a parameter-efficient module inserted within LLaMA’s MLP blocks, enabling fine-tuning by updating only a fraction of the model parameters.
  • It employs a simple bottleneck architecture with a down-projection, ReLU activation, and zero-initialized up-projection to ensure rapid adaptation without disrupting pre-trained weights.
  • Each adapter adds about 2.1M extra parameters per layer (roughly 67M across 32 layers, around 1% of LLaMA-7B), balancing computational efficiency with high performance in downstream tasks.

LlamaMLP Adapter Layer refers to the insertion of lightweight, parameter-efficient adapter modules within the MLP (feed-forward) sub-layers of LLaMA, a variant of the Transformer architecture. As introduced and empirically evaluated in "LLM-Adapters: An Adapter Family for Parameter-Efficient Fine-Tuning of LLMs" (Hu et al., 2023), these adapters—implemented as either "Series adapters" or "Parallel adapters" (terminology from the original paper)—introduce a small bottleneck network per layer, allowing task adaptation by updating only a small fraction of the model parameters while keeping the pre-trained LLM weights frozen.

1. Architecture of MLP Adapter Layers

The LlamaMLP Adapter Layer modifies the standard MLP block found in the Transformer architecture. In each layer, the conventional feed-forward block operates as:

  • $x \rightarrow W_1 x + b_1 \rightarrow \mathrm{GELU}(\cdot) \rightarrow W_2(\cdot) + b_2 = y$

Adapters are added as follows:

  • Series Adapter: Inserted after the MLP block's output, $y$. The process is:
    • Down-project $y$ to a lower-dimensional "bottleneck" of size $r$: $U = W_{\mathrm{down}}\, y + b_{\mathrm{down}}$, where $W_{\mathrm{down}} \in \mathbb{R}^{r \times d}$.
    • Pointwise nonlinearity: $V = \mathrm{ReLU}(U)$.
    • Up-project back to the original dimension: $\Delta = W_{\mathrm{up}}\, V + b_{\mathrm{up}}$, where $W_{\mathrm{up}} \in \mathbb{R}^{d \times r}$.
    • Residual addition: $y^{\prime} = y + \Delta$.
  • Parallel Adapter: Operates in parallel with the MLP block, taking the MLP input $x$:
    • Down-project $x$: $U = W_{\mathrm{down}}\, x + b_{\mathrm{down}}$.
    • Nonlinearity and up-projection as above: $V = \mathrm{ReLU}(U)$, $\Delta = W_{\mathrm{up}}\, V + b_{\mathrm{up}}$.
    • Summed with the output of the MLP block: $y^{\prime} = \mathrm{MLP}(x) + \Delta$.

No auxiliary gating or sigmoid mechanisms are deployed in these adapters.
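
A minimal PyTorch sketch of this bottleneck may help make the structure concrete. This is an illustrative implementation only; the module and argument names (`BottleneckAdapter`, `d_model`, `r`) are not from the original paper.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Down-projection -> ReLU -> up-projection, with no gating.

    The up-projection is zero-initialized so the adapter's output (Delta) is zero
    at the start of fine-tuning, and the residual addition leaves y unchanged.
    """
    def __init__(self, d_model: int, r: int = 256):
        super().__init__()
        self.down = nn.Linear(d_model, r)  # W_down, b_down
        self.up = nn.Linear(r, d_model)    # W_up, b_up
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # Returns Delta; the caller adds it residually (y' = y + Delta).
        return self.up(torch.relu(self.down(h)))
```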

2. Placement Strategies in Transformer Blocks

Adapters are positioned within each Transformer block as follows:

  • The Series Adapter is appended immediately after the MLP sub-layer's output, i.e., after the $W_2(\cdot) + b_2$ projection.
  • The Parallel Adapter computes its output in parallel to the MLP, with the result added to the MLP output.

Ablation results demonstrated optimal performance for Series Adapters when placed post-MLP and for Parallel Adapters when run parallel to the MLP sub-layer, rather than after attention or elsewhere.
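
Both placements can be sketched as a thin wrapper around a frozen MLP sub-layer, reusing the `BottleneckAdapter` above. Again, this is an illustrative sketch under assumed names (`AdaptedMLP`, `mode`), not the paper's reference implementation.

```python
class AdaptedMLP(nn.Module):
    """Wraps a frozen MLP sub-layer with a bottleneck adapter in series or parallel mode."""
    def __init__(self, mlp: nn.Module, d_model: int, r: int = 256, mode: str = "series"):
        super().__init__()
        assert mode in ("series", "parallel")
        self.mlp, self.adapter, self.mode = mlp, BottleneckAdapter(d_model, r), mode
        for p in self.mlp.parameters():
            p.requires_grad = False  # only the adapter parameters are trained

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.mlp(x)
        if self.mode == "series":
            return y + self.adapter(y)   # series: adapter reads the MLP output
        return y + self.adapter(x)       # parallel: adapter reads the MLP input
```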

3. Mathematical Formulation

The transformations governed by each adapter configuration are as follows:

  • Series Adapter:

$$y^{\prime} = y + W_{\mathrm{up}}\,\mathrm{ReLU}(W_{\mathrm{down}}\, y + b_{\mathrm{down}}) + b_{\mathrm{up}}, \qquad y = \mathrm{MLP}(x)$$

  • Parallel Adapter:

$$y^{\prime} = \mathrm{MLP}(x) + W_{\mathrm{up}}\,\mathrm{ReLU}(W_{\mathrm{down}}\, x + b_{\mathrm{down}}) + b_{\mathrm{up}}$$

Typically, $b_{\mathrm{down}}$ and $b_{\mathrm{up}}$ are either set to zero or learned.
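
Expressed functionally, the two formulas can be written as below (an illustrative sketch using PyTorch's `F.linear` convention, where a weight of shape `(out, in)` is applied as $x W^{\top} + b$; function names are assumptions, not the paper's API):

```python
import torch
import torch.nn.functional as F

def series_output(y, W_down, b_down, W_up, b_up):
    # y' = y + W_up ReLU(W_down y + b_down) + b_up, with y = MLP(x)
    return y + F.linear(torch.relu(F.linear(y, W_down, b_down)), W_up, b_up)

def parallel_output(x, mlp, W_down, b_down, W_up, b_up):
    # y' = MLP(x) + W_up ReLU(W_down x + b_down) + b_up
    return mlp(x) + F.linear(torch.relu(F.linear(x, W_down, b_down)), W_up, b_up)
```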

4. Hyperparameters and Initialization

The adapter bottleneck size $r$ is a critical configuration parameter. Grid search over $r \in \{64, 128, 256, 512\}$ identified $r = 256$ as optimal for most tasks.

Hyperparameter choices:

  • Nonlinearity: ReLU
  • Weight Initialization:
    • $W_{\mathrm{down}}$: random initialization with a small standard deviation
    • $W_{\mathrm{up}}$: initialized to zero, ensuring the adapter initially behaves as an identity function
    • Biases $b_{\mathrm{down}}$ and $b_{\mathrm{up}}$: initialized to zero
  • No additional learnable scaling or gating is utilized beyond the two projection matrices.

This suggests that the adapters are initialized to prevent divergence from original model behavior at the start of fine-tuning.
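
A sketch of this initialization, applied to the `BottleneckAdapter` above, is shown below; the standard-deviation value is illustrative, not one reported in the paper.

```python
def init_adapter(adapter: BottleneckAdapter, std: float = 1e-2) -> None:
    nn.init.normal_(adapter.down.weight, mean=0.0, std=std)  # W_down: small random values
    nn.init.zeros_(adapter.down.bias)                        # b_down: zero
    nn.init.zeros_(adapter.up.weight)                        # W_up: zero
    nn.init.zeros_(adapter.up.bias)                          # b_up: zero

adapter = BottleneckAdapter(d_model=4096, r=256)
init_adapter(adapter)
h = torch.randn(2, 4096)
assert torch.equal(adapter(h), torch.zeros_like(h))  # Delta = 0, so y' = y at the start of training
```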

5. Parameter Efficiency and Model Scale

Each adapter insertion results in approximately $2\,d\,r$ extra parameters per MLP sub-layer (two projection matrices of sizes $r \times d$ and $d \times r$, plus $d + r$ bias terms).

A representative calculation for LLaMA-7B:

  • Hidden dimension $d = 4096$
  • Bottleneck size $r = 256$
  • $2 \times 4096 \times 256 \approx 2.1$ million parameters per layer

Given 32 layers, the total adapter parameter count is approximately 67 million, which is roughly 1% of the full LLaMA-7B model's size.

| Model | Dimension $d$ | Layers | Bottleneck $r$ | Params/Adapter | Total Adapter Params | Adapter % of Base |
|---|---|---|---|---|---|---|
| LLaMA-7B | 4096 | 32 | 256 | 2.1M | 67M | ~1% |

This configuration enables most of the large model's capacity to be retained, while only a small fraction of the parameters are fine-tuned per task.
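
These figures can be reproduced with a few lines of arithmetic. This is a quick sanity check assuming LLaMA-7B's hidden size of 4096, 32 layers, and roughly 6.7B base parameters.

```python
d, r, layers = 4096, 256, 32
per_adapter = 2 * d * r + d + r          # W_down (r x d) + W_up (d x r) weights, plus biases
total = layers * per_adapter
print(per_adapter)                       # 2_101_504  -> ~2.1M per layer
print(total)                             # 67_248_128 -> ~67M in total
print(f"{total / 6.7e9:.2%}")            # ~1.00% of the ~6.7B base parameters
```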

6. Empirical Performance and Ablation Studies

Adapter performance was validated on arithmetic and commonsense reasoning tasks using LLaMA models of different scales and various bottleneck sizes. Key findings:

  • Optimal Placement: the Series Adapter achieved its best accuracy when inserted after the MLP rather than after the attention sub-layer, and the Parallel Adapter achieved its best accuracy when run in parallel with the MLP.
  • Bottleneck Sweep (LLaMA-7B):
| Bottleneck $r$ | Series Adapter (%) | Parallel Adapter (%) |
|---|---|---|
| 64 | 57.3 | 59.1 |
| 128 | 59.2 | 60.8 |
| 256 | 59.5 | 61.7 |
| 512 | 56.6 | 58.0 |
  • Downstream Accuracy: with the Series Adapter in the MLP ($r = 256$), adapter-tuned LLaMA-7B and LLaMA-13B were evaluated against GPT-3.5 zero-shot CoT on arithmetic reasoning and against ChatGPT zero-shot CoT on commonsense reasoning; on commonsense reasoning, the LLaMA-13B configuration exceeded the ChatGPT baseline.

These results demonstrate that with an adapter bottleneck of $r = 256$, fine-tuning roughly 1% of the parameters recovers much of the math-reasoning performance of far larger models such as GPT-3.5, whilst slightly exceeding strong baselines on commonsense reasoning (Hu et al., 2023).

7. Context, Significance, and Implications

LlamaMLP Adapter Layers exemplify parameter-efficient fine-tuning (PEFT) for LLMs, reducing compute costs and storage requirements for model specialization. Their simple bottleneck architecture—with no gating and zero-initialized up-projection—facilitates rapid convergence and avoids interference with pre-trained model capabilities.

Empirical evidence suggests that adapter-based PEFT, as instantiated in LLaMA-MLP adapters, is a scalable alternative to full fine-tuning for LLMs, particularly when downstream model adaptation must be performed efficiently for multiple tasks or client settings (Hu et al., 2023).

References

  1. Hu et al. (2023). LLM-Adapters: An Adapter Family for Parameter-Efficient Fine-Tuning of Large Language Models. arXiv:2304.01933.
