Hadamard Adapter in Transformers

Updated 26 January 2026

Hadamard Adapter is a parameter-efficient module that fine-tunes pre-trained language models by applying element-wise linear transformations to self-attention outputs.
It is inserted between the multi-head self-attention and LayerNorm layers, so that only the adapter and selected LayerNorm parameters are updated while the base model stays frozen.
Empirical results on the GLUE benchmark show that it achieves over 98% of full fine-tuning accuracy while training only 0.022%–0.033% of the total parameters.

The Hadamard Adapter is a parameter-efficient adapter module designed for fine-tuning pre-trained language models (PLMs) by augmenting their self-attention outputs via element-wise linear transformations. Characterized by its simplicity and minimal parameter footprint, the Hadamard Adapter maintains competitive downstream performance while training only a minuscule fraction of the parameters required by full fine-tuning and prior adapter-based methods (Chen et al., 2024).

1. Architectural Placement and Role in Transformer Layers

The Hadamard Adapter is inserted into each Transformer layer immediately after the multi-head self-attention block and before the LayerNorm sub-layer that follows it (the intermediate LayerNorm). The core Transformer computations at layer $\ell$ proceed as usual:

$Q_\ell^h = Q W_\ell^Q$
$K_\ell^h = K W_\ell^K$
$V_\ell^h = V W_\ell^V$
$A_\ell = \text{Concat}_{h=1}^H \left( \text{softmax}\left(\frac{Q_\ell^h (K_\ell^h)^\top}{\sqrt{d_k}}\right) V_\ell^h \right )$

Instead of routing $A_\ell$ directly to the feed-forward (intermediate) block, the Hadamard Adapter applies an element-wise transformation before $A_\ell$ undergoes LayerNorm and subsequent processing. All base LM parameters are frozen; only the adapter and certain LayerNorm weights are updated during tuning.

2. Mathematical Formulation

At each layer $\ell$, let $A_\ell \in \mathbb{R}^{B \times T \times D}$ denote the self-attention output, with batch size $B$, sequence length $T$, and hidden dimension $D$. $A_\ell$ is reshaped to $(B T) \times D$ (denoted $A_\ell^\text{flat}$) and then transformed via:

$A_\ell^\text{flat'} = W^{(\ell)} \odot A_\ell^\text{flat} + b^{(\ell)}$

where

$W^{(\ell)} \in \mathbb{R}^D$ is a learnable scaling vector (initialized to all-ones),
$b^{(\ell)} \in \mathbb{R}^D$ is a learnable bias vector (initialized to all-zeros),
$\odot$ denotes Hadamard product with broadcasting over the rows.

After the transformation, $A_\ell^\text{flat'}$ is reshaped back to $B \times T \times D$ as $A_\ell'$, and forwarded through the remaining layer modules. In short,

$A_\ell' = \text{reshape}\big(W^{(\ell)} \odot \text{reshape}(A_\ell) + b^{(\ell)}\big)$

No other Transformer weights (Q/K/V projections, output matrices, feed-forward layers) are updated.
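The transformation above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `hadamard_adapter` and the toy sizes are chosen here for illustration, and gradients and training are omitted.

```python
import numpy as np

def hadamard_adapter(A, W, b):
    """Apply A' = W ⊙ A + b along the hidden dimension.

    A: (B, T, D) self-attention output; W, b: (D,) learnable vectors.
    """
    B, T, D = A.shape
    A_flat = A.reshape(B * T, D)         # flatten to (B*T, D)
    A_flat_out = W * A_flat + b          # W and b broadcast over the rows
    return A_flat_out.reshape(B, T, D)   # restore (B, T, D)

# Toy sizes, chosen for illustration only.
B, T, D = 2, 3, 4
rng = np.random.default_rng(0)
A = rng.standard_normal((B, T, D))

W = np.ones(D)    # scaling vector initialized to all-ones
b = np.zeros(D)   # bias vector initialized to all-zeros

out_init = hadamard_adapter(A, W, b)                 # identity at initialization
out_scaled = hadamard_adapter(A, 2.0 * W, b + 1.0)   # element-wise 2*A + 1
```

With the all-ones and all-zeros initialization the adapter leaves the attention output unchanged, so tuning starts from the behavior of the frozen base model.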

3. Parameter Analysis and Comparative Efficiency

The Hadamard Adapter's parameter count constitutes a significant reduction over prior methods:

Method	Parameters Tuned	% of Full Model
Hadamard Adapter, BERT-base (adapter + LayerNorm)	36,864	0.033%
Hadamard Adapter, BERT-large	analogous scaling	—
Standard Adapter	0.5–2M	1–2%
LoRA	~220k–330k	0.2–0.3%
BitFit	—	0.09%
Hadamard Adapter (relative)	3–8× fewer than LoRA/BitFit	—

Per layer, the adapter adds $D$ (scaling) + $D$ (bias) parameters; for BERT-base ($D=768$), this amounts to 1,536 parameters per layer, or 18,432 for 12 layers. The two $D$-dimensional LayerNorm vectors (gain and bias) per layer contribute another 18,432, for a total of 36,864 parameters, about 0.033% of BERT-base's 110M parameters. Further reduction is achieved by pruning adapter layers (Section 6), yielding configurations that train only 0.022% of the parameters.
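The arithmetic in this paragraph can be reproduced with a short helper. The function name and dictionary keys below are illustrative choices, and the 110M figure is the approximate model size quoted in the text rather than an exact parameter count.

```python
def hadamard_trainable_params(hidden_dim, num_layers, layernorm_vectors_per_layer=2):
    """Count trainable parameters for adapter (W, b) plus tuned LayerNorm vectors."""
    adapter_per_layer = 2 * hidden_dim                       # scaling W + bias b
    layernorm_per_layer = layernorm_vectors_per_layer * hidden_dim
    adapter_total = adapter_per_layer * num_layers
    layernorm_total = layernorm_per_layer * num_layers
    return {
        "adapter_per_layer": adapter_per_layer,
        "adapter_total": adapter_total,
        "layernorm_total": layernorm_total,
        "total": adapter_total + layernorm_total,
    }

BERT_BASE_PARAMS = 110_000_000  # approximate size quoted in the text

counts = hadamard_trainable_params(hidden_dim=768, num_layers=12)
fraction_pct = 100 * counts["total"] / BERT_BASE_PARAMS

print(counts)
print(f"{fraction_pct:.4f}% of the base model")
```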

4. Cross-Task Similarity and Tuning Patterns

Analyses of adapters trained across a diverse set of GLUE tasks reveal high inter-task similarity for certain components:

Adapter scaling vectors ($W^{(\ell)}$): Cosine similarity of 0.98–0.99 across tasks indicates that these gates are nearly universal and may be shared across tasks.
Adapter bias vectors ($b^{(\ell)}$): Cosine similarity drops to 0.2–0.3, suggesting task-specific adaptation is necessary.
LayerNorm gains/biases: Lower layers exhibit high similarity across tasks, but middle and upper layers diverge.
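The similarity measure used in this comparison is ordinary cosine similarity between the vectors learned for two tasks. The sketch below shows the computation only; the four-dimensional vectors are made-up placeholders, not values from the paper.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

# Made-up example vectors (NOT measured values): scaling vectors stay
# close to their all-ones initialization, biases move in different directions.
w_task_a = [1.02, 0.97, 1.10, 0.95]
w_task_b = [1.00, 0.99, 1.08, 0.96]
b_task_a = [0.03, -0.02, 0.01, 0.04]
b_task_b = [-0.01, 0.03, 0.02, -0.02]

sim_w = cosine_similarity(w_task_a, w_task_b)
sim_b = cosine_similarity(b_task_a, b_task_b)
print(f"scaling similarity: {sim_w:.3f}, bias similarity: {sim_b:.3f}")
```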

Recommended Tuning Patterns:

Share $W^{(\ell)}$ globally and learn only $b^{(\ell)}$ per task, halving the per-task adapter parameter count.
Freeze early adapter layers ($\ell = 1 \ldots 4$) and tune only $\ell = 5 \ldots 12$.
Reuse LayerNorm gain across $\ell = 1 \ldots 6$ and keep it task-specific for $\ell = 7 \ldots 12$.

A configuration combining these patterns trains only the upper adapter layers together with the associated biases and LayerNorm parameters (about 0.022% of parameters), with minimal accuracy drop.
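The first pattern, a shared scaling vector with task-specific biases, can be sketched as a small container. The class `SharedScaleAdapterBank` and its method names are hypothetical and exist only for this illustration; plain Python lists stand in for tensors.

```python
class SharedScaleAdapterBank:
    """Hypothetical store: one scaling vector per layer shared by all tasks,
    plus one bias vector per layer for each task."""

    def __init__(self, hidden_dim, num_layers):
        self.hidden_dim = hidden_dim
        self.num_layers = num_layers
        # Shared W, initialized to all-ones.
        self.shared_scale = [[1.0] * hidden_dim for _ in range(num_layers)]
        # task name -> list of per-layer bias vectors.
        self.task_bias = {}

    def add_task(self, name):
        # Task-specific b, initialized to all-zeros.
        self.task_bias[name] = [[0.0] * self.hidden_dim for _ in range(self.num_layers)]

    def per_task_adapter_params(self):
        # Only the biases are learned per task.
        return self.num_layers * self.hidden_dim

    def apply(self, task, layer, vector):
        """Compute W ⊙ x + b for one hidden vector at the given layer (0-based)."""
        W = self.shared_scale[layer]
        b = self.task_bias[task][layer]
        return [w * x + bias for w, x, bias in zip(W, vector, b)]

bank = SharedScaleAdapterBank(hidden_dim=768, num_layers=12)
bank.add_task("mrpc")
bank.add_task("rte")
per_task = bank.per_task_adapter_params()   # D per layer instead of 2D
```

Under this scheme each additional task stores $D$ adapter values per layer rather than $2D$, which is the halving mentioned in the list.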

5. Benchmarking and Empirical Results

Fine-tuning experiments on the GLUE benchmark utilized BERT-base (110M parameters) under the following protocol:

Stage 1: Full model frozen; train only pooling and classifier head for 3–5 epochs at LR $\sim 3 \times 10^{-3}$.
Stage 2: Insert Hadamard Adapter after every self-attention; unfreeze adapter $W, b$ and intermediate LayerNorm's gain/bias; train for 20 epochs at LR $\sim 5\times 10^{-4}$.

Metrics: Classification accuracy (all except CoLA, STS-B), Matthews correlation (CoLA), Pearson correlation (STS-B).
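For reference, the three metrics can be written out in plain Python. This is a sketch of the standard definitions (binary Matthews correlation, Pearson correlation, accuracy), not the evaluation code used in the experiments; the scores in the table below appear to be these quantities scaled by 100.

```python
import math

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def matthews_corrcoef(y_true, y_pred):
    """Binary Matthews correlation for labels in {0, 1}."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0

def pearson_corr(x, y):
    """Pearson correlation between two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)
```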

Model	Train	MRPC	CoLA	MNLI	QNLI	QQP	RTE	SST-2	STS-B	Avg
full	fine-tuning	89.4	56.5	83.9	91.3	87.5	64.6	93.0	88.6	81.9
adapter	adapter	90.2	58.4	80.4	89.7	85.9	71.9	92.4	88.5	82.2
classifier only	classifier only	71.8	37.0	54.4	70.6	79.3	57.4	87.4	60.3	64.8

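On BERT-base, the Hadamard Adapter reaches an average score of 82.2 against 81.9 for full fine-tuning, i.e. it matches full fine-tuning on average (about 100.4% of its score) while training only 0.033% of the parameters. It leads on MRPC, CoLA, and RTE and trails on the remaining tasks, most visibly MNLI (80.4 vs. 83.9). Similar trends are reported for other PLMs, including RoBERTa, BART, DeBERTa, and ELECTRA; compared with BitFit and LoRA, the Hadamard Adapter achieves equivalent accuracy with substantially fewer parameters.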
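The Avg column and the per-task gaps can be rechecked directly from the table rows. The snippet below only re-derives numbers from the scores listed above; the variable names and the row label `"adapter"` mirror the table and are otherwise arbitrary.

```python
TASKS = ["MRPC", "CoLA", "MNLI", "QNLI", "QQP", "RTE", "SST-2", "STS-B"]

# Scores copied from the table above (BERT-base on GLUE).
SCORES = {
    "full": [89.4, 56.5, 83.9, 91.3, 87.5, 64.6, 93.0, 88.6],
    "adapter": [90.2, 58.4, 80.4, 89.7, 85.9, 71.9, 92.4, 88.5],
    "classifier only": [71.8, 37.0, 54.4, 70.6, 79.3, 57.4, 87.4, 60.3],
}

# Unrounded means over the eight tasks.
averages = {name: sum(vals) / len(vals) for name, vals in SCORES.items()}

# Per-task difference: adapter minus full fine-tuning.
deltas = {
    task: adapter - full
    for task, adapter, full in zip(TASKS, SCORES["adapter"], SCORES["full"])
}
adapter_wins = [task for task in TASKS if deltas[task] > 0]

for name, avg in averages.items():
    print(f"{name}: {avg:.2f}")
print("adapter ahead on:", adapter_wins)
```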

6. Pruning Redundant Adapter Layers

Empirical layer ablation studies indicate that tuning only the upper $K$ adapter layers (with associated LayerNorms) suffices for maximal performance:

For BERT-base, $K = 8$ (layers 5–12) achieves parity with full-layer tuning ($K=12$); $K<8$ incurs a loss of more than 1–2 points.
For larger PLMs, $K \approx 50\%$ of layers suffices.

The resulting configuration, in which adapter modules in the bottom layers are discarded (i.e., $W, b$ frozen at $1, 0$), reduces the trainable total to 24,576 parameters (8 layers × 3,072 per layer, $\sim 0.022\%$) while maintaining over 98% of full fine-tuning accuracy.
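The pruned parameter count follows from keeping only the top $K$ layers. The helper names below are illustrative, layer indices are 1-based to match the text, and the 110M model size is the approximate figure used earlier.

```python
def trainable_layers(num_layers, top_k):
    """1-based indices of the top_k uppermost layers."""
    return list(range(num_layers - top_k + 1, num_layers + 1))

def pruned_param_count(hidden_dim, num_layers, top_k):
    """Each kept layer trains four D-dim vectors: adapter W, b and LayerNorm gain, bias."""
    return 4 * hidden_dim * len(trainable_layers(num_layers, top_k))

BERT_BASE_PARAMS = 110_000_000  # approximate size quoted in the text

kept = trainable_layers(num_layers=12, top_k=8)
pruned = pruned_param_count(hidden_dim=768, num_layers=12, top_k=8)
pruned_pct = 100 * pruned / BERT_BASE_PARAMS

print(kept, pruned, f"{pruned_pct:.4f}%")
```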

7. Implementation Recommendations and Best Practices

Key operational guidelines include:

Employ two-stage training (classifier head first, then adapter + LayerNorm); joint tuning lowers performance by 0.5–1 point.
Initialize adapter scaling weights to ones and biases to zeros so that the adapter starts as an identity transformation.
Adopt small learning rates ($1\times10^{-4}$–$5\times10^{-4}$) for the adapter and LayerNorm, and higher ones ($3\times10^{-3}$–$5\times10^{-3}$) for the classifier head.
Unfreeze only the intermediate LayerNorm, not the attention-output LayerNorm.
For multi-task or continual learning scenarios, share scaling gates across tasks, keeping biases task-specific to halve adapter parameters.
Optionally freeze adapters in lower layers where cross-task behavior is invariant.
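These guidelines amount to a freezing policy that selects parameters by name and by depth. The sketch below is one way to express it, under loud assumptions: all parameter names (`hadamard.`, `intermediate_layernorm.`, `attention_output_layernorm.`, and so on) are hypothetical placeholders rather than names from any real library, the epoch and learning-rate values are single points picked from the ranges in the text, and freezing the head during stage 2 is an assumption the text does not state.

```python
import re

# Hypothetical stage definitions; name fragments are placeholders.
STAGES = {
    "head": {"trainable": ("pooler.", "classifier."), "lr": 3e-3, "epochs": 5},
    "adapter": {"trainable": ("hadamard.", "intermediate_layernorm."), "lr": 5e-4, "epochs": 20},
}

def layer_index(param_name):
    """Return the 1-based layer index embedded in the name, or None."""
    match = re.search(r"layer\.(\d+)\.", param_name)
    return int(match.group(1)) if match else None

def is_trainable(param_name, stage, first_tuned_layer=1):
    """Decide whether a parameter is updated in the given stage."""
    if not any(fragment in param_name for fragment in STAGES[stage]["trainable"]):
        return False
    idx = layer_index(param_name)
    # Parameters outside the encoder layers (idx is None) are not filtered by depth.
    return idx is None or idx >= first_tuned_layer

# Hypothetical parameter names for illustration.
example_names = [
    "encoder.layer.2.hadamard.scale",
    "encoder.layer.9.hadamard.scale",
    "encoder.layer.9.hadamard.bias",
    "encoder.layer.9.intermediate_layernorm.weight",
    "encoder.layer.9.attention_output_layernorm.weight",
    "encoder.layer.9.attention.query.weight",
    "pooler.dense.weight",
    "classifier.weight",
]

stage1 = [n for n in example_names if is_trainable(n, "head")]
stage2 = [n for n in example_names if is_trainable(n, "adapter", first_tuned_layer=5)]
```

In a real training loop the same predicate would typically drive which parameters receive gradients and which learning rate each group uses.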

Collectively, the Hadamard Adapter is highly parameter-efficient: it requires only $2D$ adapter parameters per layer (scaling and bias vectors) yet recovers nearly all of the accuracy of full-model fine-tuning on standard language understanding benchmarks (Chen et al., 2024).

References (1)

Chen et al. (2024). Hadamard Adapter: An Extreme Parameter-Efficient Adapter Tuning Method for Pre-trained Language Models.