Hadamard Adapter in Transformers
- Hadamard Adapter is a parameter-efficient module that fine-tunes pre-trained language models by applying element-wise linear transformations to self-attention outputs.
- It is inserted between multi-head self-attention and LayerNorm layers, allowing only minimal adapter and select LayerNorm parameters to be updated while freezing the base model.
- Empirical results on benchmarks like GLUE show that it achieves over 98% of full fine-tuning accuracy using only 0.022%-0.033% of total parameters.
The Hadamard Adapter is a parameter-efficient adapter module designed for fine-tuning pre-trained language models (PLMs) by augmenting their self-attention outputs via element-wise linear transformations. Characterized by its simplicity and minimal parameter footprint, the Hadamard Adapter maintains competitive downstream performance while training only a small fraction of the parameters required by full fine-tuning and prior adapter-based methods (Chen et al., 2024).
1. Architectural Placement and Role in Transformer Layers
The Hadamard Adapter is injected into each Transformer layer immediately after the multi-head self-attention block and before the LayerNorm sub-layer that follows it (the intermediate LayerNorm). Self-attention at each layer $l$ is computed as usual.
Instead of routing the attention output directly onward to the feed-forward (intermediate) block, the Hadamard Adapter applies an element-wise transformation to it before it undergoes LayerNorm and subsequent processing. All base LM parameters are frozen; only the adapter and certain LayerNorm weights are updated during tuning.
2. Mathematical Formulation
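At each layer $l$, let $A^{(l)} \in \mathbb{R}^{B \times T \times D}$ denote the self-attention output, with batch size $B$, sequence length $T$, and hidden dimension $D$. The tensor $A^{(l)}$ is reshaped to $(B \cdot T) \times D$ (denoted $\tilde{A}^{(l)}$) and then transformed via:
$$\tilde{Z}^{(l)} = \tilde{A}^{(l)} \odot w^{(l)} + b^{(l)},$$
where
- $w^{(l)} \in \mathbb{R}^{D}$ is a learnable scaling vector (initialized to all-ones),
- $b^{(l)} \in \mathbb{R}^{D}$ is a learnable bias vector (initialized to all-zeros),
- $\odot$ denotes the Hadamard product with broadcasting over the rows.
After the transformation, $\tilde{Z}^{(l)}$ is reshaped back to $B \times T \times D$ as $Z^{(l)}$, and forwarded through the remaining layer modules. In short,
$$Z^{(l)} = A^{(l)} \odot w^{(l)} + b^{(l)}.$$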
No other Transformer weights (Q/K/V projections, output matrices, feed-forward layers) are updated.
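As a rough illustration of the transformation described above, the following NumPy sketch applies a per-feature scale and bias to a batch of attention outputs. The function name `hadamard_adapter`, the variable names `w` and `b`, and the toy shapes are assumptions made for this example rather than identifiers from the original work; training and the surrounding Transformer layer are omitted.

```python
import numpy as np

def hadamard_adapter(attn_out, w, b):
    """Apply an element-wise scale and shift to attention outputs.

    attn_out: array of shape (B, T, D), the self-attention output.
    w, b:     arrays of shape (D,), the scaling and bias vectors.
    """
    B, T, D = attn_out.shape
    flat = attn_out.reshape(B * T, D)   # flatten batch and sequence axes
    flat = flat * w + b                 # w and b broadcast over the rows
    return flat.reshape(B, T, D)        # restore the original shape

# Toy shapes, chosen only for illustration.
B, T, D = 2, 4, 8
rng = np.random.default_rng(0)
attn_out = rng.standard_normal((B, T, D))

# Identity initialization: ones for the scale, zeros for the bias.
w = np.ones(D)
b = np.zeros(D)
out = hadamard_adapter(attn_out, w, b)
```

With the identity initialization the adapter leaves the attention output unchanged, so tuning starts from the behavior of the frozen model.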
3. Parameter Analysis and Comparative Efficiency
The Hadamard Adapter's parameter count constitutes a significant reduction over prior methods:
| Model / Method | Parameters Tuned (Adapter + LayerNorm where applicable) | % of Full Model |
|---|---|---|
| BERT-base | 36,864 | 0.033% |
| BERT-large | analogous scaling | — |
| Standard Adapter | 0.5–2M | 1–2% |
| LoRA | ~220k–330k | 0.2–0.3% |
| BitFit | — | 0.09% |
| Hadamard Adapter | 3–8× fewer than LoRA/BitFit | — |
Per layer, the adapter adds $D$ (scaling) + $D$ (bias) parameters; for BERT-base ($D = 768$), this amounts to 1,536 parameters per layer, or 18,432 for 12 layers. The two $D$-dimensional LayerNorm vectors (gain and bias) tuned in each layer contribute another 18,432, for a total of 36,864 parameters, or about 0.033% of BERT-base's 110M parameters. Further reduction is achieved by pruning adapter layers (Section 6), yielding configurations with only 0.022% of the parameters.
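The count above can be reproduced with a few lines of arithmetic. This is a minimal sketch; the function name `hadamard_trainable_params` is made up for the example, and the 110M total is the rounded figure quoted in the text.

```python
def hadamard_trainable_params(hidden_dim, num_layers):
    """Count trainable parameters: adapter scale/bias plus LayerNorm gain/bias."""
    adapter_per_layer = 2 * hidden_dim      # scaling vector + bias vector
    layernorm_per_layer = 2 * hidden_dim    # LayerNorm gain + bias
    return num_layers * (adapter_per_layer + layernorm_per_layer)

# BERT-base figures quoted in the text: D = 768, 12 layers, ~110M parameters.
total = hadamard_trainable_params(hidden_dim=768, num_layers=12)
share = 100 * total / 110_000_000           # percentage of the full model

print(total)                # 36864
print(f"{share:.4f}%")      # 0.0335%
```

The computed share is about 0.0335%, in line with the roughly 0.033% quoted above given that the 110M total is itself rounded.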
4. Task Similarity, Adapter Sharing, and Tuning Patterns
Analyses of adapters trained across a diverse set of GLUE tasks reveal high inter-task similarity for certain components:
- Adapter scaling vectors ($w$): Cosine similarity of 0.98–0.99 across tasks indicates that these gates are nearly universal and may be shared across tasks.
- Adapter bias vectors ($b$): Cosine similarity drops to 0.2–0.3, suggesting task-specific adaptation is necessary.
- LayerNorm gains/biases: Lower layers exhibit high similarity across tasks, but middle and upper layers diverge.
Recommended Tuning Patterns:
- Share the scaling vector $w$ globally and learn only the bias $b$ per task, halving the per-task adapter parameter count.
- Freeze the adapters in the early layers and tune only those in the upper layers.
- Reuse LayerNorm gains across tasks in the lower layers and keep them task-specific in the upper layers.
A configuration combining these patterns tunes fewer than half of the adapter layers and only the top biases and LayerNorms (0.022% of parameters), with minimal accuracy drop.
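The first pattern, one scaling vector shared by all tasks with a separate bias per task, might be organized as in the sketch below. The class `SharedScaleAdapter`, its method names, and the task labels are hypothetical and serve only to show the bookkeeping; no training loop is included.

```python
import numpy as np

class SharedScaleAdapter:
    """One scaling vector shared by all tasks, one bias vector per task."""

    def __init__(self, hidden_dim, tasks):
        self.hidden_dim = hidden_dim
        self.w = np.ones(hidden_dim)                              # shared across tasks
        self.b = {task: np.zeros(hidden_dim) for task in tasks}   # task-specific

    def __call__(self, attn_out, task):
        # attn_out has shape (..., hidden_dim); w and b broadcast over leading axes.
        return attn_out * self.w + self.b[task]

    def extra_params_per_task(self):
        # Only the bias is task-specific, so each task adds D values instead of 2D.
        return self.hidden_dim

adapter = SharedScaleAdapter(hidden_dim=768, tasks=["mrpc", "rte"])
x = np.ones((2, 4, 768))
adapter.b["rte"] += 0.5          # stand-in for a bias learned on one task
y_mrpc = adapter(x, "mrpc")
y_rte = adapter(x, "rte")
```

Because only the bias dictionary grows with the number of tasks, each additional task costs $D$ adapter parameters per layer rather than $2D$.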
5. Benchmarking and Empirical Results
Fine-tuning experiments on the GLUE benchmark utilized BERT-base (110M parameters) under the following protocol:
- Stage 1: Full model frozen; train only the pooling layer and classifier head for 3–5 epochs at a comparatively high learning rate.
- Stage 2: Insert a Hadamard Adapter after every self-attention block; unfreeze the adapter and the intermediate LayerNorm's gain/bias; train for 20 epochs at a smaller learning rate.
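The two stages differ only in which parameters are left unfrozen, which can be expressed as a name filter. The sketch below is framework-free and uses hypothetical parameter names (`hadamard_adapter.`, `intermediate_layernorm.`, and so on) that do not correspond to any particular library. Whether the classifier head remains trainable in stage 2 is not stated in the protocol, so the sketch treats it as an assumption behind a flag.

```python
def trainable_names(param_names, stage, keep_head_trainable=False):
    """Return the parameter names left unfrozen in a given training stage.

    The substrings below are hypothetical naming conventions for this sketch.
    """
    head = ("pooler.", "classifier.")
    adapter = ("hadamard_adapter.", "intermediate_layernorm.")
    if stage == 1:
        keep = head
    elif stage == 2:
        # Assumption: the head is frozen again in stage 2 unless the flag is set.
        keep = adapter + head if keep_head_trainable else adapter
    else:
        raise ValueError("stage must be 1 or 2")
    return [name for name in param_names if any(key in name for key in keep)]


# Hypothetical parameter names for a single layer plus the task head.
names = [
    "encoder.layer.0.attention.query.weight",
    "encoder.layer.0.hadamard_adapter.w",
    "encoder.layer.0.hadamard_adapter.b",
    "encoder.layer.0.attention_output_layernorm.weight",
    "encoder.layer.0.intermediate_layernorm.weight",
    "encoder.layer.0.intermediate_layernorm.bias",
    "pooler.dense.weight",
    "classifier.weight",
]
stage1 = trainable_names(names, stage=1)
stage2 = trainable_names(names, stage=2)
```

In a real training script, the returned names would be used to set the corresponding parameters as trainable and to freeze everything else before building the optimizer.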
Metrics: Classification accuracy (all except CoLA, STS-B), Matthews correlation (CoLA), Pearson correlation (STS-B).
| Model | Train | MRPC | CoLA | MNLI | QNLI | QQP | RTE | SST-2 | STS-B | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| full | fine-tuning | 89.4 | 56.5 | 83.9 | 91.3 | 87.5 | 64.6 | 93.0 | 88.6 | 81.9 |
| adapter | adapter | 90.2 | 58.4 | 80.4 | 89.7 | 85.9 | 71.9 | 92.4 | 88.5 | 82.2 |
| classifier only | classifier only | 71.8 | 37.0 | 54.4 | 70.6 | 79.3 | 57.4 | 87.4 | 60.3 | 64.8 |
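On BERT-base, the Hadamard Adapter's average score in the table (82.2) is on par with full fine-tuning (81.9) while training only 0.033% of the parameters: it gains on MRPC, CoLA, and RTE and trails on the remaining tasks, most visibly MNLI (80.4 vs. 83.9). Overall, the adapter is reported to recover 99.4% of full fine-tuning performance. Similar trends are reported for other PLMs, including RoBERTa, BART, DeBERTa, and ELECTRA, and relative to BitFit and LoRA the Hadamard Adapter reaches equivalent accuracy with substantially fewer parameters.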
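The Avg column can be recomputed directly from the per-task scores. The short sketch below copies the three rows of the table and takes unweighted means; the dictionary keys are labels chosen for this example.

```python
# Per-task scores in table order: MRPC, CoLA, MNLI, QNLI, QQP, RTE, SST-2, STS-B.
scores = {
    "full fine-tuning": [89.4, 56.5, 83.9, 91.3, 87.5, 64.6, 93.0, 88.6],
    "adapter":          [90.2, 58.4, 80.4, 89.7, 85.9, 71.9, 92.4, 88.5],
    "classifier only":  [71.8, 37.0, 54.4, 70.6, 79.3, 57.4, 87.4, 60.3],
}

# Unweighted mean over the eight tasks for each row.
averages = {name: sum(vals) / len(vals) for name, vals in scores.items()}

# Adapter average relative to the full fine-tuning average.
ratio = averages["adapter"] / averages["full fine-tuning"]

for name, avg in averages.items():
    print(name, avg)
print("adapter / full:", ratio)
```

The means mix accuracy with Matthews and Pearson correlations, so they are a summary convenience rather than a single well-defined metric.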
6. Pruning Redundant Adapter Layers
Empirical layer ablation studies indicate that tuning only the upper adapter layers (with associated LayerNorms) suffices for maximal performance:
- For BERT-base, tuning only the upper eight adapter layers (layers 5–12) achieves parity with tuning all 12 layers; pruning more aggressively incurs a 1–2 point loss.
- For larger PLMs, tuning only a fraction of the layers likewise suffices.
The resulting configuration, in which adapter modules in the bottom layers are discarded (i.e., frozen at their identity initialization, scale $1$ and bias $0$), reduces total trainables to 24,576 parameters (0.022%) while maintaining 98% of full fine-tuning accuracy.
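The 24,576 figure can be checked against the per-layer counts from Section 3, assuming each of the eight retained layers (5–12) keeps both its adapter vectors and its LayerNorm vectors, with hidden dimension $D = 768$ and the rounded 110M total for BERT-base:

$$
8 \times \big(\underbrace{2 \cdot 768}_{\text{adapter}} + \underbrace{2 \cdot 768}_{\text{LayerNorm}}\big) = 8 \times 3{,}072 = 24{,}576,
\qquad
\frac{24{,}576}{110 \times 10^{6}} \approx 0.022\%.
$$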
7. Implementation Recommendations and Best Practices
Key operational guidelines include:
- Employ two-stage training (classifier head followed by adapter+LayerNorm), as joint tuning yields lower performance by 0.5–1 point.
- Initialize adapter scaling weights to ones and biases to zeros for identity transformation at start.
- Adopt small learning rates for the adapter and LayerNorm parameters, and higher ones for the classifier head.
- Unfreeze only the intermediate LayerNorm, not the attention-output LayerNorm.
- For multi-task or continual learning scenarios, share scaling gates across tasks, keeping biases task-specific to halve adapter parameters.
- Optionally freeze adapters in lower layers where cross-task behavior is invariant.
Collectively, the Hadamard Adapter is highly parameter-efficient, requiring only $2D$ adapter parameters per layer (scaling and bias vectors) yet recovering nearly all of the accuracy of full-model fine-tuning across standard language understanding benchmarks (Chen et al., 2024).