LLaMA-Excitor: Parameter-Efficient Fine-Tuning
- LLaMA-Excitor is a PEFT approach that enhances LLaMA models by inserting trainable Excitor blocks into the self-attention path for targeted feature reweighting.
- It updates only around 7.6M parameters (<0.1% of the model) while keeping the base weights frozen, ensuring robust retention of pre-trained knowledge.
- Empirical evaluations demonstrate improved accuracy on language and vision benchmarks with low computational overhead and strong cross-dataset consistency.
LLaMA-Excitor is a parameter-efficient fine-tuning (PEFT) method explicitly designed to adapt pre-trained LLaMA LLMs by modulating self-attention mechanisms through indirect feature interaction. It achieves this by inserting a lightweight, trainable bypass module—the Excitor block—into the attention computation, allowing targeted up-weighting of salient input information without direct modification of hidden representations or base model parameters. LLaMA-Excitor has demonstrated strong performance in both language-only and multimodal (vision–language) instruction following tasks, with empirical evidence of enhanced task-specific capacity while preserving pre-trained model knowledge and minimizing parameter and compute overhead (Zou et al., 2024, Abdullah et al., 14 Oct 2025).
1. Mechanism and Architectural Integration
LLaMA-Excitor modifies the self-attention computation in Transformer decoder layers by introducing an Excitor block operating in parallel to the frozen attention path. At each layer of a frozen LLaMA model:
- Input token embeddings $X$ are projected to Query, Key, and Value matrices via the original, frozen weights:

$$Q = XW_Q, \qquad K = XW_K, \qquad V = XW_V.$$

- Excitor injects a learnable set of prompts $P$ and a scalar gate $g$. It reconstructs an auxiliary key $\tilde{K}$ from $P$ and projects the input embeddings via a low-rank bottleneck.

The Excitor block outputs an auxiliary similarity matrix

$$\Delta S = \frac{Q\tilde{K}^{\top}}{\sqrt{d}},$$

which is gated and added to the original similarity $S = QK^{\top}/\sqrt{d}$ to form the final augmented attention. The post-attention output is

$$O = \mathrm{softmax}\!\left(S + \tanh(g)\,\Delta S\right)V,$$

with $g$ initialized to zero, so the Excitor contributes nothing at initialization and the model exactly reproduces the frozen LLaMA.
Crucially, the value vectors themselves remain unchanged; only the attention weights that mix them are altered. This enables indirect feature interaction: the Excitor module does not inject new hidden states but selectively re-weights how frozen representations are used (Zou et al., 2024).
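The computation above can be made concrete with a minimal, single-head PyTorch sketch. The reconstruction of $\tilde{K}$ (bottleneck-projected tokens attending over bottleneck-projected prompts, whose weights mix the prompts into one key per token) is one plausible reading of the description above, not the authors' released implementation; module and parameter names (`ExcitorAttention`, `prompts`, `down`, `gate`) are illustrative.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class ExcitorAttention(nn.Module):
    """Single-head sketch of Excitor-augmented self-attention.

    The Q/K/V projections stand in for frozen pre-trained weights; only the
    prompts, the low-rank bottleneck, and the gate are trainable.
    """

    def __init__(self, dim: int, n_prompts: int = 16, rank: int = 8):
        super().__init__()
        # Stand-ins for the frozen base projections W_Q, W_K, W_V.
        self.w_q = nn.Linear(dim, dim, bias=False)
        self.w_k = nn.Linear(dim, dim, bias=False)
        self.w_v = nn.Linear(dim, dim, bias=False)
        for lin in (self.w_q, self.w_k, self.w_v):
            lin.weight.requires_grad = False

        # Trainable Excitor parameters: prompts P, bottleneck, gate g.
        self.prompts = nn.Parameter(0.02 * torch.randn(n_prompts, dim))
        self.down = nn.Linear(dim, rank, bias=False)  # low-rank bottleneck
        self.gate = nn.Parameter(torch.zeros(1))      # zero-init: no-op at step 0

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        d = x.size(-1)
        q, k, v = self.w_q(x), self.w_k(x), self.w_v(x)

        # Original similarity logits S = Q K^T / sqrt(d).
        s = q @ k.transpose(-2, -1) / math.sqrt(d)

        # Reconstruct a per-token auxiliary key K~ from the prompts.
        z = self.down(x)                                         # (..., L, r)
        p = self.down(self.prompts)                              # (P, r)
        w = F.softmax(z @ p.t() / math.sqrt(p.size(-1)), dim=-1)  # token->prompt weights
        k_aux = w @ self.prompts                                 # (..., L, dim)

        # Auxiliary similarity Delta S, gated and added to S; values untouched.
        delta = q @ k_aux.transpose(-2, -1) / math.sqrt(d)
        attn = F.softmax(s + torch.tanh(self.gate) * delta, dim=-1)
        return attn @ v
```

Because `gate` starts at zero, the module initially computes plain frozen attention, consistent with the strict zero-initialization property discussed in Section 6.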
2. Training Methodology
LLaMA-Excitor is trained by updating only the Excitor block parameters (per-layer prompts $P_\ell$, bottleneck projections, and gates $g_\ell$), with all base LLaMA weights strictly frozen. The following datasets and hyperparameters were used for empirical evaluation:
- Language-only tuning: Stanford Alpaca (52K machine-generated instruction–response pairs).
- Multimodal tuning: MSCOCO (0.6M image captioning pairs), LLaVA665k (0.66M visual instruction-following pairs), ScienceQA (21K multimodal multiple-choice questions).
Optimization settings included 5 training epochs on 8×A100 GPUs, a batch size of 64, and weight decay of 0.02, with low-temperature decoding (temperature 0.1) at evaluation. The total trainable parameter count for a LLaMA-7B configuration with Excitor applied across 30 transformer layers is approximately 7.6M, less than 0.1% of the model, which compares favorably to adapter and LoRA alternatives (Zou et al., 2024, Abdullah et al., 14 Oct 2025). A sketch of the corresponding freeze-and-train setup follows.
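In practice, the recipe reduces to marking only Excitor parameters as trainable before building the optimizer. A minimal sketch, assuming Excitor parameters can be identified by an `excitor` substring in their names (the released code may organize them differently):

```python
import torch


def configure_excitor_training(model: torch.nn.Module) -> list[torch.nn.Parameter]:
    """Freeze all base weights; return only Excitor parameters for the optimizer."""
    trainable = []
    for name, param in model.named_parameters():
        is_excitor = "excitor" in name  # prompts, bottleneck projections, gates
        param.requires_grad = is_excitor
        if is_excitor:
            trainable.append(param)
    return trainable


# Reported recipe (batch size 64, weight decay 0.02, 5 epochs), e.g.:
# optimizer = torch.optim.AdamW(configure_excitor_training(model), weight_decay=0.02)
```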
3. Preservation of Base Model Knowledge
A defining attribute of LLaMA-Excitor is empirical preservation—or improvement—of base model capabilities. In contrast to adapter and LoRA PEFT methods, which typically degrade pre-trained LLM accuracy on out-of-domain or general tasks, LLaMA-Excitor demonstrated:
- Zero or positive retention on MMLU: fine-tuned LLaMA-Excitor-7B on Alpaca-52K exhibited a +3.12% MMLU accuracy increase (from ~35.1% to ~38.2%), whereas Adapter and LoRA variants dropped by 3–6% (Zou et al., 2024, Abdullah et al., 14 Oct 2025).
- Cross-dataset consistency: post-finetuning accuracy on ARC, HellaSwag, TruthfulQA, and MMLU remained within ±1% of the unmodified base model, whereas competing methods exhibited substantial deterioration.
- Ablations of the Excitor projection path showed that the best results occurred when prompts were used directly as keys/values without additional projection, and that changing the order of the token and prompt projections could harm retention.
These findings substantiate LLaMA-Excitor's core hypothesis that biasing attention, rather than directly perturbing hidden representations, confers robust out-of-distribution generalization and resistance to catastrophic forgetting.
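The contrast with direct-update PEFT can be made explicit. In LoRA, the adapted hidden state is perturbed additively, whereas Excitor leaves the value pathway intact and biases only the attention logits (notation as in Section 1; the LoRA form is the standard one, not from the Excitor paper):

$$\text{LoRA:}\quad h = W_0 x + BA\,x \qquad \text{vs.} \qquad \text{Excitor:}\quad O = \mathrm{softmax}\!\left(S + \tanh(g)\,\Delta S\right)V.$$

Since $\Delta S$ enters before the softmax, the output remains a convex combination of the frozen value vectors, which is one way to see why Excitor cannot push representations off the pre-trained manifold.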
4. Performance on Evaluation Benchmarks
LLaMA-Excitor achieved competitive or state-of-the-art results across several benchmarks:
| Task/Dataset | Metric | Excitor Score | Notable Comparison |
|---|---|---|---|
| MMLU (language) | mAcc | +3.12% over baseline | Adapter/LoRA: 3–6% drop |
| MS-COCO Image Caption | CIDEr | 157.5 | BLIP-2: 145.3 |
| ScienceQA | Acc. | 88.39% (w/ LoRA&CLIP) | LLaVA (13B): 90.92% |
| VQA-v2 | Acc. | 83.6% | +3.6% over prior SOTA |
| GQA | Acc. | 62.1% | Competitive with SOTA |
The MS-COCO result is notable: despite lacking vision–language pretraining or alignment modules, Excitor outperformed prior models (Flamingo, mPLUG-Owl2, BLIP-2) by as much as 12.2 CIDEr (Zou et al., 2024). ScienceQA accuracy was comparable to or better than that of much larger and more extensively trained models, and similar gains were observed on VQA-v2 and GQA.
5. Parameter and Computational Efficiency
With roughly 7.6M total trainable parameters (<0.1% of LLaMA-7B), LLaMA-Excitor is highly parameter-efficient. Comparisons from the original studies:
| Method | Trainable Params (LLaMA-7B) | Overhead (%) |
|---|---|---|
| Excitor | ~7.6M | <0.1 |
| Prefix-tuning | ~3.9M | <0.1 |
| LoRA | ~4M | <0.1 |
| Adapter | ≥20M | 0.3+ |
Training cost per iteration is close to that of the frozen base model, since gradients propagate only through the small set of Excitor parameters. Inference overhead is likewise minimal: Excitor requires only an incremental per-layer prompt-key similarity calculation without expanding the representational or computational graph, and empirical end-to-end latency overhead is under 5% (Zou et al., 2024). The parameter budget can be verified with a small utility, sketched below.
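A quick way to sanity-check the reported <0.1% budget is to compare trainable and total parameter counts after freezing; a minimal helper (hypothetical, not from the paper's codebase):

```python
import torch


def parameter_overhead(model: torch.nn.Module) -> float:
    """Return the trainable/total parameter ratio; Excitor should report <0.001."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"trainable: {trainable / 1e6:.1f}M of {total / 1e6:.0f}M "
          f"({100 * trainable / total:.3f}%)")
    return trainable / total
```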
6. Applications, Strengths, and Limitations
LLaMA-Excitor is particularly well-suited for:
- Instruction-following via synthetic, ambiguous, or noisy prompt data.
- Multi-modal and vision–language tasks (e.g., image captioning, VQA) where minimal model perturbation is required.
- Scenarios where catastrophic forgetting must be avoided and base knowledge retention is critical, such as domain adaptation and cross-task transfer.
- Reasoning-intensive tasks, e.g., chain-of-thought prompting, where targeted attention reweighting aids intermediate computation (Abdullah et al., 14 Oct 2025).
Principal strengths include ultra-lightweight footprint, strict zero-initialization preserving base model behavior at init, and applicability to both language-only and multimodal adaptation. Limitations identified are:
- Validation to date is focused on LLaMA(2)-7B; generalizability to larger or structurally distinct architectures remains to be demonstrated.
- The bias-in-logits paradigm may be less expressive for tasks demanding extensive generative adaptation or deep cross-modal alignment.
- Visual feature integration is so far limited to last-layer CLIP features; multi-scale or hierarchical prompts may enhance results.
- Potential for hallucination in generated content, e.g., spurious image captions.
- Combining with other PEFT (LoRA, QLoRA) improves some multi-modal metrics but may re-introduce knowledge loss in language-only settings.
Future directions identified include: extending Excitor to Mixture-of-Experts layers, joint application with quantized PEFT, automated selection of optimal attention layers, and further compression for on-device adaptation (Zou et al., 2024, Abdullah et al., 14 Oct 2025).