
LLaMA-Excitor: Parameter-Efficient Fine-Tuning

  • LLaMA-Excitor is a PEFT approach that enhances LLaMA models by inserting trainable Excitor blocks into the self-attention path for targeted feature reweighting.
  • It updates only around 7.6M parameters (<0.1% of the model) while keeping the base weights frozen, ensuring robust retention of pre-trained knowledge.
  • Empirical evaluations demonstrate improved accuracy on language and vision benchmarks with low computational overhead and strong cross-dataset consistency.

LLaMA-Excitor is a parameter-efficient fine-tuning (PEFT) method explicitly designed to adapt pre-trained LLaMA LLMs by modulating self-attention mechanisms through indirect feature interaction. It achieves this by inserting a lightweight, trainable bypass module—the Excitor block—into the attention computation, allowing targeted up-weighting of salient input information without direct modification of hidden representations or base model parameters. LLaMA-Excitor has demonstrated strong performance in both language-only and multimodal (vision–language) instruction following tasks, with empirical evidence of enhanced task-specific capacity while preserving pre-trained model knowledge and minimizing parameter and compute overhead (Zou et al., 2024, Abdullah et al., 14 Oct 2025).

1. Mechanism and Architectural Integration

LLaMA-Excitor modifies the self-attention computation in Transformer decoder layers by introducing an Excitor block operating in parallel to the frozen attention path. At each layer $l$ of a frozen LLaMA model:

  • Input token embeddings $T_l \in \mathbb{R}^{M\times C}$ are projected to Query, Key, and Value matrices via the original, frozen weights:

$$\text{Query} = W_q(T_l),\quad \text{Key} = W_k(T_l),\quad \text{Value} = W_v(T_l)$$

  • Excitor injects a learnable set of prompts $P_l \in \mathbb{R}^{K\times C}$ and a scalar gate $g_l$. It reconstructs an auxiliary key, $\mathrm{Key}_{\text{extra}}$, from $P_l$ and projects the input embeddings via a low-rank bottleneck.

The Excitor block outputs an auxiliary similarity matrix

$$S_l^{\text{extra}} = (\text{Query}\cdot\mathrm{Key}_{\text{extra}}^{T})/\sqrt{C}$$

which is gated and added to the original similarity $S_l = (\text{Query}\cdot\text{Key}^{T})/\sqrt{C}$ to form the final augmented attention:

$$S_l^g = \mathrm{Softmax}\bigl(S_l + g_l \cdot S_l^{\text{extra}}\bigr)$$

The post-attention output is then

$$T_l^{\text{out}} = S_l^g \cdot \text{Value}$$

Crucially, the value vectors themselves remain unchanged; only their dynamic mixing, as determined by the attended tokens, is altered. This allows indirect feature interaction: the Excitor module does not inject new hidden states but selectively re-weights how frozen representations are used (Zou et al., 2024).
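
A minimal PyTorch-style sketch of this computation is given below. It is a reconstruction from the equations above rather than the authors' released implementation: the single-head simplification, the module name `ExcitorAttention`, the prompt count and bottleneck rank, and the exact way $\mathrm{Key}_{\text{extra}}$ is assembled from the prompts and the low-rank projection of the input are all assumptions.

```python
import math
import torch
import torch.nn as nn


class ExcitorAttention(nn.Module):
    """Single-head sketch of Excitor-augmented self-attention (illustrative)."""

    def __init__(self, dim: int, n_prompts: int = 16, bottleneck: int = 32):
        super().__init__()
        # Frozen Q/K/V projections standing in for the pre-trained LLaMA layer.
        self.wq = nn.Linear(dim, dim, bias=False)
        self.wk = nn.Linear(dim, dim, bias=False)
        self.wv = nn.Linear(dim, dim, bias=False)
        for lin in (self.wq, self.wk, self.wv):
            lin.weight.requires_grad = False

        # Trainable Excitor parameters: learnable prompts P_l, a low-rank
        # bottleneck, and a scalar gate g_l, zero-initialized so the block
        # has no effect at the start of training.
        self.prompts = nn.Parameter(torch.randn(n_prompts, dim) * 0.02)
        self.down = nn.Linear(dim, bottleneck, bias=False)
        self.up = nn.Linear(bottleneck, dim, bias=False)
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (M, C) token embeddings of one sequence.
        q, k, v = self.wq(x), self.wk(x), self.wv(x)
        scale = math.sqrt(x.shape[-1])

        s = q @ k.transpose(-1, -2) / scale                            # original S_l, (M, M)

        # One plausible reconstruction of Key_extra from the prompts and the
        # low-rank projection of the input; the exact recipe is an assumption.
        key_extra = self.up(self.down(x)) + self.prompts.mean(dim=0)   # (M, C)
        s_extra = q @ key_extra.transpose(-1, -2) / scale              # S_l^extra, (M, M)

        # Gate and add the auxiliary logits, then mix the *unchanged* values.
        attn = torch.softmax(s + self.gate * s_extra, dim=-1)
        return attn @ v
```

Because the gate starts at zero, `ExcitorAttention(4096)(torch.randn(16, 4096))` initially reproduces the frozen attention output exactly; only `prompts`, `down`, `up`, and `gate` carry gradients during fine-tuning.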

2. Training Methodology

LLaMA-Excitor is trained by updating only the Excitor block parameters (per-layer prompts $P_l$, low-rank bottleneck projections, and gates $g_l$), with all base LLaMA weights strictly frozen. The following datasets and hyperparameters were used for empirical evaluation:

  • Language-only tuning: Stanford Alpaca (52K machine-generated instruction–response pairs).
  • Multimodal tuning: MSCOCO (0.6M image captioning pairs), LLaVA665k (0.66M visual instruction-following pairs), ScienceQA (21K multimodal multiple-choice questions).

Optimization settings included 5 training epochs on 8×A100 GPUs, a batch size of 64, weight decay of 0.02, and decoding with top-$p$ sampling at temperature 0.1. The total trainable parameter count for a LLaMA-7B configuration with Excitor blocks inserted into 30 layers is approximately 7.6M, less than 0.1% of the model, which compares favorably to adapter or LoRA alternatives (Zou et al., 2024, Abdullah et al., 14 Oct 2025).
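
A minimal sketch of this setup is shown below, assuming a model object in which Excitor parameters can be identified by an "excitor" substring in their names; that naming convention and the `llama_with_excitor` object are hypothetical and used only for illustration.

```python
import torch.nn as nn


def mark_excitor_trainable(model: nn.Module) -> None:
    """Freeze all base LLaMA weights; leave only Excitor parameters trainable."""
    for name, param in model.named_parameters():
        param.requires_grad = "excitor" in name.lower()


def count_trainable(model: nn.Module) -> int:
    """Count parameters that will actually receive gradients."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


# Usage with a hypothetical wrapped model:
# mark_excitor_trainable(llama_with_excitor)
# print(f"{count_trainable(llama_with_excitor) / 1e6:.1f}M trainable")  # ~7.6M expected
```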

3. Preservation of Base Model Knowledge

A defining attribute of LLaMA-Excitor is empirical preservation—or improvement—of base model capabilities. In contrast to adapter and LoRA PEFT methods, which typically degrade pre-trained LLM accuracy on out-of-domain or general tasks, LLaMA-Excitor demonstrated:

  • Zero or positive retention on MMLU: fine-tuned LLaMA-Excitor-7B on Alpaca-52K exhibited a +3.12% MMLU accuracy increase (from ~35.1% to ~38.2%), whereas Adapter and LoRA variants dropped by 3–6% (Zou et al., 2024, Abdullah et al., 14 Oct 2025).
  • Cross-dataset consistency: post-finetuning accuracy on ARC, HellaSwag, TruthfulQA, and MMLU remained within ±1%. Competing methods exhibited substantial performance deterioration.
  • Ablations on the Excitor projection path showed that the best results occurred when prompts were used as keys/values without additional projection, and that altering the order of the token and prompt projections could adversely affect retention.

These findings substantiate LLaMA-Excitor's core hypothesis that biasing attention, rather than directly perturbing hidden representations, confers robust out-of-distribution generalization and resistance to catastrophic forgetting.
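
A concrete way to see part of this claim follows from the formulation in Section 1 together with the zero-initialized gate noted in Section 6: at initialization $g_l = 0$, so

$$S_l^g = \mathrm{Softmax}(S_l + 0 \cdot S_l^{\text{extra}}) = \mathrm{Softmax}(S_l),$$

and every layer reproduces the frozen LLaMA attention exactly; any deviation learned during fine-tuning is an additive bias on the attention logits rather than a modification of the hidden representations themselves.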

4. Performance on Evaluation Benchmarks

LLaMA-Excitor achieved competitive or state-of-the-art results across several benchmarks:

| Task/Dataset | Metric | Excitor Score | Notable Comparison |
| --- | --- | --- | --- |
| MMLU (language) | mAcc | +3.12% over baseline | Adapter/LoRA: 3–6% drop |
| MS-COCO Image Caption | CIDEr | 157.5 | BLIP-2: 145.3 |
| ScienceQA | Acc. | 88.39% (w/ LoRA & CLIP) | LLaVA (13B): 90.92% |
| VQA-v2 | Acc. | 83.6% | +3.6% over prior SOTA |
| GQA | Acc. | 62.1% | Competitive with SOTA |

The MS-COCO result is notable: despite lacking vision–language pretraining or alignment modules, Excitor outperformed prior models (Flamingo, mPLUG-Owl2, BLIP-2) by as much as 12.2 CIDEr (Zou et al., 2024). ScienceQA accuracy was comparable to or better than that of much larger, more extensively trained models, and similarly strong results were obtained on VQA-v2 and GQA.

5. Parameter and Computational Efficiency

With roughly 7.6M total trainable parameters (under 0.1% of LLaMA-7B), LLaMA-Excitor is highly parameter-efficient. Comparisons from the original studies:

| Method | Trainable Params (LLaMA-7B) | Overhead (%) |
| --- | --- | --- |
| Excitor | ~7.6M | <0.1 |
| Prefix-tuning | ~3.9M | <0.1 |
| LoRA | ~4M | <0.1 |
| Adapter | ≥20M | 0.3+ |

Per-iteration training cost is close to that of the frozen base model, since gradients are propagated only for the small set of Excitor parameters. Inference overhead is likewise minimal: Excitor adds only an incremental per-layer prompt-key similarity calculation without expanding the representational or computational graph, and empirical end-to-end latency overhead is under 5% (Zou et al., 2024).
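
A back-of-envelope calculation, using assumed shapes rather than measurements from the papers, illustrates why the added prompt-key similarity is cheap relative to a full decoder layer:

```python
# Rough per-layer multiply-accumulate counts for LLaMA-7B-like shapes.
# M = sequence length, C = hidden size, C_FF = SwiGLU FFN dim, R = bottleneck rank.
# All four values are assumptions chosen for illustration.
M, C, C_FF, R = 512, 4096, 11008, 32

layer = 4 * M * C * C        # Q/K/V/O projections
layer += 2 * M * M * C       # attention logits + value mixing
layer += 3 * M * C * C_FF    # SwiGLU feed-forward

excitor = 2 * M * C * R      # low-rank down/up projection of the input
excitor += M * M * C         # auxiliary Query · Key_extra^T logits

print(f"added compute per layer: {100 * excitor / layer:.1f}%")  # ≈ 1.1%
```

Under these assumptions the extra work is on the order of 1% of a layer's compute, consistent with the reported sub-5% end-to-end latency overhead.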

6. Applications, Strengths, and Limitations

LLaMA-Excitor is particularly well-suited for:

  • Instruction-following via synthetic, ambiguous, or noisy prompt data.
  • Multi-modal and vision–language tasks (e.g., image captioning, VQA) where minimal model perturbation is required.
  • Scenarios where catastrophic forgetting or base knowledge retention is critical, such as domain adaptation and cross-task transfer.
  • Reasoning-intensive tasks, e.g., chain-of-thought prompting, where targeted attention reweighting aids intermediate computation (Abdullah et al., 14 Oct 2025).

Principal strengths include an ultra-lightweight parameter footprint, strict zero-initialization that preserves base model behavior at initialization, and applicability to both language-only and multimodal adaptation. Identified limitations are:

  • Validation to date is focused on LLaMA(2)-7B; generalizability to larger or structurally distinct architectures remains to be demonstrated.
  • The bias-in-logits paradigm may be less expressive for tasks demanding extensive generative adaptation or deep cross-modal alignment.
  • Visual feature integration is so far limited to last-layer CLIP features; multi-scale or hierarchical prompts may enhance results.
  • Potential for hallucination in generated content, e.g., spurious image captions.
  • Combining Excitor with other PEFT methods (LoRA, QLoRA) improves some multimodal metrics but may re-introduce knowledge loss in language-only settings.

Future directions identified include: extending Excitor to Mixture-of-Experts layers, joint application with quantized PEFT, automated selection of optimal attention layers, and further compression for on-device adaptation (Zou et al., 2024, Abdullah et al., 14 Oct 2025).
