LLM-Enhanced Action-Aware Multi-modal Prompt Tuning
- The paper introduces LLM-enhanced prompt tuning that leverages LLM-generated action triplets and state descriptions to capture detailed, compositional interactions across modalities.
- It employs efficient LayerNorm tuning, updating only about 2.5% of parameters, to achieve over 20% average performance gains relative to LoRA while reducing GPU memory usage by 17.6%.
- Empirical results on COCO and Flickr benchmarks demonstrate state-of-the-art retrieval metrics, validating the approach’s effectiveness in action-aware multi-modal adaptation.
LLM-enhanced Action-aware Multi-modal Prompt Tuning refers to a set of methodologies that leverage LLMs to inject context-rich, action-level reasoning and compositional knowledge into multi-modal systems, aiming to optimize both performance and efficiency for tasks requiring understanding across modalities (e.g., text, vision, speech, emotion). These techniques focus on enriching model prompts—specialized input tokens or embeddings—so that models not only align global semantics across modalities, but also capture fine-grained actions, states, and interactions, often in a compositional or temporally grounded manner. The most impactful approach combines the prompt construction power of LLMs with architectural adaptations and efficient tuning strategies inside large multi-modal models.
1. Foundations and Motivation
Traditional multi-modal models—most notably contrastive vision-language pretraining frameworks such as CLIP—align images and text in a global, image-level or sentence-level semantic space. However, this alignment fails to capture detailed interactions, object-level attributes, spatial reasoning, and especially actions: states and relations between entities that are critical for event comprehension and compositional understanding (Tian et al., 30 Jun 2025). Recent prompt learning methods attempt to address these limitations by introducing learnable prompts or injected context that push models toward structured representations. Nonetheless, most approaches remain action-insensitive and struggle with compositional or temporally aware alignment.
LLM-enhanced action-aware multi-modal prompt tuning surmounts these barriers by:
- Using LLMs to parse actionable knowledge from raw descriptions, yielding compositional (subject-action-object) and causal state prompts;
- Designing adaptive interaction modules that condition multi-modal feature aggregation on action-aware knowledge;
- Integrating efficient parameter tuning strategies (e.g. LayerNorm) to scale adaptation to large models without excessive resource demands (Zhao et al., 2023).
2. LLM-generated Action Prompts and Semantic Enrichment
A core innovation is the construction of action triplet and state prompts using external LLMs (e.g., GPT-3.5, Vicuna) (Tian et al., 30 Jun 2025). Instead of relying solely on the original textual captions, LLMs parse and decompose:
- Action Triplets—subject, action, object relations capturing the compositional semantics of observed events;
- Action State Descriptions—detailed causal or contextual language explicating states or dynamic properties of actions.
Generation pipeline:
$P_{\text{act}} = \mathrm{LLM}(I_{\text{act}}, C), \quad P_{\text{state}} = \mathrm{LLM}(I_{\text{state}}, C)$
where $I_{\text{act}}$ and $I_{\text{state}}$ are instruction templates and $C$ is the source caption.
Both prompt types are encoded into specialized embeddings using CLIP tokenization and transformer-based triplet adapters, and are aligned with the vision-language model via MLP adapters and layer normalization. This enables each encoder layer to receive action-centric cues, supporting compositional reasoning and facilitating disambiguation among visually or linguistically similar entities performing distinct actions.
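As a rough illustration of this generation step, the sketch below shows how the two prompt types could be obtained from a caption; the instruction wordings and the `query_llm` helper are illustrative assumptions, not the exact prompts used in the paper.

```python
# Sketch of the LLM-based prompt-generation stage.
# `query_llm` stands in for any chat-completion call (e.g., to GPT-3.5 or Vicuna);
# the instruction templates below are illustrative, not the paper's exact wording.
from typing import Callable, Dict, List

ACTION_TRIPLET_INSTRUCTION = (
    "Extract (subject, action, object) triplets that describe the events in "
    "the following caption. Return one triplet per line:\n{caption}"
)
ACTION_STATE_INSTRUCTION = (
    "Describe the states and causal effects of the actions mentioned in the "
    "following caption in one or two short sentences:\n{caption}"
)

def generate_action_prompts(caption: str,
                            query_llm: Callable[[str], str]) -> Dict[str, object]:
    """Produce action triplet and action state prompts for a single caption."""
    triplet_text = query_llm(ACTION_TRIPLET_INSTRUCTION.format(caption=caption))
    state_text = query_llm(ACTION_STATE_INSTRUCTION.format(caption=caption))
    triplets: List[str] = [line.strip() for line in triplet_text.splitlines() if line.strip()]
    return {"action_triplets": triplets, "action_states": state_text.strip()}
```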
3. Adaptive Interaction Modules for Action-conditioned Feature Aggregation
To exploit the injected action-aware knowledge, an adaptive interaction module aggregates attentive visual features conditioned explicitly on LLM-derived prompts. The mechanism includes:
- Splitting the image into patches and encoding them as patch tokens $V = \{v_1, \dots, v_M\}$;
- Performing cross-attention between the image patches and the action-aware prompt embeddings $P$: $V_{\text{att}} = \mathrm{CrossAttn}(Q{=}V, K{=}P, V{=}P)$;
- Combining the attended visual features ($V_{\text{att}}$) with the original patch features via a re-weighting controlled by hyperparameter $\lambda$: $V' = \lambda V_{\text{att}} + (1-\lambda) V$;
- Concatenating with learnable visual prompts before final transformer aggregation.
This architecture localizes and amplifies salient, action-relevant cues in the visual domain, ensuring the alignment is not simply object-centric but dynamically conditioned on contextually compositional knowledge.
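A minimal PyTorch sketch of such an action-conditioned aggregation step is given below; the module name, dimensions, single attention block, and the blending formula are simplifying assumptions rather than the exact architecture of the paper.

```python
import torch
import torch.nn as nn

class ActionAwareInteraction(nn.Module):
    """Cross-attends image patch tokens to action-aware prompt embeddings
    and blends the result with the original patches via a weight lambda."""
    def __init__(self, dim: int = 768, num_heads: int = 8, lam: float = 0.5):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.lam = lam

    def forward(self, patches: torch.Tensor, action_prompts: torch.Tensor) -> torch.Tensor:
        # patches:        (B, M, dim)  -- encoded image patch tokens
        # action_prompts: (B, K, dim)  -- embedded action triplet / state prompts
        attended, _ = self.cross_attn(query=patches, key=action_prompts, value=action_prompts)
        # Re-weight: keep original patch features, amplify action-relevant cues.
        return self.lam * attended + (1.0 - self.lam) * patches

# Example usage with random tensors standing in for real features.
module = ActionAwareInteraction()
patches = torch.randn(2, 196, 768)   # e.g., 14x14 patches from a ViT-B/16
prompts = torch.randn(2, 8, 768)     # a handful of action-aware prompt embeddings
fused = module(patches, prompts)     # (2, 196, 768)
```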
4. Efficient Fine-tuning via LayerNorm and Conversational Data Selection
Scalability and efficiency are achieved through selective parameter updating (Zhao et al., 2023). Specifically, tuning only the LayerNorm parameters within attention blocks suffices for strong domain adaptation when extending LLMs to multi-modal domains. Noteworthy quantitative results:
- LayerNorm tuning updates only about 2.5% of parameters at the 13B scale (with an even smaller fraction for the simplified variant);
- It delivers over 20% average performance improvement across five multi-modal tasks compared to LoRA, while reducing GPU memory usage by 17.6%.
- Full finetuning leads to out-of-memory errors at large scale, whereas LayerNorm-based tuning remains tractable.
LayerNorm alone can almost fully substitute for vision-language connector training in multi-modal adaptation. Tuning with conversational data further increases efficiency, matching or exceeding full finetuning using only a subset of examples (e.g., LayerNorm-Conv.20k vs. Finetune-80k on MSCOCO).
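The parameter-selection logic itself is simple to implement. The following sketch assumes a PyTorch model whose normalization layers are `nn.LayerNorm` instances; LLaMA-style backbones that use RMSNorm would need an additional type check.

```python
import torch.nn as nn

def enable_layernorm_tuning(model: nn.Module) -> list:
    """Freeze all parameters, then unfreeze only LayerNorm weights and biases.
    Returns the list of trainable parameters for the optimizer."""
    for param in model.parameters():
        param.requires_grad = False

    trainable = []
    for module in model.modules():
        # Covers nn.LayerNorm; custom RMSNorm layers would need an extra check.
        if isinstance(module, nn.LayerNorm):
            for param in module.parameters():
                param.requires_grad = True
                trainable.append(param)
    return trainable

# Usage sketch: optimizer = torch.optim.AdamW(enable_layernorm_tuning(model), lr=2e-5)
```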
5. Mathematical Formulation and Training Strategy
Prompt adaptation and action-aware interaction are underpinned by contrastive and triplet losses:
- Contrastive Loss (stage 1):
$\mathcal{L}_{\text{i2t}}(i) = -\log \frac{\exp(\mathrm{sim}(z^{img}_i, z^{text}_i)/\tau)}{\sum_j \exp(\mathrm{sim}(z^{img}_i, z^{text}_j)/\tau)}$
- Triplet Loss (stage 2):
$\mathcal{L}_{\text{stage2}} = \max(d_p - d_n + \alpha, 0)$, where $d_p$ and $d_n$ are the distances to the positive and negative pairs and $\alpha$ is the margin.
Image and text representations are re-ranked with fine-grained action-aware features after initial CLIP-based retrieval.
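For concreteness, the following PyTorch sketch mirrors the two objectives above; the symmetric text-to-image term is omitted, and the temperature and margin values are illustrative defaults.

```python
import torch
import torch.nn.functional as F

def contrastive_i2t_loss(z_img: torch.Tensor, z_txt: torch.Tensor,
                         tau: float = 0.07) -> torch.Tensor:
    """Stage-1 image-to-text contrastive loss over a batch of paired embeddings."""
    z_img = F.normalize(z_img, dim=-1)
    z_txt = F.normalize(z_txt, dim=-1)
    logits = z_img @ z_txt.t() / tau                # (B, B) similarities scaled by temperature
    targets = torch.arange(z_img.size(0), device=z_img.device)
    return F.cross_entropy(logits, targets)         # matched pair sits on the diagonal

def stage2_triplet_loss(anchor: torch.Tensor, positive: torch.Tensor,
                        negative: torch.Tensor, alpha: float = 0.2) -> torch.Tensor:
    """Stage-2 margin loss: pull action-aware positives closer than negatives."""
    d_p = (anchor - positive).norm(dim=-1)          # distance to positive pair
    d_n = (anchor - negative).norm(dim=-1)          # distance to negative pair
    return torch.clamp(d_p - d_n + alpha, min=0).mean()
```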
The LayerNorm operation is defined as (cf. (Zhao et al., 2023)):
$\mathrm{LN}(x) = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} \odot \gamma + \beta$
where $\mu$ and $\sigma^2$ are the mean and variance of the input $x$, and the affine parameters $\gamma$ and $\beta$ are the only weights updated during tuning. Its gradient properties support stable, efficient optimization and favorable generalization.
6. Empirical Results and Ablations
On COCO and Flickr30K benchmarks, LLM-enhanced action-aware prompt tuning achieves state-of-the-art retrieval metrics:
| Method | COCO R@1 (I2T) | COCO R@1 (T2I) | Flickr R@1 (I2T) | Flickr R@1 (T2I) |
|---|---|---|---|---|
| CLIP B/16 | 52.4 | 33.1 | 81.2 | 62.2 |
| FineCLIP | 54.5 | 40.2 | 82.5 | 67.9 |
| SaCo | 57.8 | 39.8 | 85.5 | 69.1 |
| Ours B/16 | 58.4 | 43.2 | 88.1 | 74.7 |
Ablation confirms both action triplet and state prompts are complementary—removing either degrades performance. LLM-parsed triplets outperform hand-crafted templates, and adaptive interaction is critical relative to simple concatenation. Comparable results are achieved with smaller LLMs (e.g., Llama2-13B) as knowledge extractors.
LayerNorm tuning maintains strong performance and resource efficiency across large-scale multi-modal benchmarks; gradient-variance analysis and layer-wise cosine-similarity metrics confirm increased expressiveness.
7. Broader Implications and Future Directions
This methodology extends the capabilities of multi-modal systems:
- By leveraging LLM-injected compositional and causal knowledge, models achieve better action understanding and reasoning, crucial for tasks in vision-language, video summarization, and human-centric perception.
- Efficient LayerNorm-based tuning supports scalable deployment even on consumer-grade hardware when adapting large models for novel modalities.
- Action-aware prompt tuning underpins broader applications, including multi-modal reasoning, emotion understanding, and temporal event localization.
A plausible implication is that LLM-generated, action-aware prompt architectures—and LayerNorm-centric domain adaptation—may become universal templates for future multi-modal model design, bridging the gap between raw sensory data, external structured knowledge, and efficient, task-adaptive representation learning.