
VAEmotionLLM: Vision Anchored Audio-Visual Emotion Model

Updated 19 November 2025
  • The paper introduces a two-stage architecture employing vision-guided audio alignment and a cross-modal emotion adapter to achieve state-of-the-art emotion recognition.
  • A lightweight audio adapter and EmoAdapter enable efficient freeze-tuning of the underlying vision-language model with minimal data requirements.
  • Empirical results on ArtEmoBenchmark demonstrate significant accuracy improvements across audio, visual, and joint modalities, underscoring its scalability and generalizability.

Vision Anchored Audio-Visual Emotion LLM (VAEmotionLLM) is a two-stage LLM architecture designed to endow vision-LLMs with robust “hearing” and cross-modal emotion understanding across visual, audio, and combined artistic domains. Its pivotal innovation is to leverage vision as a supervisory anchor for audio, achieving high emotion recognition and reasoning accuracy with minimal data and compact adaptation modules. VAEmotionLLM establishes a new state-of-the-art on artistic emotion evaluation benchmarks and demonstrates substantial scalability and generalizability within limited-data regimes (Zhang et al., 15 Nov 2025).

1. Architectural Overview

VAEmotionLLM is built atop a frozen vision-LLM (VLM), consisting of a vision encoder $f_v(\cdot)$ (e.g., InternVL) and an LLM $F_\theta(\cdot)$, both parameter-frozen throughout training. Two new learnable components are introduced:

  1. Audio Adapter $g_\phi(\cdot)$: a parameterized network that connects to a pretrained, frozen audio encoder $e_a(\cdot)$. This adapter enables the LLM to process audio via an aligned latent space.
  2. Cross-Modal Emotion Adapter ("EmoAdapter"): a lightweight module composed of an Emotion Enhancer and an Emotion Supervisor, responsible for injecting emotion-sensitive residuals and imposing explicit emotion-aware supervision.

Inputs are video frames $x^v$ and audio spectrograms $x^a$. Training proceeds in two stages:

  • Stage 1 (Vision-Guided Audio Alignment, VG-Align): aligns audio-encoded representations with their visual counterparts through response-level next-token distribution matching.
  • Stage 2 (EmoAdapter): explicitly models and enhances cross-modal emotion representations with residual MLP injections and Valence-Arousal supervision, yielding a cross-modally competent LLM.

All VLM and audio encoder weights remain frozen; only the Audio Adapter, EmoAdapter, and minor LoRA parameters in the LLM are updated (Zhang et al., 15 Nov 2025).
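
As a rough illustration of this freeze-tuning scheme, the following PyTorch-style sketch shows which parts carry gradients; the adapter's two-layer form, the dimensions, and the function names are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class AudioAdapter(nn.Module):
    """Trainable g_phi: projects frozen audio-encoder features into the LLM token space.
    The two-layer MLP form and the dimensions are illustrative assumptions."""

    def __init__(self, d_audio: int = 1024, d_llm: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(d_audio, d_llm),
            nn.GELU(),
            nn.Linear(d_llm, d_llm),
        )

    def forward(self, audio_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: (batch, L_a, d_audio) produced by the frozen encoder e_a
        return self.proj(audio_feats)


def freeze_backbones(*modules: nn.Module) -> None:
    """Freeze f_v, e_a, and F_theta; only adapters (and small LoRA matrices) stay trainable."""
    for module in modules:
        for p in module.parameters():
            p.requires_grad = False
```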

2. Stage 1: Vision-Guided Audio Alignment (VG-Align)

Objective: Enable the LLM to “hear by seeing”, i.e., to produce similar output distributions for synchronized audio and video inputs by treating the VLM’s response as the supervisory anchor.

  • Teacher Pathway (Video): $z_v = f_v(x^v) \rightarrow F_\theta(s, z_v) \rightarrow \ell_v^{(t)}$ (next-token logits at step $t$).
  • Student Pathway (Audio): $z_a = g_\phi(e_a(x^a)) \rightarrow F_\theta(s, z_a) \rightarrow \ell_a^{(t)}$.

The trainable parameters $\phi$ in $g_\phi(\cdot)$ are optimized to minimize the soft cross-entropy (KL divergence) between the next-token distributions of the video and audio paths:

$$\mathcal{L}_{\rm align} = \frac{1}{T}\sum_{t=1}^T \mathrm{CE}\big(\sigma(\ell_v^{(t)}/\tau),\, \sigma(\ell_a^{(t)}/\tau)\big) = \mathbb{E}_{(v, a)}\big[\mathrm{KL}\big(P_{\rm LLM}(\cdot \mid v)\,\|\,P_{\rm LLM}(\cdot \mid a)\big)\big]$$

where $T$ is the number of LLM output positions, $\sigma$ denotes the softmax, and $\tau$ is a temperature. Critically, this achieves response-level alignment without requiring massive audio pretraining: approximately 70.5 hours of unlabeled, synchronized audio-video data (VGGSound) suffice (Zhang et al., 15 Nov 2025).
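
A minimal sketch of this alignment objective, assuming per-sample logit tensors of shape (T, V) and treating the video path as a detached teacher (the shapes, the detaching, and the reduction are assumptions consistent with response-level distillation):

```python
import torch
import torch.nn.functional as F

def vg_align_loss(video_logits: torch.Tensor,
                  audio_logits: torch.Tensor,
                  tau: float = 1.0) -> torch.Tensor:
    """Soft cross-entropy between the teacher (video) and student (audio) next-token
    distributions, averaged over the T output positions."""
    teacher = F.softmax(video_logits.detach() / tau, dim=-1)     # sigma(l_v^(t) / tau)
    student_logp = F.log_softmax(audio_logits / tau, dim=-1)     # log sigma(l_a^(t) / tau)
    # CE(teacher, student) = H(teacher) + KL(teacher || student); only the KL term has gradients.
    return -(teacher * student_logp).sum(dim=-1).mean()
```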

3. Stage 2: Cross-Modal Emotion Adapter (EmoAdapter)

Following alignment, cross-modal emotional semantics are injected and supervised.

  • Emotion Enhancer: A residual two-layer MLP (weights $W_1$, $W_2$) is applied to the aligned tokens $\mathbf{z}_m \in \mathbb{R}^{L_m \times d}$ across all modalities $m \in \{v, a, av\}$:

$$\mathrm{Enh}(\mathbf{z}_m) = \mathbf{z}_m + W_2\,\mathrm{GELU}\big(W_1\,\mathrm{LN}(\mathbf{z}_m)\big)$$

This process injects learnable, modality-agnostic emotion perturbations.
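
A direct reading of this formula as a PyTorch module might look as follows; the hidden width is an assumption, and the paper's exact parameterization may differ:

```python
import torch
import torch.nn as nn

class EmotionEnhancer(nn.Module):
    """Residual two-layer MLP over aligned tokens z_m of shape (L_m, d):
    Enh(z_m) = z_m + W2 * GELU(W1 * LN(z_m))."""

    def __init__(self, d_model: int, d_hidden: int = 1024):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.w1 = nn.Linear(d_model, d_hidden)
        self.w2 = nn.Linear(d_hidden, d_model)
        self.act = nn.GELU()

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return z + self.w2(self.act(self.w1(self.norm(z))))
```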

  • Emotion Supervisor: Enhanced tokens are pooled, $\mathbf{s}_m = \frac{1}{L_m} \sum_{t=1}^{L_m} \widetilde{\mathbf{z}}_m^{(t)}$, then mapped via $g_\psi(\cdot)$ to a continuous Valence-Arousal vector $\hat{\mathbf{y}}_m \in \mathbb{R}^2$. The supervision loss is

$$\mathcal{L}_{\rm emo} = \frac{1}{|\mathcal{M}|} \sum_{m\in\mathcal{M}} \big\|\hat{\mathbf{y}}_m - \mathbf{y}_m\big\|_2^2$$

where $\mathcal{M} \subseteq \{a, v, av\}$. Final training jointly optimizes the LLM modeling loss $\mathcal{L}_{\rm LM}$ and the emotion supervision loss with $\lambda = 1$:

$$\mathcal{L} = \mathcal{L}_{\rm LM} + \lambda\,\mathcal{L}_{\rm emo}$$

All enhancements are performed via adapters/freeze-tuning, promoting scalability and modularity (Zhang et al., 15 Nov 2025).
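
A hedged sketch of the Emotion Supervisor and the joint objective, under the same assumptions as above (head width, the dictionary-of-modalities interface, and batch handling are illustrative, not taken from the paper):

```python
import torch
import torch.nn as nn

class EmotionSupervisor(nn.Module):
    """g_psi: mean-pools enhanced tokens and regresses a (valence, arousal) pair."""

    def __init__(self, d_model: int, d_hidden: int = 256):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.GELU(),
            nn.Linear(d_hidden, 2),
        )

    def forward(self, z_enhanced: torch.Tensor) -> torch.Tensor:
        s = z_enhanced.mean(dim=-2)   # pool over the L_m token positions -> s_m
        return self.head(s)           # predicted VA vector y_hat_m


def joint_loss(lm_loss: torch.Tensor,
               va_preds: dict,
               va_targets: dict,
               lam: float = 1.0) -> torch.Tensor:
    """L = L_LM + lambda * L_emo, with L_emo the squared VA error averaged over the
    modalities present in M (a subset of {'a', 'v', 'av'})."""
    per_modality = [((va_preds[m] - va_targets[m]) ** 2).sum(dim=-1).mean() for m in va_preds]
    return lm_loss + lam * torch.stack(per_modality).mean()
```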

4. ArtEmoBenchmark and Empirical Evaluation

ArtEmoBenchmark is a large-scale evaluation protocol specifically constructed for art-centric, audio-visual emotion understanding.

  • Dataset: 1,200 multiple-choice questions across 1,432 film clips, each paired with its background score and visuals.
  • Modalities: Audio-only (A), Visual-only (V), Audio-Visual Joint (AV).
  • Question Types: Overall content, overall emotion, specific content, specific emotion; partitioned into audio-centric, video-centric, and cross-modal groups.
  • Metrics: Accuracy averaged over all groups/modalities.

State-of-the-art Results (Accuracy, %):

| Modality | VAEmotionLLM | Prior Best Baseline | Delta |
|---|---|---|---|
| A-only | 77.0 | Qwen2-Audio: 69.0 | +8.0 |
| V-only | 87.8 | InternVL2.5: 83.0 | +4.8 |
| AV joint | 53.5 | AffectGPT: 42.0 | +11.5 |
| Overall mean | 72.8 | Prior best: 63.0 | +9.8 |

These results demonstrate both robust single-modality performance and unprecedented joint-modality (AV) competence (Zhang et al., 15 Nov 2025).

5. Ablation Analysis

Ablation studies confirm the necessity and complementarity of model components.

| Components | A-only | V-only | AV | Overall |
|---|---|---|---|---|
| Audio Adapter only | 29.5 | 72.3 | — | — |
| Adapter + LoRA | 45.8 | 72.0 | 37.5 | 51.8 |
| EmoEnhancer only (+LoRA) | 52.3 | 76.8 | 43.8 | 57.6 |
| EmoSupervisor only | 46.8 | 73.0 | 38.3 | 52.7 |
| Full Model (Adapter+LoRA+Enh.+Sup.) | 77.0 | 87.8 | 53.5 | 72.8 |

  • Removing the Audio Adapter collapses A-only performance.
  • LoRA in the LLM boosts A-only accuracy to approximately 46%.
  • Adding Emotion Enhancer or Supervisor yields moderate gains; both together provide additive and complementary improvements (+20.1 over Supervisor alone, +15.2 over Enhancer alone), affirming cross-modal synergy (Zhang et al., 15 Nov 2025).

6. Scalability, Significance, and Extensions

Scalability: The VG-Align strategy requires only approximately 70 hours of synchronized audio-visual data to endow vision-LLMs with hearing, in contrast to the 100k+ hours demanded by prior AVLM frameworks. This enables rapid adaptation and re-use of frozen VLM backbones.

Cross-modal Emotion Modeling: The shared residuals in the EmoAdapter enforce a joint emotional representation, capturing nuanced relationships between soundtrack and visual cues (e.g., the affective impact of musical motifs in visual context), which offloads the challenge of cross-modal reasoning from the LLM core to lightweight adapters.

Design Trade-offs: Minimal audio pretraining and largely frozen backbones yield highly competitive single-modality performance, while the adapter structure restores strong multimodal competence with negligible overhead.

Extension Directions: VAEmotionLLM suggests promising paths for

  • Longer-range temporal and narrative reasoning,
  • Richer or hierarchical emotion spaces (beyond two-dimensional Valence-Arousal representations),
  • Transfer to additional artistic and performative domains (such as dance, theatre, interactive installations) (Zhang et al., 15 Nov 2025).

7. Relationship to Parallel Approaches and Outlook

VAEmotionLLM differs from earlier and parallel approaches, such as Omni-Emotion and AV-EmoDialog, by (1) focusing on efficient cross-modal alignment with limited data via vision anchoring, (2) employing lightweight adapters for emotion reasoning instead of large-scale end-to-end training, and (3) targeting artistic (rather than strictly human-centered) emotion expression (Yang et al., 16 Jan 2025, Park et al., 23 Dec 2024).

Other methods emphasize multi-instruction tuning, detailed face and micro-expression modeling, or cascaded fusion networks, often requiring more extensive data and computational resources (Tan et al., 22 Aug 2025, Cheng et al., 5 May 2025). VAEmotionLLM’s alignment and adapter mechanisms offer a complementary, efficient paradigm, suggestive of broader model reusability and rapid deployment potential.

In summary, VAEmotionLLM demonstrates that it is possible to efficiently train vision-LLMs to perceive, interpret, and reason about emotion across audio and visual modalities—achieving state-of-the-art emotion understanding in art and other complex, cross-modal domains with modest data and computational demands (Zhang et al., 15 Nov 2025).
