VAEmotionLLM: Vision Anchored Audio-Visual Emotion Model
- The paper introduces a two-stage architecture employing vision-guided audio alignment and a cross-modal emotion adapter to achieve state-of-the-art emotion recognition.
- A lightweight audio adapter and EmoAdapter enable efficient freeze-tuning of the underlying vision-language model with minimal data requirements.
- Empirical results on ArtEmoBenchmark demonstrate significant accuracy improvements across audio, visual, and joint modalities, underscoring its scalability and generalizability.
Vision Anchored Audio-Visual Emotion LLM (VAEmotionLLM) is a two-stage LLM architecture designed to endow vision-LLMs with robust “hearing” and cross-modal emotion understanding across visual, audio, and combined artistic domains. Its pivotal innovation is to leverage vision as a supervisory anchor for audio, achieving high emotion recognition and reasoning accuracy with minimal data and compact adaptation modules. VAEmotionLLM establishes a new state-of-the-art on artistic emotion evaluation benchmarks and demonstrates substantial scalability and generalizability within limited-data regimes (Zhang et al., 15 Nov 2025).
1. Architectural Overview
VAEmotionLLM is built atop a frozen vision-LLM (VLM), consisting of a vision encoder (e.g., InternVL) and an LLM, both parameter-frozen throughout training. Two new learnable components are introduced:
- Audio Adapter: a parameterized network attached to a pretrained, frozen audio encoder. The adapter enables the LLM to process audio via an aligned latent space.
- Cross-Modal Emotion Adapter ("EmoAdapter"): a lightweight module composed of an Emotion Enhancer and an Emotion Supervisor, responsible for injecting emotion-sensitive residuals and imposing explicit emotion-aware supervision.
Inputs are video frames and audio spectrograms. Training proceeds in two stages:
- Stage 1 (Vision-Guided Audio Alignment, VG-Align): aligns audio-encoded representations with their visual counterparts through response-level next-token distribution matching.
- Stage 2 (EmoAdapter): explicitly models and enhances cross-modal emotion representations with residual MLP injections and Valence-Arousal supervision, yielding a cross-modally competent LLM.
All VLM and audio encoder weights remain frozen; only the Audio Adapter, EmoAdapter, and minor LoRA parameters in the LLM are updated (Zhang et al., 15 Nov 2025).
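This freeze-tuning recipe amounts to excluding the backbones from the optimizer. The following is a minimal PyTorch sketch, not the authors' implementation; the module names (`audio_adapter`, `emo_adapter`, `lora_params`) and the optimizer settings are illustrative assumptions.

```python
import itertools
import torch

def build_optimizer(vision_encoder, audio_encoder, llm,
                    audio_adapter, emo_adapter, lora_params):
    """Freeze the backbones; train only the adapters and the LLM's LoRA matrices."""
    # Frozen: VLM vision encoder, pretrained audio encoder, LLM weights.
    for module in (vision_encoder, audio_encoder, llm):
        for p in module.parameters():
            p.requires_grad_(False)
    # Trainable: Audio Adapter, EmoAdapter, and the (small) LoRA parameter set.
    trainable = list(itertools.chain(audio_adapter.parameters(),
                                     emo_adapter.parameters(),
                                     lora_params))
    for p in trainable:
        p.requires_grad_(True)   # re-enable LoRA params that sit inside the frozen LLM
    return torch.optim.AdamW(trainable, lr=1e-4)  # learning rate is an assumption
```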
2. Stage 1: Vision-Guided Audio Alignment (VG-Align)
Objective: Enable the LLM to “hear by seeing”, i.e., to produce similar output distributions for synchronized audio and video inputs by treating the VLM’s response as the supervisory anchor.
- Teacher Pathway (Video): the frozen VLM encodes the video frames and the LLM produces next-token logits $z_t^{V}$ at each output step $t$.
- Student Pathway (Audio): the audio spectrogram is passed through the frozen audio encoder and the Audio Adapter, and the same LLM produces logits $z_t^{A}$.
The trainable parameters of the Audio Adapter are optimized to minimize the soft cross-entropy (KL divergence) between the next-token distributions of the video and audio paths:
$$\mathcal{L}_{\mathrm{align}} = \frac{1}{T}\sum_{t=1}^{T} \mathrm{KL}\!\left(\sigma\!\left(z_t^{V}/\tau\right)\;\middle\|\;\sigma\!\left(z_t^{A}/\tau\right)\right),$$
where $T$ is the number of LLM output positions, $\sigma$ denotes the softmax, and $\tau$ a temperature. Critically, this achieves response-level alignment without requiring massive audio pretraining: 70.5 hours of unlabeled, synchronized audio-video data (VGGSound) suffice (Zhang et al., 15 Nov 2025).
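In code, VG-Align reduces to a temperature-scaled KL between the teacher (video) and student (audio) next-token distributions. A minimal PyTorch sketch follows; the temperature value and the reduction are assumptions not fixed by the description above.

```python
import torch
import torch.nn.functional as F

def vg_align_loss(video_logits: torch.Tensor,
                  audio_logits: torch.Tensor,
                  tau: float = 2.0) -> torch.Tensor:
    """Response-level alignment loss over T output positions.

    video_logits, audio_logits: (T, vocab) next-token logits from the frozen
    LLM conditioned on the video path (teacher) and the audio path (student).
    """
    # Teacher distribution: no gradient flows into the video pathway.
    with torch.no_grad():
        p_teacher = F.softmax(video_logits / tau, dim=-1)
    # Student log-distribution: gradients update only the Audio Adapter (and LoRA).
    log_p_student = F.log_softmax(audio_logits / tau, dim=-1)
    # Soft cross-entropy (KL divergence), averaged over the T positions.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean")
```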
3. Stage 2: Cross-Modal Emotion Adapter (EmoAdapter)
Following alignment, cross-modal emotional semantics are injected and supervised.
- Emotion Enhancer: a residual two-layer MLP (weights $W_1$, $W_2$) is applied to the aligned tokens $h^{m}$ of every modality $m \in \{A, V, AV\}$:
$$\tilde{h}^{m} = h^{m} + W_2\,\phi\!\left(W_1 h^{m}\right),$$
where $\phi$ is the MLP nonlinearity. This process injects learnable, modality-agnostic emotion perturbations.
- Emotion Supervisor: the enhanced tokens are pooled over the token dimension, then mapped via a linear projection head to a continuous Valence-Arousal vector $\hat{y} = (\hat{v}, \hat{a})$. The supervision loss regresses this prediction onto the ground-truth annotation $y$:
$$\mathcal{L}_{\mathrm{emo}} = \left\lVert \hat{y} - y \right\rVert_2^2.$$
Final training jointly optimizes the LLM language-modeling loss and the emotion supervision loss with a weighting coefficient $\lambda$:
$$\mathcal{L} = \mathcal{L}_{\mathrm{LM}} + \lambda\,\mathcal{L}_{\mathrm{emo}}.$$
All enhancements are performed via adapters/freeze-tuning, promoting scalability and modularity (Zhang et al., 15 Nov 2025).
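A compact PyTorch sketch of the EmoAdapter is given below; the hidden width, activation, pooling scheme, and squared-error form of the regression loss are assumptions where the description above leaves them open.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmotionEnhancer(nn.Module):
    """Residual two-layer MLP injected on the aligned tokens of each modality."""
    def __init__(self, d_model: int, d_hidden: int = 256):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_hidden)   # W1
        self.fc2 = nn.Linear(d_hidden, d_model)   # W2

    def forward(self, h: torch.Tensor) -> torch.Tensor:   # h: (T, d_model)
        return h + self.fc2(F.gelu(self.fc1(h)))          # residual emotion perturbation

class EmotionSupervisor(nn.Module):
    """Pools the enhanced tokens and regresses a 2-D Valence-Arousal vector."""
    def __init__(self, d_model: int):
        super().__init__()
        self.head = nn.Linear(d_model, 2)                  # -> (valence, arousal)

    def forward(self, h_enh: torch.Tensor) -> torch.Tensor:
        pooled = h_enh.mean(dim=0)                         # pooling choice is assumed
        return self.head(pooled)

def stage2_loss(lm_loss: torch.Tensor,
                va_pred: torch.Tensor,
                va_target: torch.Tensor,
                lam: float = 0.1) -> torch.Tensor:
    """Joint objective L = L_LM + lambda * L_emo (lambda value is an assumption)."""
    return lm_loss + lam * F.mse_loss(va_pred, va_target)
```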
4. ArtEmoBenchmark and Empirical Evaluation
ArtEmoBenchmark is a large-scale evaluation protocol specifically constructed for art-centric, audio-visual emotion understanding.
- Dataset: 1200 multiple-choice questions across 1432 film clips, each paired with its background score and visuals.
- Modalities: Audio-only (A), Visual-only (V), Audio-Visual Joint (AV).
- Question Types: Overall content, overall emotion, specific content, specific emotion; partitioned into audio-centric, video-centric, and cross-modal groups.
- Metrics: Accuracy averaged over all groups/modalities (a minimal scoring sketch follows this list).
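As referenced in the Metrics entry, the reported number is accuracy macro-averaged over the evaluation cells. The sketch below assumes illustrative record fields (`modality`, `group`, `prediction`, `answer`); it is not the benchmark's released format.

```python
from collections import defaultdict

def artemo_accuracy(records):
    """Average accuracy over (modality, group) cells for multiple-choice records."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        key = (r["modality"], r["group"])          # e.g. ("AV", "specific emotion")
        totals[key] += 1
        hits[key] += int(r["prediction"] == r["answer"])
    per_cell = [hits[k] / totals[k] for k in totals]
    return 100.0 * sum(per_cell) / len(per_cell)   # percentage, macro-averaged
```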
State-of-the-art Results (Accuracy, %):
| Modality | VAEmotionLLM | Prior Best Baseline | Delta |
|---|---|---|---|
| A-only | 77.0 | Qwen2-Audio: 69.0 | +8.0 |
| V-only | 87.8 | InternVL2.5: 83.0 | +4.8 |
| AV joint | 53.5 | AffectGPT: 42.0 | +11.5 |
| Overall mean | 72.8 | Prior best: 63.0 | +9.8 |
These results demonstrate both robust single-modality performance and unprecedented joint-modality (AV) competence (Zhang et al., 15 Nov 2025).
5. Ablation Analysis
Ablation studies confirm the necessity and complementarity of the model's components (all values are accuracy, %).
| Components | A-only | V-only | AV | Overall |
|---|---|---|---|---|
| Audio Adapter only | 29.5 | 72.3 | – | – |
| Adapter + LoRA | 45.8 | 72.0 | 37.5 | 51.8 |
| EmoEnhancer only (+LoRA) | 52.3 | 76.8 | 43.8 | 57.6 |
| EmoSupervisor only | 46.8 | 73.0 | 38.3 | 52.7 |
| Full Model (Adapter+LoRA+Enh.+Sup.) | 77.0 | 87.8 | 53.5 | 72.8 |
- Removing the Audio Adapter collapses A-only performance.
- Adding LoRA in the LLM lifts A-only accuracy to 45.8%.
- Adding Emotion Enhancer or Supervisor yields moderate gains; both together provide additive and complementary improvements (+20.1 over Supervisor alone, +15.2 over Enhancer alone), affirming cross-modal synergy (Zhang et al., 15 Nov 2025).
6. Scalability, Significance, and Extensions
Scalability: The VG-Align strategy requires only 70.5 hours of synchronized audio-visual data to endow vision-LLMs with hearing, in contrast to the 100k+ hours demanded by prior AVLM frameworks. This enables rapid adaptation and reuse of frozen VLM backbones.
Cross-modal Emotion Modeling: The shared residuals in the EmoAdapter enforce a joint emotional representation, capturing nuanced relationships between soundtrack and visual cues (e.g., the affective impact of musical motifs in visual context), which offloads the challenge of cross-modal reasoning from the LLM core to lightweight adapters.
Design Trade-offs: Minimal audio pretraining and largely frozen backbones yield highly competitive single-modality performance, while the adapter structure restores strong multimodal competence with negligible overhead.
Extension Directions: VAEmotionLLM suggests promising paths for
- Longer-range temporal and narrative reasoning,
- Richer or hierarchical emotion spaces (beyond two-dimensional Valence-Arousal representations),
- Transfer to additional artistic and performative domains (such as dance, theatre, interactive installations) (Zhang et al., 15 Nov 2025).
7. Relationship to Parallel Approaches and Outlook
VAEmotionLLM differs from earlier and parallel approaches, such as Omni-Emotion and AV-EmoDialog, by (1) focusing on efficient cross-modal alignment with limited data via vision anchoring, (2) employing lightweight adapters for emotion reasoning instead of large-scale end-to-end training, and (3) targeting artistic (rather than strictly human-centered) emotion expression (Yang et al., 16 Jan 2025, Park et al., 23 Dec 2024).
Other methods emphasize multi-instruction tuning, detailed face and micro-expression modeling, or cascaded fusion networks, often requiring more extensive data and computational resources (Tan et al., 22 Aug 2025, Cheng et al., 5 May 2025). VAEmotionLLM’s alignment and adapter mechanisms offer a complementary, efficient paradigm, suggestive of broader model reusability and rapid deployment potential.
In summary, VAEmotionLLM demonstrates that it is possible to efficiently train vision-LLMs to perceive, interpret, and reason about emotion across audio and visual modalities—achieving state-of-the-art emotion understanding in art and other complex, cross-modal domains with modest data and computational demands (Zhang et al., 15 Nov 2025).