DAMER: Dual-View Music Emotion Recogniser

Updated 23 December 2025
  • DAMER is an advanced multimodal framework that fuses acoustic, physiological, and time–frequency features to capture complementary emotional cues in music.
  • The framework integrates dual-view attention fusion, progressive confidence labeling, contrastive memory, and adversarial adaptation to enhance robustness.
  • Empirical evaluations on benchmarks like Memo2496, DEAP, and 1000songs demonstrate state-of-the-art performance in arousal and valence recognition.

The Dual-view Adaptive Music Emotion Recogniser (DAMER) is an advanced multimodal framework designed to enhance music emotion recognition (MER) by fusing heterogeneous information streams, including acoustic, physiological, and time–frequency representations. DAMER architectures have been instantiated in several state-of-the-art studies to address the pervasive challenges of subjective affect annotation variability, feature drift between tracks, and the limited predictive power of single-modality approaches (Li et al., 16 Dec 2025, Avramidis et al., 2022, Yin et al., 2020). Key DAMER systems integrate sophisticated deep learning architectures, fusion modules, semi/self-supervised strategies, and robust training protocols to maximize recognition accuracy and generalizability across diverse MER datasets.

1. Dual-View Input Modalities and Feature Preparation

DAMER systems fundamentally process two distinct views of music or music-induced response data, designed to capture complementary aspects of emotional perception.

  • Time–frequency audio representations: Architectures utilize Mel spectrograms and cochleagrams as parallel streams, derived from raw or preprocessed audio using pre-emphasis, short-time Fourier transform, Mel filterbanks, and gammatone-based cochlear filtering. Output: $\mathbf{H}^{\rm Mel} \in \mathbb{R}^{B \times N_{\rm Mel} \times D}$, $\mathbf{H}^{\rm Coch} \in \mathbb{R}^{B \times N_{\rm Coch} \times D}$ (Li et al., 16 Dec 2025).
  • Physiological/behavioral signal streams: EEG (electroencephalography) signals (e.g., 32-channel DEAP EEG at 128 Hz segmented into 3 s non-overlapping windows, differential entropy per band/channel, yielding $X \in \mathbb{R}^{T \times 32}$) (Avramidis et al., 2022), or EDA/GSR (electrodermal activity/galvanic skin response) decomposed into phasic/tonic/residual signals using CvxEDA (Yin et al., 2020).
  • Acoustic feature vectors: For architectures using EDA, precomputed, static music feature vectors (IS10-Paraling from openSMILE, 1582-dimensional, $f_{\rm music} \in \mathbb{R}^{1582}$) serve as external emotion priors (Yin et al., 2020).

This dual-view arrangement provides both stimulus-driven (audio) and subject-adaptive (physio/behavioral) perspectives, shown empirically to yield optimal arousal and valence discrimination (Li et al., 16 Dec 2025).
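
A minimal sketch of this dual-view feature preparation is given below, assuming librosa for the Mel stream and SciPy's gammatone filters for a crude cochleagram; the parameter choices (filterbank sizes, hop length, pre-emphasis coefficient) are illustrative and not taken from the cited implementations.

```python
# Illustrative dual-view feature extraction: log-Mel spectrogram + gammatone
# cochleagram. Library choices and parameters are assumptions, not the
# authors' released pipeline.
import numpy as np
import librosa
from scipy.signal import gammatone, lfilter

def mel_view(y, sr, n_mels=128, n_fft=2048, hop=512):
    """Log-Mel spectrogram: pre-emphasis -> STFT -> Mel filterbank -> dB."""
    y = np.append(y[0], y[1:] - 0.97 * y[:-1])            # pre-emphasis
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)            # (n_mels, frames)

def cochleagram_view(y, sr, n_bands=64, hop=512):
    """Crude cochleagram: gammatone IIR filterbank + framewise log energy."""
    centre_freqs = np.geomspace(50.0, 0.9 * sr / 2, n_bands)  # assumption: log spacing
    frames = 1 + (len(y) - hop) // hop
    coch = np.zeros((n_bands, frames))
    for i, fc in enumerate(centre_freqs):
        b, a = gammatone(fc, 'iir', fs=sr)                  # 4th-order gammatone band
        band = lfilter(b, a, y)
        for t in range(frames):
            seg = band[t * hop:(t + 1) * hop]
            coch[i, t] = np.log(np.mean(seg ** 2) + 1e-10)  # log energy per frame
    return coch                                             # (n_bands, frames)
```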

2. Core Architectural Modules

DAMER frameworks are characterized by specialized modules for cross-modal interaction, fusion, and robust learning.

2.1. Dual-Stream Attention Fusion (DSAF)

DSAF is a bidirectional cross-attention transformer enabling mutual fine-grained correlation mining between Mel and cochleagram tokens. At each transformer layer, both streams alternately serve as query/key/value for multi-head attention (MHA):

$$\mathbf{H}_{\rm Mel \leftarrow Coch} = \mathrm{MHA}(\mathbf{Q}=\mathbf{H}^{\rm Mel}, \mathbf{K}=\mathbf{H}^{\rm Coch}, \mathbf{V}=\mathbf{H}^{\rm Coch})$$

with a subsequent residual connection, layer normalization, and MLP:

$$\mathrm{FFN}(x) = \mathrm{Linear}_2(\mathrm{GELU}(\mathrm{Linear}_1(x)))$$

The process is repeated symmetrically for cochleagram queries (Li et al., 16 Dec 2025).
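
The PyTorch sketch below illustrates one DSAF-style bidirectional cross-attention layer following the formulas above; module names, head counts, and dimensions are assumptions rather than the released implementation.

```python
# One bidirectional cross-attention fusion layer (DSAF-style), as a sketch.
import torch
import torch.nn as nn

class DualStreamAttentionLayer(nn.Module):
    def __init__(self, d_model=128, n_heads=4, d_ff=256):
        super().__init__()
        self.attn_m2c = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.attn_c2m = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_m, self.norm_c = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.ffn_m = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                   nn.Linear(d_ff, d_model))
        self.ffn_c = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                   nn.Linear(d_ff, d_model))

    def forward(self, h_mel, h_coch):            # (B, N_Mel, D), (B, N_Coch, D)
        # Mel tokens attend to cochleagram tokens (Q = Mel, K = V = Coch) ...
        m_attn, _ = self.attn_m2c(h_mel, h_coch, h_coch)
        h_mel = self.norm_m(h_mel + m_attn)
        h_mel = h_mel + self.ffn_m(h_mel)
        # ... and symmetrically for cochleagram queries (ordering is one
        # possible choice; the paper's exact update order is not specified here).
        c_attn, _ = self.attn_c2m(h_coch, h_mel, h_mel)
        h_coch = self.norm_c(h_coch + c_attn)
        h_coch = h_coch + self.ffn_c(h_coch)
        return h_mel, h_coch
```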

2.2. Progressive Confidence Labelling (PCL)

PCL addresses annotation scarcity via curriculum-driven pseudo-labelling, employing a temperature-scheduled softmax:

$$\tau(t)=\tau_{\max}-(\tau_{\max}-\tau_{\min})\frac{t}{T_{\rm total}}$$

Reliability is estimated with the Jensen–Shannon divergence (JSD) between Mel and cochleagram predictions:

$$r = \exp\left(-\mathrm{JSD}(\mathbf{p}^{\rm Mel}\,\Vert\,\mathbf{p}^{\rm Coch})\right)$$

Samples satisfying dynamic confidence thresholds are assigned pseudo-labels and contribute to a weighted cross-entropy loss (Li et al., 16 Dec 2025).
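
A hedged sketch of the PCL mechanics follows: linear temperature annealing, JSD-based cross-view reliability, and a confidence mask for pseudo-label selection. The temperature bounds and thresholding rule are assumptions.

```python
# Sketch of PCL-style pseudo-label selection; exact schedule/thresholds assumed.
import torch
import torch.nn.functional as F

def temperature(t, t_total, tau_max=2.0, tau_min=0.5):
    """Linearly annealed softmax temperature (bounds are assumptions)."""
    return tau_max - (tau_max - tau_min) * t / t_total

def jsd(p, q, eps=1e-8):
    """Jensen-Shannon divergence between two categorical distributions."""
    m = 0.5 * (p + q)
    kl_pm = (p * ((p + eps).log() - (m + eps).log())).sum(dim=-1)
    kl_qm = (q * ((q + eps).log() - (m + eps).log())).sum(dim=-1)
    return 0.5 * kl_pm + 0.5 * kl_qm

def select_pseudo_labels(logits_mel, logits_coch, t, t_total, conf_thresh=0.9):
    tau = temperature(t, t_total)
    p_mel = F.softmax(logits_mel / tau, dim=-1)
    p_coch = F.softmax(logits_coch / tau, dim=-1)
    r = torch.exp(-jsd(p_mel, p_coch))            # cross-view reliability in (0, 1]
    conf, labels = (0.5 * (p_mel + p_coch)).max(dim=-1)
    mask = (conf * r) >= conf_thresh              # keep only confident, reliable samples
    return labels, r, mask
```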

2.3. Style-Anchored Memory Learning (SAML)

SAML incorporates a FIFO memory queue ($K=512$) of L2-normalized fused feature vectors with labels, applying a supervised contrastive (InfoNCE) loss:

$$\mathcal{L}_{\rm cont}(\mathbf{q}_i) = -\sum_{j=1}^{K} \mathbb{I}(y_i = y_j)\, \log\frac{\exp(\mathbf{q}_i \cdot \mathbf{k}_j / \tau_c)}{\sum_{k=1}^{K} \exp(\mathbf{q}_i \cdot \mathbf{k}_k / \tau_c)}$$

This anchors class distinctions and improves robustness to style/track drift (Li et al., 16 Dec 2025).
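
Below is a minimal sketch of a SAML-style FIFO memory queue with the supervised contrastive loss above; the queue initialization and update details are assumptions.

```python
# Sketch of a memory queue + supervised InfoNCE loss over stored (feature, label) pairs.
import torch
import torch.nn.functional as F

class StyleAnchoredMemory:
    def __init__(self, dim=256, size=512, tau_c=0.07, device="cpu"):
        # Placeholder initialization; in practice the queue is filled from
        # real labelled batches (binary labels assumed, e.g. low/high arousal).
        self.feats = F.normalize(torch.randn(size, dim, device=device), dim=1)
        self.labels = torch.randint(0, 2, (size,), device=device)
        self.ptr, self.size, self.tau_c = 0, size, tau_c

    @torch.no_grad()
    def enqueue(self, q, y):
        """FIFO replacement of the oldest entries with the new batch."""
        n = q.size(0)
        idx = (self.ptr + torch.arange(n, device=q.device)) % self.size
        self.feats[idx] = F.normalize(q, dim=1)
        self.labels[idx] = y
        self.ptr = (self.ptr + n) % self.size

    def contrastive_loss(self, q, y):
        """Supervised InfoNCE: sum log-probabilities over same-class keys."""
        q = F.normalize(q, dim=1)
        sim = q @ self.feats.T / self.tau_c                 # (B, K) similarities
        log_prob = sim - sim.logsumexp(dim=1, keepdim=True)
        pos = (y.unsqueeze(1) == self.labels.unsqueeze(0)).float()
        return -(pos * log_prob).sum(dim=1).mean()
```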

2.4. Latent Domain Adaptation (GRL/Adversarial)

For EEG–audio adaptation, a domain discriminator with a gradient reversal layer (GRL) enforces feature alignment across modalities, minimizing

$$\mathcal{L}_{\rm adv} = -\mathbb{E}_{x}\big[\log D(G_{\rm EEG}(x))\big] - \mathbb{E}_{y}\big[\log\big(1 - D(G_{\rm Audio}(y))\big)\big]$$

where the feature extractor gradients are reversed, promoting domain-invariant latent spaces (Avramidis et al., 2022).
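
The sketch below shows a standard gradient reversal layer together with a simple domain discriminator of the kind this loss implies; the discriminator architecture and the `lam` scaling are illustrative assumptions.

```python
# Gradient reversal layer (GRL) + domain discriminator sketch for EEG-audio alignment.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass, gradient sign flip (scaled by lam) backward."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

class DomainDiscriminator(nn.Module):
    """Predicts whether a latent vector came from the EEG or the audio encoder."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def adversarial_loss(self, z_eeg, z_audio, lam=1.0):
        # Gradients flowing back into the encoders are reversed, pushing the
        # two latent distributions toward each other (domain invariance).
        z = torch.cat([grad_reverse(z_eeg, lam), grad_reverse(z_audio, lam)], dim=0)
        logits = self.net(z).squeeze(1)
        target = torch.cat([torch.ones(z_eeg.size(0)),
                            torch.zeros(z_audio.size(0))]).to(z.device)
        return F.binary_cross_entropy_with_logits(logits, target)
```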

2.5. Residual Temporal–Channel Attention

Physio-driven DAMER systems employ RTCAG and RFE: 1D convolutions, residual nonlocal temporal attention (RNTA), and signal-channel attention (SCA), yielding highly representative time-series features before fusion and MLP classification (Yin et al., 2020).
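
As a rough illustration of channel-wise reweighting of physiological time series, the sketch below implements a squeeze-and-excitation style 1D channel attention; it conveys the general idea only and does not reproduce the exact RNTA/SCA blocks of Yin et al. (2020).

```python
# Generic squeeze-and-excitation style channel attention over a 1D feature map.
import torch
import torch.nn as nn

class ChannelAttention1D(nn.Module):
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                # x: (B, C, T) time-series feature map
        w = self.fc(x.mean(dim=2))       # squeeze over time, excite per channel
        return x * w.unsqueeze(2)        # channel reweighting, residual-friendly
```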

3. Learning Objectives and Optimization

DAMER architectures universally optimize a joint loss composed of supervised, pseudo-labelling, consistency, adversarial, and contrastive components. The canonical form in the most recent DAMER system is:

$$\mathcal{L}_{\rm total} = \lambda_1\mathcal{L}_{\rm cls} + \lambda_2\mathcal{L}_{\rm PL} + \lambda_3\mathcal{L}_{\rm consistency} + \lambda_4\mathcal{L}_{\rm cont}$$

with cross-entropy classification loss $\mathcal{L}_{\rm cls}$, pseudo-label loss $\mathcal{L}_{\rm PL}$, a Jensen–Shannon divergence loss for consistency, and a supervised contrastive loss for memory anchoring. Reported hyperparameters are $\lambda_1=1.0$, $\lambda_2=0.8$, $\lambda_3=0.2$, $\lambda_4=0.1$ (Li et al., 16 Dec 2025). Optimization uses Adam (learning rate $1\times 10^{-3}$) with a cosine-annealing schedule, batch size 32 per GPU, and gradient clipping.
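
A minimal sketch of this joint objective and optimization setup is given below; the model and the four loss terms are placeholders standing in for the outputs of the modules described in Section 2.

```python
# Weighted joint loss and training-step setup mirroring the settings quoted above.
import torch

LAMBDAS = {"cls": 1.0, "pl": 0.8, "consistency": 0.2, "cont": 0.1}

def total_loss(l_cls, l_pl, l_cons, l_cont):
    """Weighted sum of the four loss components (all torch scalars with grad)."""
    return (LAMBDAS["cls"] * l_cls + LAMBDAS["pl"] * l_pl
            + LAMBDAS["consistency"] * l_cons + LAMBDAS["cont"] * l_cont)

model = torch.nn.Linear(256, 2)                                   # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

def optimisation_step(l_cls, l_pl, l_cons, l_cont):
    optimizer.zero_grad()
    loss = total_loss(l_cls, l_pl, l_cons, l_cont)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)  # grad clipping
    optimizer.step()
    return loss.item()
```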

4. Datasets, Annotation, and Experimental Protocols

DAMER research leverages several expert-annotated, large-scale, and crowd-sourced datasets:

| Dataset   | Modality        | Size                      | Annotation Protocol                                        |
|-----------|-----------------|---------------------------|------------------------------------------------------------|
| Memo2496  | Audio           | 2496 tracks               | 30 experts, continuous V–A, calibrated, consistency ≤ 0.25 |
| DEAP      | Video + EEG/EDA | 32 subjects × 40 videos   | 60 s clips, valence/arousal [1–9], EEG/EDA                 |
| PMEmo     | Audio + EDA     | 457 × ~17 subjects/tracks | Continuous V–A, resampled, GSR/EDA                         |
| 1000songs | Audio           | 1000 tracks               | Crowd-sourced V–A, binarized                               |
| AMIGOS    | Video + EDA     | 40 subjects × 16 clips    | GSR/EDA, short videos                                      |

Memo2496 annotation included calibration with “extreme” exemplars, rest protocols to reduce carry-over, and expert cross-annotation for intra/inter-rater consistency (Li et al., 16 Dec 2025). Audio is always pre-emphasized, loudness-normalized to –23 LUFS, cropped to a central segment (avoiding track edges), and featurized into both Mel and cochleagram representations.
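
One possible preprocessing pipeline matching this description is sketched below, assuming pyloudnorm for LUFS normalization and librosa for loading; the authors' exact toolchain is not specified here.

```python
# Illustrative audio preprocessing: loudness normalisation to -23 LUFS,
# central-segment cropping, and pre-emphasis.
import numpy as np
import librosa
import pyloudnorm as pyln

def preprocess(path, sr=22050, segment_s=60.0, target_lufs=-23.0):
    y, sr = librosa.load(path, sr=sr, mono=True)
    # Loudness-normalise to the target integrated loudness.
    meter = pyln.Meter(sr)
    y = pyln.normalize.loudness(y, meter.integrated_loudness(y), target_lufs)
    # Crop the central segment to avoid fade-ins/outs at the track edges.
    seg = int(segment_s * sr)
    if len(y) > seg:
        start = (len(y) - seg) // 2
        y = y[start:start + seg]
    # Pre-emphasis before Mel/cochleagram featurisation.
    return np.append(y[0], y[1:] - 0.97 * y[:-1]), sr
```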

5. Comparative Evaluation and State-of-the-Art Results

DAMER systems have established new state-of-the-art results in music emotion recognition by outperforming strong baselines on every benchmark:

  • Memo2496 (expert): Arousal 82.95% Acc (+3.43% vs MCGC-net), F1 80.70%, AUC 90.21%; Valence 78.34% Acc, F1 84.72%
  • 1000songs: Arousal 81.28% Acc (+2.25%), Valence 70.74%
  • PMEmo: Arousal 85.98% Acc (+0.17%), F1 90.58%; Valence 77.61%, F1 85.54%
  • DEAP (EEG→DAMER): Valence 70.4% (aggregate), Arousal 68.9%, vs EEG-only 67.8%/68.0% (Avramidis et al., 2022)
  • PMEmo (EDA+music): Valence 79.68%, Arousal 83.76% (Yin et al., 2020)

Ablation across all major DAMER modules (DSAF, PCL, SAML) consistently demonstrates that each contributes distinct, complementary accuracy gains (Li et al., 16 Dec 2025). Visualization (t-SNE, entropy, intra-class distance) provides qualitative support for better class separation and style invariance.

6. Analysis, Insights, and Implications

Experimental and visualization analyses across multiple DAMER studies yield the following insights:

  • Cross-track feature drift is effectively mitigated via contrastive memory (SAML), as seen in intra-class feature stabilization metrics (entropy ≈ 0.685, L2 centroid distance ≈ 0.60–0.65).
  • Pseudo-labelling (PCL) rapidly attains stable confidence/reliability (>0.90) and mask coverage (>92%) as temperature and threshold are annealed, supporting effective use of unlabeled data under curriculum (Li et al., 16 Dec 2025).
  • Temporal affective variance: EEG mAP fluctuates across song segments, suggesting moments of strong entrainment. This temporal alignment is only revealed after latent cross-modal fusion (Avramidis et al., 2022).
  • Domain adaptation: Removal of adversarial GRL reduces EEG accuracy by up to 4% and degrades cross-modal retrieval, highlighting its necessity for domain-invariant feature learning (Avramidis et al., 2022).
  • Model efficiency: End-to-end 1D-ResNet models (e.g., RTCAN-1D) reach comparable state-of-the-art accuracy with much lighter computation when channel/temporal attention and fusion are leveraged appropriately (Yin et al., 2020).

A plausible implication is that DAMER’s modular framework is agnostic to precise modality selection, but its core principles—orthogonal feature fusion, semi-supervised label mining, and domain or style-regularized learning—generalize broadly to affective computing across music, physiological, and multimodal behavioral data.

7. Implementation Notes and Public Resources

DAMER systems are implemented with standardized settings to ensure reproducibility and fair benchmarking (consolidated in the configuration sketch after the following list):

  • Embedding dimensions: e.g., $D=128$, $D_f=256$ (DSAF) (Li et al., 16 Dec 2025)
  • Cross-attention: 4 heads, 2 layers; memory queue $K=512$; momentum 0.95
  • Optimizers: Adam, batch size 32 (per GPU), cosine annealing LR
  • Audio: –23 LUFS normalization, central 60 s segments, no augmentation beyond cropping
  • Hardware: A100 80 GB GPUs, mixed precision, grad norm clipping at 5.0
  • All code and expert-annotated data (Memo2496) are publicly available, supporting open research and follow-up studies in music informatics and affective computing (Li et al., 16 Dec 2025).
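
The settings above can be collected into a single configuration; the sketch below uses hypothetical key names and simply mirrors the values listed in this section.

```python
# Hypothetical consolidated configuration reflecting the settings listed above;
# key names are illustrative, not taken from the released code.
CONFIG = {
    "embed_dim": 128,                  # D
    "fusion_dim": 256,                 # D_f (DSAF)
    "cross_attn": {"heads": 4, "layers": 2},
    "memory": {"size": 512, "momentum": 0.95},
    "optim": {"name": "adam", "lr": 1e-3, "schedule": "cosine",
              "batch_size": 32, "grad_clip": 5.0},
    "loss_weights": {"cls": 1.0, "pseudo": 0.8, "consistency": 0.2,
                     "contrastive": 0.1},
    "audio": {"lufs": -23.0, "segment_s": 60, "augment": "crop_only"},
    "precision": "mixed",              # A100 80 GB GPUs
}
```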

Overall, DAMER represents a family of adaptively fused, cross-modal, and robust deep learning architectures that have achieved state-of-the-art performance in music emotion recognition and established new standards for dataset quality, annotation consistency, and methodological transparency.
