OV-MERD: Open-Vocabulary Multimodal Emotion Recognition
- The dataset OV-MERD is designed for open-vocabulary multimodal emotion recognition by using free-form natural language annotations that capture culturally diverse and nuanced emotions.
- It features 332 short video clips with aligned audio and subtitles, facilitating evaluation of multimodal fusion strategies that enhance overall recognition performance.
- A multi-stage annotation protocol combining manual review and GPT-based clustering yields consistent labels, supporting detailed benchmarking of fusion strategies across audio, visual, and textual modalities.
The Open-Vocabulary Multimodal Emotion Recognition Dataset (OV-MERD) is a corpus designed to advance multimodal emotion recognition (MER) by removing the constraints of closed emotion taxonomies and enabling models to operate on complex, fine-grained, and culturally diverse emotional states. OV-MERD consists of short video clips with aligned audio and transcript modalities, annotated using unconstrained, natural-language emotion descriptors. Its primary objective is to facilitate benchmarking and model development for open-vocabulary MER, enhancing coverage, generalizability, and interpretability in affective computing across audio, visual, and textual modalities (Han et al., 24 Dec 2025, Lian et al., 2024).
1. Dataset Construction and Motivations
OV-MERD was built to address critical deficiencies in traditional MER datasets, which typically restrict annotations to a handful of basic emotions (e.g., anger, happiness, sadness) and thus fail to reflect the multi-appraisal and nuanced spectrum of real-world affect. The OV-MERD corpus enables free-form, open-vocabulary labeling, capturing subtle, compound, and culturally specific expressions. Its annotations originate from the ACM Multimedia 2024 MER challenge and further extend methodologies from MER2023 (Han et al., 24 Dec 2025, Lian et al., 2024).
- Size and Composition: OV-MERD contains 332 samples, each a short video clip (0.2–22.1 s; mean 3.9 s; ~25 FPS).
- Modalities: All clips contain video; all but 20 contain audio. Subtitles (UTF-8 Chinese text) are available for each clip.
- A plausible implication is that the dataset is primarily suitable for short conversational or situational benchmarks, as the duration distribution is highly skewed toward brief episodes.
2. Multi-stage Annotation and Labeling Protocols
OV-MERD utilizes an open-vocabulary, multi-label annotation process:
- Free-form Labels: Each clip receives 1–9 natural-language emotion terms (mean ≈3.34; total unique labels: 248), including both common (e.g. "happy," "sad") and rare or nuanced (e.g. "questioning," "suspicious") expressions.
- Synonym Grouping: An automated mapping function $G(\cdot)$, powered by GPT-3.5-Turbo, clusters semantically similar labels to address near-synonym redundancy. Clusters are established via prompts instructing the model to group terms for consistent evaluation (a minimal application sketch follows this list).
- Quality Assurance: While two stages of manual and LLM-based review are conducted in (Lian et al., 2024), no inter-annotator agreement metrics or detailed human annotation guidelines are provided for the ACM 2024 challenge protocol in (Han et al., 24 Dec 2025).
- Multi-language Consistency: The annotation process in (Lian et al., 2024) includes English/Chinese label extraction and translation, with Jaccard similarity used to quantify cross-lingual alignment—though this is not explicitly mentioned in (Han et al., 24 Dec 2025).
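To make the grouping step concrete, the following sketch applies a cluster mapping to raw label sets at evaluation time. The function name and the SYNONYM_MAP contents are illustrative assumptions rather than part of the released tooling; the actual mapping would be produced by prompting GPT-3.5-Turbo to group near-synonymous terms.

```python
# Minimal sketch: apply a synonym-cluster mapping to open-vocabulary labels.
# The SYNONYM_MAP below is a hypothetical example of GPT-produced clusters.

def group_labels(labels: list[str], synonym_map: dict[str, str]) -> set[str]:
    """Map each raw label to its cluster representative (identity if unmapped)."""
    return {synonym_map.get(l.lower().strip(), l.lower().strip()) for l in labels}

SYNONYM_MAP = {
    "irritated": "angry",
    "annoyed": "angry",
    "doubtful": "suspicious",
    "questioning": "suspicious",
}

print(group_labels(["Angry", "irritated", "questioning"], SYNONYM_MAP))
# {'angry', 'suspicious'}
```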
The annotation schema consists of flat lists of emotion descriptors per sample, defined as JSON-like objects:
```json
{
  "clip_id": "VID_0234",
  "duration_s": 4.12,
  "emotions": ["suspicious", "angry", "dissatisfied", "questioning"]
}
```
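Assuming the per-clip records are stored as a JSON list in the shape shown above (the file name below is hypothetical), the corpus-level label statistics quoted earlier can be recomputed directly:

```python
import json
from collections import Counter

# Hypothetical path; the released package may organize records differently.
with open("ov_merd_annotations.json", encoding="utf-8") as f:
    records = json.load(f)  # list of objects shaped like the example above

label_counts = Counter(label for r in records for label in r["emotions"])
labels_per_clip = [len(r["emotions"]) for r in records]

print("clips:", len(records))                                    # 332 reported
print("unique labels:", len(label_counts))                       # 248 reported
print("mean labels/clip:", sum(labels_per_clip) / len(records))  # ~3.34 reported
print("most common:", label_counts.most_common(5))
```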
3. Data Processing and Multimodal Integration
- Video Processing: Two frame-sampling regimes are supported:
- Fixed: uniformly sample 24 frames per clip.
- Dynamic: sample frames at a fixed per-second rate, so that the number of sampled frames scales with clip length (see the sampling sketch at the end of this section).
- No additional resizing, normalization, or pixel-level preprocessing.
- Audio Processing: Raw waveforms are directly input into Audio-LLMs without explicit feature extraction (e.g., MFCC, spectrograms).
- Text Processing: Subtitle strings are included as plain UTF-8 text; no tokenization or vocabulary statistics provided.
- Cross-modal Alignment: The text, audio, and video streams are assumed to be co-located and time-aligned at the clip level; no explicit timestamp matching is described.
This suggests that OV-MERD prioritizes pragmatic integration for off-the-shelf multimodal architectures and prompt-based pipelines, foregoing fine-grained sample-level synchronization or low-level preprocessing.
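A minimal sketch of the two sampling regimes follows, assuming clip duration and frame rate are known; the helper names and the dynamic per-second rate are illustrative, since the exact rate is not specified in this summary.

```python
import numpy as np

def fixed_sampling(num_frames_in_clip: int, k: int = 24) -> np.ndarray:
    """Fixed regime: uniformly pick k frame indices across the clip."""
    return np.linspace(0, num_frames_in_clip - 1, num=k).round().astype(int)

def dynamic_sampling(duration_s: float, fps: float = 25.0,
                     rate_per_s: float = 2.0) -> np.ndarray:
    """Dynamic regime: frame count proportional to clip length.
    rate_per_s is an assumed value, not taken from the source."""
    k = max(1, round(duration_s * rate_per_s))
    num_frames_in_clip = max(int(duration_s * fps), 1)
    return np.linspace(0, num_frames_in_clip - 1, num=k).round().astype(int)

# Example: the dataset's mean-duration clip (3.9 s at ~25 FPS).
print(fixed_sampling(int(3.9 * 25)))   # 24 uniformly spaced indices
print(dynamic_sampling(3.9))           # ~8 indices under the assumed 2 frames/s
```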
4. Evaluation Metrics and Benchmark Protocols
The OV-MERD benchmark employs set-level metrics to accommodate open-vocabulary, label-clustered evaluation:
Let $Y$ denote the ground-truth label set for a sample, $\hat{Y}$ the predicted label set, and $G(\cdot)$ the cluster mapping applied element-wise to each set; then

$$\mathrm{Precision}_s = \frac{|G(Y)\cap G(\hat{Y})|}{|G(\hat{Y})|}, \qquad \mathrm{Recall}_s = \frac{|G(Y)\cap G(\hat{Y})|}{|G(Y)|}, \qquad \mathrm{Avg} = \frac{\mathrm{Precision}_s + \mathrm{Recall}_s}{2}.$$
No closed-set, top-k, CLIPScore, or matching-based metrics (e.g. BLEU, ROUGE) are reported in (Han et al., 24 Dec 2025), and such matching-based metrics correlate only weakly with the set-level scores (Lian et al., 2024).
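A minimal sketch of these set-level scores, assuming both label sets have already been passed through the grouping function $G(\cdot)$; the function name is illustrative.

```python
def set_level_scores(gt: set[str], pred: set[str]) -> dict[str, float]:
    """Set-level precision, recall, and their average over grouped labels."""
    if not gt or not pred:
        return {"precision": 0.0, "recall": 0.0, "avg": 0.0}
    overlap = len(gt & pred)
    precision = overlap / len(pred)
    recall = overlap / len(gt)
    return {"precision": precision, "recall": recall,
            "avg": (precision + recall) / 2}

# Example with already-grouped labels.
print(set_level_scores({"angry", "suspicious", "dissatisfied"},
                       {"angry", "suspicious", "calm"}))
# precision = recall = avg = 2/3
```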
Performance Benchmarks:
Unimodal, bimodal, and trimodal fusion ablations reveal:
- Text-only: Avg = 55.0%
- Video-only: Avg = 57.6%
- Audio-only: Avg = 47.2%
- Bimodal (text+video): Avg = 57.5%
- Bimodal (video+audio): Avg = 56.1%
- Trimodal (audio+video+text): Avg = 61.0%

These results establish the essential role of video information and demonstrate clear synergistic effects when all three modalities are fused.
5. Model Architectures and Fusion Strategies
OV-MERD supports benchmarking with a wide spectrum of multimodal and LLM architectures:
- Video-LLMs: InternVL2.5-26B, LLaVA-NeXT-Video-7B-DPO, LLaVA-Video-7B-Qwen2, Tarsier2-7B, GPT-4o-mini.
- Audio-LLMs: Qwen-Audio, Qwen2-Audio, Gemini 1.5-pro, Gemini 2.0-flash, Gemini 2.5-pro.
- General-purpose LLMs: Gemma2-9B, Llama3.1-8B, Qwen2.5-7B/32B/72B, DeepSeek-V3, OpenAI o3-mini.
- Reasoning-enhanced LLMs: DeepSeek-R1, OpenAI o3-mini (with reasoning).
The benchmark highlights a two-stage, "emotional clue-based" trimodal fusion pipeline:
- Clue Extraction: Each modality's salient emotional signals are extracted (video frames via a Video-LLM, audio via an Audio-LLM, text as the raw subtitles), producing clues $C_v$, $C_a$, and $C_t$.
- Fusion and Labeling: A general LLM is prompted with all clues to output natural-language emotion terms.
Mathematically, the predicted label set is

$$\hat{Y} = \mathrm{LLM}\big(\text{prompt},\, C_v,\, C_a,\, C_t\big),$$

where the prompt instructs a general-purpose LLM to infer open-vocabulary emotion terms from the three clue streams.
- The two-stage fusion outperforms all unimodal and bimodal approaches.
- Prompt engineering (few-shot, self-consistency, least-to-most prompting) provides incremental gains (~1–2% Avg).
- Reasoning-specialized models do not surpass well-prompted, general LLMs.
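The two-stage pipeline can be sketched as follows; the wrapper callables (extract_video_clues, extract_audio_clues, fuse_with_llm) are hypothetical placeholders for whichever Video-LLM, Audio-LLM, and general LLM a practitioner selects, and the prompt text is illustrative rather than the benchmark's exact wording.

```python
from typing import Callable

def emotion_clue_fusion(frames, waveform, subtitle: str,
                        extract_video_clues: Callable[..., str],  # wraps a Video-LLM
                        extract_audio_clues: Callable[..., str],  # wraps an Audio-LLM
                        fuse_with_llm: Callable[[str], str],      # wraps a general LLM
                        ) -> list[str]:
    """Two-stage, clue-based trimodal fusion: extract per-modality clues,
    then prompt a general LLM to emit open-vocabulary emotion terms."""
    c_v = extract_video_clues(frames)    # salient visual emotional signals
    c_a = extract_audio_clues(waveform)  # salient acoustic emotional signals
    c_t = subtitle                       # the text clue is the raw subtitle

    prompt = ("Given the following emotional clues, list the speaker's emotions "
              "as comma-separated natural-language terms.\n"
              f"Visual clues: {c_v}\nAudio clues: {c_a}\nSubtitle: {c_t}")
    response = fuse_with_llm(prompt)
    return [term.strip() for term in response.split(",") if term.strip()]
```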
6. Distribution, Access, and Usage Guidelines
- Label Distribution: A long-tail distribution is observed, with common emotions appearing frequently and many nuanced emotions present infrequently.
- Format: Each sample links the video, audio, transcript, and the final merged emotion annotations; no formal schema is published, and sample-level records follow the JSON-like structure shown above.
- Access: OV-MERD is made available under a CC BY-NC 4.0 license (non-commercial only), with code and dataset provided via GitHub. Supplementary materials contain the full data pack and linkage to the MER2023 parent corpus (Lian et al., 2024).
- Usage Recommendations: Multimodal fusion should be employed for optimal performance. EW-based grouping is recommended to reduce GPT-API dependency for reproducible metrics. Benchmarking should rank models by the set-level Avg score to balance precision and recall.
7. Limitations and Future Extensions
- Dataset Scale: OV-MERD remains small compared to broader vision-language benchmarks; future expansions should scale sample count and label diversity.
- Modality Weakness: Audio remains the weakest unimodal performer in current pipelines; refined acoustic feature engineering and frame-aligned preprocessing (e.g., spectrogram, facial-action unit analysis) are recommended.
- Label Weighting: All emotion terms are treated equally; future protocols may differentiate core vs. nuance or apply weighted scoring.
- Annotation Guidelines: The absence of publicly available annotation criteria and of inter-annotator agreement quantification remains a limitation.
- Cross-lingual Consistency: While a label-merging and translation protocol exists, systematic studies of cross-cultural label drift are still needed.
- Model Development: End-to-end fine-tuning of MLLMs, in place of prompt-only fusion, is advised for subsequent research.
A plausible implication is that as open-vocabulary MER moves toward increased scale and diversity, systematic documentation of annotation practices, multimodal alignment techniques, and more granular benchmarking protocols will be essential for robust, generalizable results in affective computing.