
Meme Emotion Understanding (MEU)

Updated 21 November 2025
  • Meme Emotion Understanding (MEU) is a computational approach that fuses visual and linguistic data to infer emotional intent in internet memes.
  • State-of-the-art methods employ multimodal architectures, hierarchical reasoning, and cross-modal fusion to address challenges like sarcasm and cultural nuance.
  • Applications span content moderation, sentiment tracking, and emotionally aware dialogue systems, while challenges include data imbalance and the perception–cognition gap.

Meme Emotion Understanding (MEU) refers to the computational task of automatically inferring the emotional intent expressed in internet memes by interpreting both visual and linguistic modalities. MEU is critical for applications in content moderation, sentiment tracking, online safety, social computing, and emotionally aware dialogue systems. Contemporary MEU approaches unify advances in multimodal representation, hierarchical reasoning, and cross-modal fusion, but face challenges unique to the informal, culturally nuanced, and often ambiguous nature of meme data.

1. Datasets, Annotation Practices, and Taxonomies

MEU research depends on large, carefully annotated, and diverse benchmarks to capture the full scope of affective phenomena in memes. Major datasets include:

  • EmoBench-Reddit: 350 Reddit-derived memes from subreddits r/funny (humor), r/happy (happy), r/sadness (sad), r/sarcasm (sarcasm), using user flairs as weak emotion labels. Samples are filtered for flair consistency, safe content, strict image-text relevance, and absence of text-only memes. The dataset is evenly balanced across the four categories and provides both images and user captions (Li et al., 14 Sep 2025).
  • MOOD (Meme emOtiOns Dataset): 10,004 memes annotated for six Ekman basic emotions—fear, anger, joy, sadness, surprise, disgust—using dual expert annotation (Cohen’s κ = 0.86). Images are collected via emotion keywords and filtered for quality and appropriateness (Sharma et al., 15 Mar 2024).
  • Memotion Dataset: 9,871 memes annotated for sentiment (positive, neutral, negative), multi-label emotion categories (humor, sarcasm, offense, motivation), and fine-grained intensity scales (Sharma et al., 2020).
  • MET-MEME, HarMeme, Dank Memes, Reddit Meme Dataset: Cover broader emotion categories and non-English data, and add attributes such as offensiveness, metaphor, and intention (Konyspay et al., 21 Mar 2025, Zheng et al., 16 Mar 2025).

Annotation protocols emphasize multi-rater agreement (e.g., Cohen’s κ > 0.82 on the EmoBench-Reddit multiple-choice (MCQ) and open-ended (OE) questions), majority voting, and robust guidelines blending text and visual interpretation. Class distributions often show heavy imbalance, with emotions like joy and anger dominating meme content, while classes such as surprise and love are rare (Konyspay et al., 21 Mar 2025).
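These agreement and aggregation steps can be made concrete with a minimal sketch, assuming two annotators assigning categorical emotion tags; the labels are illustrative, and scikit-learn's cohen_kappa_score handles the κ computation:

```python
from collections import Counter
from sklearn.metrics import cohen_kappa_score

# Hypothetical per-meme labels from two annotators (categorical emotion tags).
annotator_a = ["joy", "anger", "sadness", "joy", "sarcasm", "joy"]
annotator_b = ["joy", "anger", "joy", "joy", "sarcasm", "anger"]

# Inter-rater agreement: Cohen's kappa corrects raw agreement for chance.
kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")

# Majority voting across any number of raters; ties are left unresolved here
# (in practice an adjudicator or discussion round would break them).
def majority_vote(labels_per_rater):
    label, count = Counter(labels_per_rater).most_common(1)[0]
    return label if count > len(labels_per_rater) / 2 else None

print(majority_vote(["joy", "joy", "anger"]))  # -> "joy"
```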

Emotion taxonomies span Ekman’s six basic emotions (MOOD), coarse sentiment polarity combined with multi-label affect categories such as humor, sarcasm, offense, and motivation (Memotion), and extended attribute sets covering offensiveness, metaphor, and intention (MET-MEME and related datasets).

2. Hierarchical Task Frameworks and Benchmarking

MEU benchmarks increasingly move toward hierarchical evaluation to capture distinctions between low-level perception and high-level cognition:

  • EmoBench-Reddit defines a staged framework:
    • Perception (color recognition, object presence/localization, open-ended image description)
    • Cognition (simple inference, intent/emotion recognition, deep multimodal reasoning involving sarcasm and empathy)
    • Each sample contains six MCQs and one OE question, aligned along a perception→cognition→empathy continuum (Li et al., 14 Sep 2025).
  • Open-ended evaluation: OE descriptions are scored via a combination of embedding cosine similarity and LLM-judged semantic fit, passing if the composite score satisfies S ≥ 0.75 (see the sketch after this list).
  • Multiple-choice metrics: Standard accuracy, per-class F1, and precision/recall for intent categorization.
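The benchmark's exact combination formula is not reproduced above; the sketch below assumes an equal-weight average of an embedding cosine similarity and a normalized LLM-judge rating, thresholded at 0.75 as stated. `embed` and `llm_judge_score` are hypothetical placeholders for a sentence encoder and an LLM-as-judge call:

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def composite_oe_score(candidate: str, reference: str,
                       embed, llm_judge_score,
                       weight: float = 0.5, threshold: float = 0.75):
    """Score an open-ended answer against a reference description.

    embed: callable mapping text -> np.ndarray (hypothetical sentence encoder)
    llm_judge_score: callable mapping (candidate, reference) -> float in [0, 1]
                     (hypothetical LLM-as-judge semantic-fit rating)
    The equal weighting is an assumption, not the benchmark's published formula.
    """
    sim = cosine_similarity(embed(candidate), embed(reference))
    judge = llm_judge_score(candidate, reference)
    score = weight * sim + (1.0 - weight) * judge
    return score, score >= threshold
```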

Performance profiling reveals:

  • SOTA MLLMs (e.g., GPT-5, Gemini-2.5-pro) achieve >0.85 accuracy on perception, but intent and deep reasoning tasks see a consistent ~15–20 point drop, especially for sarcasm (the best models reach ~0.63 on the deepest cognition tasks).
  • Under-recognition of sarcasm and poor empathy for "sad" categories are dominant error patterns (Li et al., 14 Sep 2025).
  • Ablations confirm a “perception→cognition gap”: robust object-level recognition does not ensure deeper affective understanding.

3. Multimodal Architectures and Fusion Strategies

MEU methods emphasize the necessity of integrating vision and language, employing increasingly sophisticated fusion, attention, and reasoning architectures:

  • ALFRED (MOOD dataset):
    • BERT for text and ViT for image backbone
    • Emotion-enriched visual embeddings via ViT fine-tuned on AffectNet
    • Gated Multimodal Fusion (GMF) for selective merging of emotion and content cues
    • Gated Cross-Modal Attention (GCA) for conditioning text on vision and vice versa (Sharma et al., 15 Mar 2024)
  • MemoDetector:
    • Four-step MLLM-driven textual enhancement (image description, text meaning, implicit meaning, context analysis)
    • Dual-stage modal fusion: shallow cross-modal concatenation (ViT + XLM-R) followed by deep bidirectional cross-attention over the enhanced embeddings (see the sketch after this list)
    • Empirically, the inclusion of all enhancement steps and two-stage fusion yields superior macro-F1 (+4.3% on MET-MEME, +3.4% on MOOD) (Shi et al., 14 Nov 2025)
  • CDEL (Cluster-based Deep Ensemble Learning):
    • Hierarchical clustering on facial embeddings as a proxy for emotional expression, with one-hot cluster ID concatenated with visual/text features for ensemble learning (Guo et al., 2023)
  • SEFusion: Lightweight adaptation of squeeze-and-excitation blocks for learning scalar gates over modality channels, enabling weight sharing across downstream emotion/fine-grained tasks (Guo et al., 2023)
  • Simpler methods (e.g., RoBERTa + ResNet, BERT + DenseNet) utilize early/late fusion with limited cross-modal grounding; performance drops when fusing modalities without strong alignment or sufficient data (Guo et al., 2020, Gupta et al., 2020)
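As an illustration of the dual-stage fusion pattern described for MemoDetector (not the authors' implementation), the following PyTorch sketch concatenates pooled unimodal features in a shallow stage and applies bidirectional cross-attention over token-level embeddings in a deep stage; the dimensions, mean pooling, and head count are assumptions:

```python
import torch
import torch.nn as nn

class TwoStageFusion(nn.Module):
    """Illustrative two-stage fusion: shallow concatenation plus bidirectional
    cross-attention. Unimodal encoders (e.g., ViT, XLM-R) are assumed to run
    upstream and supply token-level embeddings of dimension d."""
    def __init__(self, d: int = 768, n_heads: int = 8, n_classes: int = 6):
        super().__init__()
        self.txt_to_img = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.img_to_txt = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.classifier = nn.Linear(4 * d, n_classes)

    def forward(self, txt_tokens, img_tokens):
        # txt_tokens: (B, Lt, d); img_tokens: (B, Li, d)
        # Stage 1: shallow fusion of pooled unimodal features.
        shallow = torch.cat([txt_tokens.mean(dim=1), img_tokens.mean(dim=1)], dim=-1)
        # Stage 2: deep bidirectional cross-attention (text attends to image, and vice versa).
        t2i, _ = self.txt_to_img(txt_tokens, img_tokens, img_tokens)
        i2t, _ = self.img_to_txt(img_tokens, txt_tokens, txt_tokens)
        deep = torch.cat([t2i.mean(dim=1), i2t.mean(dim=1)], dim=-1)
        return self.classifier(torch.cat([shallow, deep], dim=-1))

# Usage with random tensors standing in for encoder outputs:
logits = TwoStageFusion()(torch.randn(2, 32, 768), torch.randn(2, 196, 768))
```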

Novel schemes (e.g., MemoDetector, ALFRED) consistently outperform naïve fusion or unimodal baselines, especially when incorporating semantic mining, emotion-enriched representations, and selective attention mechanisms.
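The gating idea behind ALFRED's GMF follows the general gated-multimodal-unit pattern; a minimal sketch (not the paper's exact module, with assumed dimensions) looks like this:

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Sketch of a gated multimodal unit: a learned sigmoid gate decides, per
    dimension, how much to take from each modality's projected features."""
    def __init__(self, d_text: int = 768, d_img: int = 768, d_out: int = 512):
        super().__init__()
        self.proj_text = nn.Linear(d_text, d_out)
        self.proj_img = nn.Linear(d_img, d_out)
        self.gate = nn.Linear(d_text + d_img, d_out)

    def forward(self, h_text, h_img):
        zt = torch.tanh(self.proj_text(h_text))   # projected text features
        zi = torch.tanh(self.proj_img(h_img))     # projected image features
        g = torch.sigmoid(self.gate(torch.cat([h_text, h_img], dim=-1)))
        return g * zt + (1.0 - g) * zi            # per-dimension convex mix

fused = GatedFusion()(torch.randn(4, 768), torch.randn(4, 768))  # shape (4, 512)
```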

4. Objective Functions, Training Protocols, and Interpretability

Objective functions in MEU generally follow standard supervised practice: single-label cross-entropy for basic-emotion classification and multi-label (binary cross-entropy) objectives for datasets such as Memotion, typically combined with class weighting or resampling to counter label imbalance.

Interpretability is addressed via:

  • Grad-CAM visualization (ALFRED) to expose salient visual regions guiding classification (e.g., a cat’s frowning eyes for anger); a minimal sketch follows this list
  • Human-in-the-loop evaluation: user studies querying meme clustering and emotional label agreement, averaging ~67% alignment between model and annotator clusters (Konyspay et al., 21 Mar 2025)
  • Cross-modal attention mapping and token-level visualization for explainability in sequence generation and meme selection (Fei et al., 2021)
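For reference, Grad-CAM itself can be sketched in a few lines; the example below uses an untrained torchvision ResNet-18 as a stand-in classifier rather than ALFRED's ViT-based pipeline:

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

model = resnet18(weights=None).eval()   # untrained stand-in classifier
acts, grads = {}, {}

# Capture activations and gradients at the last convolutional block.
target = model.layer4
target.register_forward_hook(lambda m, i, o: acts.update(v=o))
target.register_full_backward_hook(lambda m, gi, go: grads.update(v=go[0]))

x = torch.randn(1, 3, 224, 224)          # stand-in for a meme image tensor
logits = model(x)
logits[0, logits.argmax()].backward()    # gradient of the predicted class score

# Grad-CAM: weight each channel by its average gradient, sum, then ReLU.
weights = grads["v"].mean(dim=(2, 3), keepdim=True)
cam = F.relu((weights * acts["v"]).sum(dim=1, keepdim=True))
cam = F.interpolate(cam, size=x.shape[-2:], mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # heatmap in [0, 1]
```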

In error diagnosis, models frequently struggle with:

  • Cross-modal dissociation, where text and image project opposing or orthogonal affective signals
  • Sarcasm and metaphor, often requiring cultural or world-knowledge external to the meme itself

5. Applications in Dialogue, Moderation, and Emotional AI

MEU underpins several downstream applications:

  • Conversational AI: Integration into open-domain dialogue systems via explicit emotion modeling enables emotionally resonant responses and meme recommendation. For instance, PLATO-2-based systems employ emotion flow mechanisms and emotion description prediction to boost emotion classification accuracy in dialogue contexts (+2.6 percentage points over base models), and improve recall of appropriate meme suggestion by >10 points even for unseen memes (Lu et al., 2022).
  • Content moderation: Leveraging MLLM-derived emotion annotations improves harmful meme detection; adding emotion embeddings increases accuracy by up to 1% on challenging datasets, surpassing SOTA (Tzelepi et al., 14 Apr 2025). A sketch of this feature-augmentation idea follows the list.
  • Bilingual and cross-cultural MEU: Bilingual architectures (MGMCF on MET-MEME) utilize object-level semantic mining and global-local attention to improve sentiment, metaphor, and offensiveness detection in both English and Chinese memes by up to 4% absolute accuracy (Zheng et al., 16 Mar 2025).
  • Recommendation, personalization, and affective computing: Multi-modal emotion clusters and nuanced emotion detection augment real-time recommendation and enhance user experience on social platforms (Konyspay et al., 21 Mar 2025).
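The emotion-as-feature idea used for moderation can be sketched as a simple late-fusion classifier; the architecture, dimensions, and binary harmful/benign head below are assumptions rather than a published design:

```python
import torch
import torch.nn as nn

class HarmfulMemeClassifier(nn.Module):
    """Sketch: concatenate a meme's multimodal content embedding with an
    emotion probability vector (e.g., six Ekman emotions) before classifying
    the meme as harmful or benign. Illustrative only."""
    def __init__(self, d_content: int = 512, n_emotions: int = 6):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(d_content + n_emotions, 256),
            nn.ReLU(),
            nn.Linear(256, 2),   # harmful vs. benign
        )

    def forward(self, content_emb, emotion_probs):
        return self.head(torch.cat([content_emb, emotion_probs], dim=-1))

# Random content embeddings plus a softmaxed emotion vector as placeholders:
logits = HarmfulMemeClassifier()(torch.randn(8, 512),
                                 torch.softmax(torch.randn(8, 6), dim=-1))
```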

6. Limitations, Open Challenges, and Future Directions

Persisting obstacles in MEU include:

  • Perception–cognition–empathy gap: Leading models excel at low-level recognition but degrade in tasks requiring intent recognition, sarcasm identification, and deep empathy. Sarcasm remains the “greatest hurdle,” often necessitating explicit contradiction detection and Theory-of-Mind reasoning (Li et al., 14 Sep 2025).
  • Cross-modal noise and dissociation: Models are distracted by irrelevant modality-specific cues; identical images with altered text may signal entirely different emotions (Sharma et al., 15 Mar 2024, Shi et al., 14 Nov 2025).
  • Class and data imbalance: Extreme skew toward certain emotions (joy, anger) impedes rare and subtle label acquisition, while limited meme diversity (cultural, linguistic) curtails generalization.
  • World knowledge and cultural context: Memes frequently rely on contemporary or subcultural references; current architectures underperform in the absence of external knowledge integration (Sharma et al., 2020).

Key recommendations include integrating external world and cultural knowledge, explicitly modeling contradiction and Theory-of-Mind cues to handle sarcasm and metaphor, rebalancing and diversifying data to cover rare emotions, languages, and cultures, and evaluating along the full perception→cognition→empathy continuum rather than perception alone.

7. Summary and Impact

MEU has matured from coarse-grained sentiment analysis to comprehensive, hierarchical emotion understanding built on multimodal, high-fidelity benchmarks and multi-stage architectures. The field is converging toward unified frameworks that embed, enhance, and fuse information across modalities, mine implicit meanings, and perform hierarchical reasoning from perception to empathy. Recent advances such as ALFRED, MemoDetector, MGMCF, and hierarchical benchmarks like EmoBench-Reddit establish robust protocols and clear diagnostics for next-generation emotionally intelligent multimodal AI. These developments lay the methodological foundations for broader adoption in moderation, conversational AI, and affect-centric media analysis, while highlighting open problems in sarcasm comprehension, rare emotion handling, and cultural transfer (Li et al., 14 Sep 2025, Shi et al., 14 Nov 2025, Sharma et al., 15 Mar 2024, Tzelepi et al., 14 Apr 2025, Zheng et al., 16 Mar 2025).
