AVMeme Exam: Multimodal Meme Benchmark

Updated 4 February 2026
  • AVMeme Exam is a human-curated benchmark assessing multimodal cultural and contextual understanding through internet audio-visual memes.
  • It features 1,032 rigorously selected clips with diverse modalities and languages to evaluate surface, emotional, and pragmatic comprehension.
  • Evaluation protocols compare human and model performance across various question types while exposing gaps in textless audio interpretation and deep inference.

AVMeme Exam is a human-curated benchmark developed for evaluating the contextual and cultural understanding capabilities of multimodal LLMs (MLLMs) using iconic internet audio-visual memes. It is designed to assess surface, contextual, emotional, and pragmatic comprehension across a broad spectrum of linguistic, auditory, and cultural domains (Jiang et al., 25 Jan 2026).

1. Dataset Structure and Composition

The AVMeme Exam dataset consists of 1,032 ("meme-full") rigorously curated audio-visual clips, selected to probe a diverse range of multimodal understanding tasks. After removal of "text-cheat" items—memes whose content is fully recoverable by text-only models—the primary evaluation split ("meme-main") contains 846 items.
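
A minimal sketch of how the "meme-main" split could be derived from the full set, assuming a JSON release with a boolean text-cheat flag per item; the `text_cheat` field name and file layout are hypothetical, not the published schema.

```python
# Sketch only: derive the "meme-main" evaluation split by dropping items
# whose content is fully recoverable by text-only models.
import json

def load_meme_main(path: str) -> list[dict]:
    """Load the full benchmark and keep items that require audio-visual understanding."""
    with open(path, encoding="utf-8") as f:
        meme_full = json.load(f)          # 1,032 annotated clips ("meme-full")
    # "text_cheat" is an assumed flag name marking text-solvable items.
    meme_main = [m for m in meme_full if not m.get("text_cheat", False)]
    return meme_main                      # expected to contain 846 items
```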

Each meme is categorized into one of four audio-centric modalities, with approximate proportions:

| Modality | Proportion of clips (%) | Description |
| --- | --- | --- |
| Speech (Sp) | ~35 | Spoken language clips |
| Songs (So) | ~26 | Clips with lyrics or singing |
| Music (Mu, textless) | ~23 | Nonverbal instrumental music |
| Sound Effects (Sfx) | ~16 | Isolated non-musical audio cues |

Memes span more than ten languages, including English, Mandarin Chinese, Japanese, Korean, Persian (Farsi), and several less widely represented languages, reflecting contributions from researchers with backgrounds in the U.S., China, Japan, India, the Middle East, and elsewhere. Five major global regions are covered: North America, Europe, the Middle East, East Asia, and South Asia.

2. Question and Answer Design Framework

Every clip in AVMeme Exam is paired with a single multiple-choice question (MCQ), curated to assess a specific dimension of multimodal understanding. Each question is independently labeled by human verifiers with one of seven understanding levels:

| Level | Description (with example) |
| --- | --- |
| Audio Analysis (A) | Surface-level audio features ("What processing was applied to the voice?") |
| Language Analysis (L) | Literal linguistic content ("What does the speaker claim about himself?") |
| Contextual Inference (C) | Pragmatic context ("Which situation matches the tone and intent?") |
| Emotion Analysis (E) | Affect and feeling ("What will listeners feel on hearing this?") |
| Humor/Popularity (H) | Social/cultural signal ("Which is not a reason the clip is humorous?") |
| Usage/Application (U) | Norms/pragmatic use ("When is this meme typically used?") |
| World Knowledge (W) | External facts ("Who performs the original version of this track?") |

Each MCQ comprises 4–6 randomly shuffled options. Model and human performance is measured per category as the fraction of correct responses:

$$\text{acc}_i = \frac{1}{N_i}\sum_{j=1}^{N_i} \mathbb{1}[\hat{y}_j = y_j], \qquad A_m = \sum_i p_i \,\text{acc}_i$$

where $N_i$ is the number of items in category $i$ and $p_i$ is its proportion of the evaluation set.
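
As a concrete illustration of this scoring, the sketch below computes per-category accuracy and the proportion-weighted mean $A_m$; the record layout (keys `category`, `pred`, `gold`) is assumed for illustration and is not the benchmark's released format.

```python
# Illustrative scoring: per-category accuracy acc_i and weighted mean A_m.
from collections import defaultdict

def category_accuracies(records):
    """records: iterable of dicts with keys 'category', 'pred', 'gold' (assumed layout)."""
    correct, total = defaultdict(int), defaultdict(int)
    for r in records:
        total[r["category"]] += 1
        correct[r["category"]] += int(r["pred"] == r["gold"])
    acc = {c: correct[c] / total[c] for c in total}           # acc_i per category
    n = sum(total.values())
    weighted = sum((total[c] / n) * acc[c] for c in acc)      # A_m = sum_i p_i * acc_i
    return acc, weighted
```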

3. Metadata Annotation and Signal Control

Each meme is extensively annotated with standardized metadata:

  • Source URL, onset/offset timestamps, and year of origin
  • Language and sound category (Sp, So, Mu, Sfx)
  • Transcript (when applicable)
  • Concise human-written summary
  • Typical usage description
  • Emotion tags (e.g., happy, ironic, nostalgic)
  • Sensitivity tags (sex, violence, race/gender, etc.)
  • Visual-hint tags (no text, transcription subtitle, title/name, visual_cheat)

Metadata guides question construction and enables analyses stratified by year, sensitivity, and visual hinting. Strict controls prevent leakage of explicit textual cues; ablation studies indicate that the presence of meme names or subtitles ("visual_cheat") can inflate model accuracy by 10–40%.
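
The annotation schema above could be represented roughly as in the following dataclass; field names and types are illustrative guesses rather than the release's actual serialization.

```python
# Illustrative-only record mirroring the metadata fields listed above.
from dataclasses import dataclass, field

@dataclass
class MemeRecord:
    source_url: str
    onset_s: float
    offset_s: float
    year: int
    language: str
    sound_category: str          # one of "Sp", "So", "Mu", "Sfx"
    transcript: str | None       # None for textless audio
    summary: str                 # concise human-written summary
    typical_usage: str
    emotion_tags: list[str] = field(default_factory=list)       # e.g. ["ironic", "nostalgic"]
    sensitivity_tags: list[str] = field(default_factory=list)   # e.g. ["violence"]
    visual_hint_tags: list[str] = field(default_factory=list)   # e.g. ["visual_cheat"]
```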

4. Evaluation Protocol and Metrics

Evaluation involves both state-of-the-art MLLMs and human baselines. Models are partitioned into two groups:

  • Audio-only LLMs (SALMONN, Qwen2-Audio, Audio Reasoner, Kimi-Audio, Audio Flamingo 3, Step-Audio 2-mini, MiDashengLM, MiMo-Audio, Music Flamingo, GPT-4o Audio)
  • Audio-visual LLMs (Baichuan-Omni, MiniCPM, Phi-4-Multimodal, Qwen2.5-Omni, Gemma 3n, Qwen3-Omni, Gemini 2.5 Flash, Gemini 3 Flash, Gemini 3 Pro)

Twenty native speakers (10 English, 10 Chinese) serve as the human baseline, each answering ~37–38 items, filtered by language and format.

Audio is standardized to mono 16 kHz WAV (≤30 s); video is converted to 360p MP4 at 1 fps. MCQ options are presented in randomized order. Models receive only the raw media bytes, preventing metadata leakage.
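
One possible implementation of this standardization step uses ffmpeg via subprocess calls; the paper does not specify its tooling, so this is a sketch under that assumption.

```python
# Sketch of media standardization: mono 16 kHz WAV (<=30 s) and 360p MP4 at 1 fps.
# Requires the ffmpeg binary to be installed and on PATH.
import subprocess

def standardize_audio(src: str, dst: str) -> None:
    """Convert audio to mono, 16 kHz, truncated to 30 seconds."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-ac", "1", "-ar", "16000", "-t", "30", dst],
        check=True,
    )

def standardize_video(src: str, dst: str) -> None:
    """Rescale video to 360p height and resample to 1 frame per second."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-vf", "scale=-2:360,fps=1", dst],
        check=True,
    )
```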

The primary metric is accuracy; standard metrics (precision, recall, F1) are acknowledged but not emphasized. Human performance is further stratified by prior exposure ("familiar" vs. "novice" items).

A paired t-test is prescribed for significance testing between human and model accuracy vectors across the $N$ clips:

$$t = \frac{\bar{d}}{s_d/\sqrt{N}}, \qquad d_i = \text{acc}_i^h - \text{acc}_i^m$$

where $s_d$ is the sample standard deviation of the differences $d_i$.
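
For reference, the statistic can be computed either directly from the formula or with scipy's paired t-test; the 0/1 correctness vectors below are toy data, not benchmark results.

```python
# Paired t-test over per-clip correctness, computed two equivalent ways.
import numpy as np
from scipy import stats

human = np.array([1, 1, 0, 1, 1, 0, 1, 1])   # acc^h per clip (toy data)
model = np.array([1, 0, 0, 1, 0, 0, 1, 1])   # acc^m per clip (toy data)

d = human - model
t_manual = d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))   # formula above
t_scipy, p_value = stats.ttest_rel(human, model)          # equivalent library call
print(t_manual, t_scipy, p_value)
```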

5. Empirical Results and Observations

Performance is assessed across question type, audio modality, language, and human/model comparison.

  • By Question Category: Surface-level language tasks (L) achieve >90% with top models. Audio analysis (A) drops to ~50–60%. Deep context (C), emotion (E), usage (U), humor (H), and world knowledge (W) are substantially more difficult, with accuracies spanning 20–70%. Gemini 3 Pro achieves the highest overall: 80.0% on "meme-main" with visual input.
  • Modality and Language Effects: Speech and song yield the highest accuracies (70–90%), while instrumental music and sound effects remain problematic (30–60%) even for leading models. Models perform best on English and Chinese; Japanese, Korean, and especially Persian impose greater difficulty.
  • Human vs. Model: Human familiarity with a meme correlates with top performance (avg ≈ 85%). Leading models match or slightly exceed human novice scores on familiar items, but broad human competence remains superior, especially on unfamiliar clips.
  • Temporal and Cultural Variation: Performance peaks for memes originating from 1980–2000; accuracy dips for older (<1980) or very recent (>2020) items, indicating possible distributional biases in pretraining data. LLMs underperform for memes in less-represented languages or with implicit cultural references.

6. Diagnostic Analyses and Gaps

Several systematic trends are evident:

  • Textless Audio Bottleneck: Loss of linguistic cues in instrumental music and sound effects degrades model accuracy by 20–30% relative to speech or song. This reflects current MLLMs' reliance on language for inference and highlights limitations in prosodic, timbral, and genre-level comprehension.
  • Surface vs. Deep Understanding: All models exhibit a performance hierarchy: L > A > {C, E} > {H, U} > W. Tasks involving context, emotion, usage, and world knowledge remain substantially below surface-level comprehension even in best-in-class systems (with drops of 15–30% for Gemini 3 Pro from L to pragmatic categories).
  • Cheat Pathways: Allowing visual or on-screen textual hints artificially boosts scores; elimination of such cues is necessary for assessing genuine multimodal grounding.

7. Outlook and Prospects

The AVMeme Exam exposes a fundamental gap in current MLLM capability: surface content and language are parsed effectively, while persistent deficits remain in textless audio interpretation, contextual reasoning, emotion recognition, pragmatic usage, and world knowledge.

Recommended directions for bridging these gaps include:

  • Expansion of audio training corpora to encompass both non-linguistic sounds and broader cultural artifacts.
  • Supervision that targets emotion, typical use, and pragmatic context, emphasizing human-centered annotation.
  • Objective function and architectural refinements prioritizing inference and contextualization over shallow recognition.
  • Extension of the benchmark to further linguistic, cultural, and generational categories, and the development of generative, open-ended evaluation protocols complementing multiple-choice assessment.

AVMeme Exam provides not only a comprehensive diagnostic tool for multimodal intelligence but also a structured research agenda for advancing culturally and contextually aligned LLMs (Jiang et al., 25 Jan 2026).
