AVMeme Exam: Multimodal Meme Benchmark
- AVMeme Exam is a human-curated benchmark assessing multimodal cultural and contextual understanding through internet audio-visual memes.
- It features 1,032 rigorously selected clips with diverse modalities and languages to evaluate surface, emotional, and pragmatic comprehension.
- Evaluation protocols compare human and model performance across various question types while exposing gaps in textless audio interpretation and deep inference.
AVMeme Exam is a human-curated benchmark developed for evaluating the contextual and cultural understanding capabilities of multimodal LLMs (MLLMs) using iconic internet audio-visual memes. It is designed to assess surface, contextual, emotional, and pragmatic comprehension across a broad spectrum of linguistic, auditory, and cultural domains (Jiang et al., 25 Jan 2026).
1. Dataset Structure and Composition
The AVMeme Exam dataset consists of 1,032 ("meme-full") rigorously curated audio-visual clips, selected to probe a diverse range of multimodal understanding tasks. After removal of "text-cheat" items—memes whose content is fully recoverable by text-only models—the primary evaluation split ("meme-main") contains 846 items.
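As a minimal sketch, the split can be thought of as a simple filter over annotated items; the `text_cheat` field below is an illustrative stand-in for however the benchmark flags such items, not the paper's actual schema:

```python
# Illustrative only: derive a "meme-main"-style split from a "meme-full" list,
# assuming each item dict carries a hypothetical "text_cheat" flag marking
# memes that a text-only model could answer from surface text alone.
meme_full = [
    {"id": "clip_0001", "modality": "Speech", "text_cheat": False},
    {"id": "clip_0002", "modality": "Music", "text_cheat": True},
]

meme_main = [item for item in meme_full if not item["text_cheat"]]
print(f"meme-full: {len(meme_full)} items, meme-main: {len(meme_main)} items")
```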
Each meme is categorized into one of four audio-centric modalities, with approximate proportions:
| Modality | Approx. share of clips (%) | Description |
|---|---|---|
| Speech (Sp) | ~35 | Spoken language clips |
| Songs (So) | ~26 | Clips with lyrics or singing |
| Music (Mu, textless) | ~23 | Nonverbal instrumental music |
| Sound Effects (Sfx) | ~16 | Isolated non-musical audio cues |
Memes span more than ten languages, including English, Mandarin Chinese, Japanese, Korean, and Persian (Farsi), alongside several lower-resource languages, reflecting contributions from researchers with backgrounds in the U.S., China, Japan, India, the Middle East, and elsewhere. Five major regions are covered: North America, Europe, the Middle East, East Asia, and South Asia.
2. Question and Answer Design Framework
Every clip in AVMeme Exam is paired with a single multiple-choice question (MCQ), curated to assess a specific dimension of multimodal understanding. Each question is independently labeled by human verifiers along one of seven understanding levels:
| Level | Description (with example) |
|---|---|
| Audio Analysis (A) | Surface-level audio features ("What processing was applied to the voice?") |
| Language Analysis (L) | Literal linguistic content ("What does the speaker claim about himself?") |
| Contextual Inference (C) | Pragmatic context ("Which situation matches the tone and intent?") |
| Emotion Analysis (E) | Affect and feeling ("What will listeners feel on hearing this?") |
| Humor/Popularity (H) | Social/cultural signal ("Which is not a reason the clip is humorous?") |
| Usage/Application (U) | Norms/pragmatic use ("When is this meme typically used?") |
| World Knowledge (W) | External facts ("Who performs the original version of this track?") |
Each MCQ comprises 4–6 randomly shuffled options. Model and human performance is measured per category $c$ as the fraction of correct responses, $\text{Acc}_c = \frac{1}{N_c}\sum_{i=1}^{N_c}\mathbb{1}[\hat{y}_i = y_i]$, where $N_c$ is the count of items in category $c$ and $p_c = N_c / N$ is its proportion of the benchmark.
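A minimal sketch of this per-category scoring, assuming each graded item records its understanding level and whether the response was correct (the field names are illustrative, not the paper's):

```python
from collections import defaultdict

# Illustrative per-category accuracy, assuming each graded item is a dict with
# a "level" label (A, L, C, E, H, U, W) and a boolean "correct" field.
def per_category_accuracy(items):
    totals, hits = defaultdict(int), defaultdict(int)
    for item in items:
        totals[item["level"]] += 1
        hits[item["level"]] += int(item["correct"])
    # Acc_c = (# correct in category c) / N_c for each understanding level c.
    return {level: hits[level] / totals[level] for level in totals}

graded = [
    {"level": "L", "correct": True},
    {"level": "C", "correct": False},
    {"level": "C", "correct": True},
]
print(per_category_accuracy(graded))  # e.g. {'L': 1.0, 'C': 0.5}
```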
3. Metadata Annotation and Signal Control
Each meme is extensively annotated with standardized metadata:
- Source URL, onset/offset timestamps, and year of origin
- Language and sound category (Sp, So, Mu, Sfx)
- Transcript (when applicable)
- Concise human-written summary
- Typical usage description
- Emotion tags (e.g., happy, ironic, nostalgic)
- Sensitivity tags (sex, violence, race/gender, etc.)
- Visual-hint tags (no text, transcription subtitle, title/name, visual_cheat)
Metadata guides question construction and enables analysis stratified by year, sensitivity, and visual hinting. Strict controls are implemented to prevent leakage of explicit textual cues; ablation studies indicate that presence of meme names or subtitles (“visual_cheat”) can inflate model accuracy by 10–40%.
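A hedged sketch of what one annotated record might look like and how a cheat-free slice could be selected for stratified analysis (all field names and values are illustrative, not the paper's schema):

```python
# Illustrative metadata record; fields approximate the annotation categories
# listed above and are not the benchmark's actual schema.
example_item = {
    "id": "clip_0042",
    "source_url": "https://example.com/meme",   # hypothetical URL
    "onset_s": 12.0, "offset_s": 27.5, "year": 1998,
    "language": "en", "sound_category": "Mu",
    "transcript": None,                          # textless instrumental music
    "summary": "Orchestral sting used to mark dramatic reveals.",
    "typical_usage": "Reaction to an over-the-top plot twist.",
    "emotion_tags": ["ironic"],
    "sensitivity_tags": [],
    "visual_hint_tags": ["no text"],
}

def is_cheat_free(item):
    # Exclude items whose visuals leak the answer (meme name or subtitle overlays).
    return "visual_cheat" not in item["visual_hint_tags"]

clean_split = [item for item in [example_item] if is_cheat_free(item)]
```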
4. Evaluation Protocol and Metrics
Evaluation involves both SOTA MLLMs and human baselines. Models are partitioned into:
- Audio-only LLMs (SALMONN, Qwen2-Audio, Audio Reasoner, Kimi-Audio, Audio Flamingo 3, Step-Audio 2-mini, MiDashengLM, MiMo-Audio, Music Flamingo, GPT-4o Audio)
- Audio-visual LLMs (Baichuan-Omni, MiniCPM, Phi-4-Multimodal, Qwen2.5-Omni, Gemma 3n, Qwen3-Omni, Gemini 2.5 Flash, Gemini 3 Flash, Gemini 3 Pro)
Twenty native speakers (10 English, 10 Chinese) serve as the human baseline, each answering ~37–38 items, filtered by language and format.
Audio is standardized to mono 16 kHz WAV (≤30 s); video is converted to 360p at 1 fps MP4. MCQ options are formatted for randomized presentation. Model input strictly uses raw byte files to prevent metadata leakage.
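A rough preprocessing sketch consistent with the stated format targets, using ffmpeg; the exact pipeline and flags are an assumption, not the authors' tooling:

```python
import subprocess

def standardize(src, wav_out, mp4_out):
    # Audio: mono, 16 kHz WAV, truncated to 30 s (assumed ffmpeg-based pipeline).
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-ac", "1", "-ar", "16000", "-t", "30", wav_out],
        check=True,
    )
    # Video: 360p at 1 fps MP4; scale=-2:360 preserves the aspect ratio,
    # -an drops the audio track from the video file.
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-vf", "scale=-2:360", "-r", "1", "-an", mp4_out],
        check=True,
    )

standardize("meme_clip.mp4", "meme_clip_16k.wav", "meme_clip_360p.mp4")
```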
Primary metric is accuracy; standard metrics (precision, recall, F1) are acknowledged though not emphasized. Human performance is further stratified by prior exposure ("familiar" vs. "novice" items).
A paired t-test is prescribed for significance testing between two systems' accuracy vectors across the $N$ clips: $t = \frac{\bar{d}}{s_d / \sqrt{N}}$, where $\bar{d}$ is the mean per-clip accuracy difference and $s_d$ is the sample standard deviation of the differences.
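For instance, the paired comparison can be run with scipy, assuming both per-clip correctness vectors are aligned by clip (the data below is made up for illustration):

```python
import numpy as np
from scipy import stats

# Per-clip correctness (1 = correct, 0 = wrong) for two systems on the same N clips.
model_a = np.array([1, 0, 1, 1, 0, 1, 1, 0])
model_b = np.array([1, 1, 1, 0, 0, 1, 0, 0])

# Paired t-test on the per-clip differences d_i = a_i - b_i:
# t = mean(d) / (std(d, ddof=1) / sqrt(N)).
t_stat, p_value = stats.ttest_rel(model_a, model_b)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```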
5. Empirical Results and Observations
Performance is assessed across question category, audio modality, language, and human/model comparison.
- By Question Category: Surface-level language tasks (L) achieve >90% with top models. Audio analysis (A) drops to ~50–60%. Deep context (C), emotion (E), usage (U), humor (H), and world knowledge (W) are substantially more difficult, with accuracies spanning 20–70%. Gemini 3 Pro achieves the highest overall: 80.0% on "meme-main" with visual input.
- Modality and Language Effects: Speech and song yield the highest accuracies (70–90%), while instrumental music and sound effects remain problematic (30–60%) even for leading models. Models perform best on English and Chinese; Japanese, Korean, and especially Persian impose greater difficulty.
- Human vs. Model: Human familiarity with a meme correlates with top performance (avg ≈ 85%). Leading models match or slightly exceed human novice scores on familiar items, but broad human competence remains superior, especially on unfamiliar clips.
- Temporal and Cultural Variation: Performance peaks for memes originating from 1980–2000; accuracy dips for older (<1980) or very recent (>2020) items, indicating possible distributional biases in pretraining data. LLMs underperform for memes in less-represented languages or with implicit cultural references.
6. Diagnostic Analyses and Gaps
Several systematic trends are evident:
- Textless Audio Bottleneck: Loss of linguistic cues in instrumental music and sound effects degrades model accuracy by 20–30% relative to speech or song. This reflects current MLLMs' reliance on language for inference and highlights limitations in prosodic, timbral, and genre-level comprehension.
- Surface vs. Deep Understanding: All models exhibit a performance hierarchy: L > A > {C, E} > {H, U} > W. Tasks involving context, emotion, usage, and world knowledge remain substantially below surface-level comprehension even in best-in-class systems (with drops of 15–30% for Gemini 3 Pro from L to pragmatic categories).
- Cheat Pathways: Allowing visual or on-screen textual hints artificially boosts scores; elimination of such cues is necessary for assessing genuine multimodal grounding.
7. Outlook and Prospects
The AVMeme Exam demonstrates a fundamental gap in current MLLM capability: effective surface content and language parsing, but persistent deficits in textless audio interpretation, contextual reasoning, emotion recognition, pragmatic usage, and world knowledge.
Recommended directions for bridging these gaps include:
- Expansion of audio training corpora to encompass both non-linguistic sounds and broader cultural artifacts.
- Supervision that targets emotion, typical use, and pragmatic context, emphasizing human-centered annotation.
- Objective function and architectural refinements prioritizing inference and contextualization over shallow recognition.
- Extension of the benchmark to further linguistic, cultural, and generational categories, and the development of generative, open-ended evaluation protocols complementing multiple-choice assessment.
AVMeme Exam provides not only a comprehensive diagnostic tool for multimodal intelligence but also a structured research agenda for advancing culturally and contextually aligned LLMs (Jiang et al., 25 Jan 2026).