MM-IMDb Genre Classification

Updated 25 June 2026

MM-IMDb Genre Classification is a multimodal, multi-label task that uses diverse inputs to predict movie genres.
It integrates advanced fusion architectures, attention mechanisms, and knowledge-graph reasoning to mitigate class imbalance and modality heterogeneity.
Recent studies demonstrate significant gains using contrastive learning and modality-specific feature extraction, setting new performance benchmarks.

Multimodal IMDb (MM-IMDb) genre classification denotes a multi-label classification task that leverages heterogeneous modalities—primarily plot summaries, posters, trailers, subtitles, and structured metadata—to predict movie genre(s). Because each movie can exhibit multiple genres simultaneously and the input modalities are structurally diverse, MM-IMDb genre classification is a canonical challenge for multimodal learning, structured reasoning, and handling severe class imbalance. Research on MM-IMDb has led to innovations in fusion architectures, knowledge-graph-based reasoning, robust attention mechanisms, and advanced contrastive objectives, with datasets and methodology continuously evolving to address new challenges in the domain.

1. Datasets and Benchmark Characteristics

The MM-IMDb framework, first instantiated on a dataset of 25,959 movies spanning 15 genres, provides each instance with a movie poster (image), plot summary (text), and metadata (title, director, cast, genre labels). Class labels are highly imbalanced, with frequent genres such as "Drama" vastly outnumbering rare categories like "Film-Noir" (e.g., a 65:1 ratio in MM-IMDb 2.0) (Li et al., 2023). The MM-IMDb 2.0 extension increases the sample count by ~30% to 33,742 movies while retaining the same genres, which intensifies the imbalance and heterogeneity.

Other datasets following the MM-IMDb paradigm extend modality coverage. For example, (Mangolin et al., 2020) reports a dataset of 10,594 titles, with each sample containing aligned English plot synopsis, subtitles, poster images, and trailer clips, along with 18 genre labels. The MM-IMDb poster subset, as described in (Nareti et al., 2024), focuses exclusively on visual and poster-extracted textual features, encompassing 13,882 posters across 4,464 movies and 13 genres, with up to five posters per movie and pronounced label skew. Curation often involves rigorous matching across APIs (TMDb, IMDb, OpenSubtitles, YouTube) and strict inclusion criteria to guarantee multimodality per example.
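The multi-label setup and the imbalance ratio mentioned above can be illustrated with a minimal sketch. The movie IDs and genre lists below are hypothetical toy data, not records from any MM-IMDb release, and the helper name `multi_hot` is chosen here for illustration only.

```python
from collections import Counter

# Hypothetical toy annotations; real MM-IMDb records carry many more fields.
movies = {
    "m1": ["Drama", "Romance"],
    "m2": ["Drama"],
    "m3": ["Drama", "Film-Noir"],
    "m4": ["Comedy", "Drama"],
}

# Fixed, sorted genre vocabulary so every target vector has the same layout.
genres = sorted({g for labels in movies.values() for g in labels})


def multi_hot(labels, vocabulary):
    """Encode a list of genre names as a 0/1 vector over the vocabulary."""
    label_set = set(labels)
    return [1 if g in label_set else 0 for g in vocabulary]


targets = {movie_id: multi_hot(labels, genres) for movie_id, labels in movies.items()}

# Imbalance ratio: support of the most frequent genre over the least frequent one.
counts = Counter(g for labels in movies.values() for g in labels)
imbalance_ratio = max(counts.values()) / min(counts.values())
```

On a real split the same ratio would be computed from the training labels only, so that validation and test statistics do not leak into any reweighting scheme.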

2. Modality-Specific Feature Extraction

Feature extraction protocols are modality-adapted and optimized for subsequent fusion:

Text (plot, synopsis, subtitles): Long Short-Term Memory (LSTM) encoders, pretrained on large corpora, capture temporal dependencies in narrative structure. Term Frequency-Inverse Document Frequency (TF-IDF) representations, random-projection dimensionality reduction, and pretrained word embeddings (e.g., Word2Vec, $d=300$) are common baseline textual features (Mangolin et al., 2020).
Poster (image): Deep visual descriptors such as CLIP ViT-B/32 encoders ($512$-D) (Li et al., 2023, Nareti et al., 2024) or Inception-v3 (2,048-D) (Mangolin et al., 2020) are used for extracting high-level semantic embeddings. Handcrafted descriptors (LBP, RGB histograms) provide ancillary representations (Mangolin et al., 2020).
Trailer and Video: 3D ConvNets (C3D) yield temporal-spatial representations (e.g., fc6, $4,096$-D). LRCN and CTT-MMC blend 2D features with sequence modeling capacity (Mangolin et al., 2020).
Audio: Mel Frequency Cepstral Coefficients (MFCCs), Statistical Spectrum Descriptors, and LBP-applied spectrograms enrich temporal and spectral coverage from trailer audio channels (Mangolin et al., 2020).
Metadata/Knowledge Graph (KG): Metadata (directors, cast, titles, genres) are ingested into a knowledge graph, which is then embedded via knowledge-graph embedding methods such as RotatE or TransH. Entity-relation triplets capture co-occurrence structure (e.g., director–title, cast–genre). Embedding dimension $D_k=200$ is typical, and the feature for each movie is aggregated over available entity embeddings (Li et al., 2023).
Poster Text (OCR): Posters are processed with OCR (e.g., Gemini-Pro-Vision), and subword tokens are passed through a CLIP text encoder to obtain $512$-D embeddings. While most posters are visually driven, textual tokens occasionally contribute salient cues (Nareti et al., 2024).
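As a rough illustration of the knowledge-graph feature described in the list above, the sketch below pools the embeddings of a movie's known entities. The source only says the feature is "aggregated over available entity embeddings"; mean pooling, the zero-vector fallback for movies with no known entity, the entity keys, and the 4-dimensional toy vectors are all assumptions made for this example (the text reports $D_k=200$ as typical).

```python
def aggregate_kg_feature(entities, embedding_table, dim):
    """Mean-pool the embeddings of entities found in the table.

    Mean pooling is an assumption of this sketch. Entities missing from the
    table are skipped; if none are known, a zero vector is returned.
    """
    known = [embedding_table[e] for e in entities if e in embedding_table]
    if not known:
        return [0.0] * dim
    return [sum(vec[j] for vec in known) / len(known) for j in range(dim)]


# Hypothetical pretrained entity embeddings (toy dimension 4).
embedding_table = {
    "director:A": [1.0, 0.0, 2.0, 0.0],
    "actor:B": [3.0, 2.0, 0.0, 0.0],
}

movie_feature = aggregate_kg_feature(
    ["director:A", "actor:B", "actor:unseen"], embedding_table, dim=4
)
cold_start_feature = aggregate_kg_feature(["actor:unseen"], embedding_table, dim=4)
```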

3. Multimodal Fusion Architectures

Contemporary fusion strategies range from late fusion of unimodal predictions to highly integrated attention-based frameworks:

Simple Fusion: Scores from unimodal models are combined via sum, product, or max rules, with empirically tuned thresholds (e.g., $\theta=0.3$ for sum, $\theta=0.01$ for product). This method shows strong complementarity when synopsis-LSTM and trailer-video encoders are fused, achieving $F_1=0.628$ and AUC$_{PR}=0.664$ (Mangolin et al., 2020).
Attention-Weighted Fusion: Each modality's projected feature is assigned a scalar attention weight by a shared linear network (sigmoid activation), producing a fused vector:

$F_i^{fu} = A_i^I h(F_i^I) + A_i^T h(F_i^T) + A_i^K h(F_i^K)$

where $A_i^I$, $A_i^T$, $A_i^K$ are attention weights for image, text, and KG respectively (Li et al., 2023).
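The two fusion styles above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the cited implementations: the function names, the `>=` thresholding convention, the toy scores, and the scorer parameters `w` and `b` are assumptions. The attention part follows the description in the text, with one shared linear scorer and a sigmoid producing a scalar weight per modality.

```python
import math


def late_fusion(score_lists, rule="sum", threshold=0.3):
    """Combine per-genre scores from several unimodal models, then threshold."""
    combine = {"sum": sum, "product": math.prod, "max": max}[rule]
    fused = [combine(per_genre) for per_genre in zip(*score_lists)]
    return [1 if score >= threshold else 0 for score in fused]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def attention_weighted_fusion(projected, w, b):
    """Fuse projected modality features h(F) with scalar attention weights.

    projected: dict mapping modality name to a vector of equal length.
    w, b: parameters of the shared linear scorer (hypothetical values here).
    """
    weights = {
        name: sigmoid(sum(wi * xi for wi, xi in zip(w, vec)) + b)
        for name, vec in projected.items()
    }
    dim = len(w)
    fused = [sum(weights[n] * projected[n][j] for n in projected) for j in range(dim)]
    return fused, weights


# Toy per-genre scores from two unimodal models (three genres).
synopsis_scores = [0.6, 0.2, 0.05]
trailer_scores = [0.3, 0.2, 0.1]
sum_prediction = late_fusion([synopsis_scores, trailer_scores], "sum", 0.3)
product_prediction = late_fusion([synopsis_scores, trailer_scores], "product", 0.01)

# Toy projected features; a zero scorer gives every modality weight 0.5.
projected = {"image": [1.0, 0.0], "text": [0.0, 1.0], "kg": [1.0, 1.0]}
fused, weights = attention_weighted_fusion(projected, w=[0.0, 0.0], b=0.0)
```

Because each weight comes from an independent sigmoid, the weights need not sum to one, which lets a sample down-weight every modality at once when all of them are uninformative.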

Knowledge-Guided Fusion: KG-derived embeddings provide relational priors that are injected jointly with visual and textual features, enhancing the fused representations with entity-based signals. The knowledge-graph feature is either concatenated or attention-weighted into the fusion.
Cross-Attention Fusion: Poster-specific architectures leverage cross-attention between image and text embeddings (MCAM: Multi-Head Cross Attention Module):

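$\mathrm{CrossAttn}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$

where, in the standard scaled dot-product form, the queries $Q$ come from one modality (image or text), the keys $K$ and values $V$ come from the other, and $d_k$ is the key dimension.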

Outputs are then combined, passed through sequential self-attention layers, and merged into a final embedding for classification (Nareti et al., 2024).
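A minimal single-head sketch of cross-attention between two modalities is given below. It uses the standard scaled dot-product form and omits the learned query, key, and value projections, the multiple heads, and the later self-attention layers, so it is a simplification rather than the MCAM module itself. All token values are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Each query (one modality) attends over keys/values (the other modality)."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        attn = softmax(scores)
        outputs.append(
            [sum(a * v[j] for a, v in zip(attn, values)) for j in range(len(values[0]))]
        )
    return outputs


# Toy example: one image token queries two text tokens.
image_tokens = [[0.0, 0.0]]            # zero query -> uniform attention
text_keys = [[1.0, 0.0], [0.0, 1.0]]
text_values = [[2.0, 0.0], [0.0, 4.0]]
image_attended = cross_attention(image_tokens, text_keys, text_values)
```

Running the same function with the roles swapped (text tokens as queries, image tokens as keys and values) gives the text-to-image direction.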

4. Specialized Learning Modules

Advanced modules are introduced to address persistent challenges in MM-IMDb genre prediction:

Attention Teacher (AT): To regularize and stabilize per-modality attention allocation, a self-supervised pseudo-label (derived from counts of directors/casts and their graph degrees) supervises the KG attention weight $A_i^K$ via a logarithmic loss, thereby preventing attention drift and benefitting all modalities through shared parameters (Li et al., 2023).
Genre-Centroid Anchored Contrastive Learning (G-CACL): Standard contrastive learning is incompatible with multi-label targets. G-CACL addresses this by constructing per-sample positive anchors as centroids of true genre embeddings:

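$c_i = \frac{1}{|Y_i|} \sum_{g \in Y_i} e_g$

where $Y_i$ is the set of true genres of sample $i$ and $e_g$ is the embedding of genre $g$.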

Batch negatives are drawn from the union of all genre labels except the true labels, and a contrastive loss sharpens fused embedding discriminability. Empirically, this approach enhances both head and tail genre performance (raising tail macro-F1 from ≈20% to ≈75–85%) (Li et al., 2023).
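The centroid-anchored idea can be sketched as follows. The positive anchor is the centroid of the sample's true genre embeddings, and the embeddings of all other genres act as negatives. The InfoNCE-style form of the loss, cosine similarity, the temperature value, and the toy 2-D embeddings are assumptions of this sketch; the exact objective in the cited work may differ.

```python
import math


def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[j] for v in vectors) / n for j in range(len(vectors[0]))]


def cosine(a, b):
    # Assumes neither vector is all zeros.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def centroid_contrastive_loss(fused, true_genres, genre_embeddings, temperature=0.1):
    """InfoNCE-style loss with a genre-centroid positive (assumed formulation)."""
    positive = centroid([genre_embeddings[g] for g in true_genres])
    negatives = [vec for g, vec in genre_embeddings.items() if g not in true_genres]
    pos_logit = cosine(fused, positive) / temperature
    logits = [pos_logit] + [cosine(fused, n) / temperature for n in negatives]
    peak = max(logits)  # log-sum-exp trick for numerical stability
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - pos_logit


# Hypothetical genre embeddings and fused sample embeddings.
genre_embeddings = {"Drama": [1.0, 0.0], "Romance": [0.0, 1.0], "Horror": [-1.0, 0.0]}
true_genres = {"Drama", "Romance"}
aligned_loss = centroid_contrastive_loss([1.0, 1.0], true_genres, genre_embeddings)
misaligned_loss = centroid_contrastive_loss([-1.0, 0.0], true_genres, genre_embeddings)
```

A fused embedding that points toward the centroid of its true genres receives a much smaller loss than one that points toward a negative genre, which is the behavior the objective is meant to encourage.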

Losses for Imbalance: Asymmetric Loss (ASL) is applied when label support is skewed, with separate focusing exponents $\gamma_+$ and $\gamma_-$ for the positive and negative terms and a probability margin $m$, countering dominance by head genres (Nareti et al., 2024).
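For reference, the general form of Asymmetric Loss can be written in a few lines. The sketch below follows the commonly published ASL formulation, with separate focusing exponents for positives and negatives and a probability margin that shifts negative predictions. The default values used here (`gamma_pos=0.0`, `gamma_neg=4.0`, `margin=0.05`) are illustrative assumptions, not the settings reported in the cited work.

```python
import math


def asymmetric_loss(probs, targets, gamma_pos=0.0, gamma_neg=4.0, margin=0.05, eps=1e-8):
    """Sum of per-label ASL terms for one sample.

    probs: predicted probabilities per genre; targets: 0/1 labels.
    Hyperparameter defaults are illustrative assumptions.
    """
    total = 0.0
    for p, y in zip(probs, targets):
        if y == 1:
            # Positive term: focal-style down-weighting of easy positives.
            total += -((1.0 - p) ** gamma_pos) * math.log(max(p, eps))
        else:
            # Negative term: shift by the margin, then down-weight easy negatives.
            p_m = max(p - margin, 0.0)
            total += -(p_m ** gamma_neg) * math.log(max(1.0 - p_m, eps))
    return total


easy_negative = asymmetric_loss([0.04], [0])   # below the margin, contributes nothing
hard_negative = asymmetric_loss([0.90], [0])
positive_term = asymmetric_loss([0.50], [1])
```

With these settings, negatives predicted below the margin are ignored entirely, so the many easy negatives from head genres do not swamp the gradient from the few positives of tail genres.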

5. Baseline Methods and Evaluation Protocols

Methods are benchmarked on standard splits: 60/10/30% (train/val/test) for MM-IMDb and MM-IMDb 2.0, or 80/10/10% (train/val/test) on poster-only collections. Reported metrics include Micro-F1, Macro-F1, Weighted-F1, Sample-wise F1, Hamming Loss, and Balanced Accuracy. Ablations gauge each module's impact.
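Three of the metrics listed above can be computed directly from multi-hot label matrices, as in the sketch below. The toy labels are hypothetical, and scoring a label with no true or predicted positives as F1 = 0 is a convention assumed for this example.

```python
def f1_from_counts(tp, fp, fn):
    """F1 from true positives, false positives, false negatives (0 if undefined)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def multilabel_metrics(y_true, y_pred):
    """Micro-F1, Macro-F1, and Hamming loss for multi-hot label matrices."""
    n_samples, n_labels = len(y_true), len(y_true[0])
    per_label = []
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        per_label.append((tp, fp, fn))
    # Micro-F1 pools counts over labels; Macro-F1 averages per-label F1 scores.
    micro = f1_from_counts(*[sum(c[k] for c in per_label) for k in range(3)])
    macro = sum(f1_from_counts(*c) for c in per_label) / n_labels
    mismatches = sum(
        1 for t, p in zip(y_true, y_pred) for tj, pj in zip(t, p) if tj != pj
    )
    return {
        "micro_f1": micro,
        "macro_f1": macro,
        "hamming_loss": mismatches / (n_samples * n_labels),
    }


# Toy example: 3 movies, 3 genres; the first genre is a frequent "head" class.
y_true = [[1, 0, 1], [1, 1, 0], [1, 0, 0]]
y_pred = [[1, 0, 0], [1, 0, 0], [1, 1, 0]]
metrics = multilabel_metrics(y_true, y_pred)
```

In this toy case Micro-F1 (about 0.67) is twice Macro-F1 (about 0.33) because the head genre is predicted perfectly while both rare genres are missed, which is why both are usually reported together on imbalanced benchmarks.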

Summary of main results (micro-F1, unless specified):

Method	MM-IMDb (Full)	MM-IMDb 2.0	Poster Subset (Fₘ)
IDKG (Li et al., 2023)	0.849	0.828	–
BridgeTow (prior SOTA)	0.682	–	–
MM-GATBT (graphical)	–	0.674	–
Simple SOTA (Mangolin et al., 2020)	0.628*	–	–
Cross-Attn (Nareti et al., 2024)	–	–	0.6823
CLIP bimodal baseline	–	–	0.6481
CentralNet/GMU (poster)	–	–	≤0.5359

*A fusion of synopsis-LSTM and trailer-C3D.

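In this table, the knowledge-guided, attention-regularized IDKG model exceeds the strongest listed prior method by roughly 15 to 17 percentage points of micro-F1 (0.849 vs. 0.682 on MM-IMDb; 0.828 vs. 0.674 on MM-IMDb 2.0).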

6. Key Insights and Modality Contributions

LSTM-encoded plot synopses consistently provide the strongest unimodal performance (F$_1$ ≈ 0.49, AUC$_{PR}$ ≈ 0.68) (Mangolin et al., 2020). Poster images, while informative, are less discriminative in isolation (F$_1$ ≈ 0.41). Deep visual and audio descriptors supply complementary signals but are rarely dominant alone. Cross-modal fusion, especially when architected with explicit attention, contrastive learning, and knowledge-graph priors, yields significant additive and synergistic performance improvements.

Knowledge graphs are especially valuable for rare genres and for supplying semantic relations absent from the raw data. Attention mechanisms, when guided by pseudo-labels, are less prone to drifting than unsupervised attention allocation and can weight modalities adaptively according to instance-specific signal richness (Li et al., 2023). Contrastive objectives tied to genre centroids enhance representation clustering and tail-class coverage. OCR-derived poster text rarely suffices alone but, in cross-attention, can enrich image-based genre cues (Nareti et al., 2024).

7. Limitations and Future Directions

The main limitations are incomplete KG coverage at test time (many entities, especially directors and cast members, are unseen and thus mapped to zero vectors), restricted entity and relation types in current KGs, and limited incorporation of video and audio modalities into end-to-end fusion pipelines (Li et al., 2023). Proposed future work includes inductive entity embedding, construction of multimodal KGs (extending beyond metadata), integration of trailer video and audio tracks, and end-to-end joint pretraining regimes. Extensions to recommendation and summarization are also suggested as natural applications. Expanding modality diversity and KG expressivity would plausibly improve both genre coverage and downstream transferability.