Multi-modal Keyphrase Prediction
- MMKP is the task of fusing text and visual inputs to automatically extract and generate keyphrases for improved document summarization.
- Systems typically use encoder–decoder architectures with training paradigms such as One2One, One2Seq, and One2Set, incorporating cross-modal attention and noise filtering for robust performance.
- By combining generative and copying mechanisms with chain-of-thought reasoning, MMKP systems achieve up to 20–30% F1 improvement in predicting absent keyphrases on benchmark datasets.
Multi-modal Keyphrase Prediction (MMKP) refers to the task of automatically generating or extracting concise, salient phrases that summarize documents with inputs from multiple modalities—most commonly, text and images. Unlike traditional keyphrase prediction, which relies solely on textual content, MMKP seeks to leverage complementary information present in visual elements, as well as other media, to improve the fidelity, relevance, and diversity of keyphrase outputs. The domain has seen rapid methodological evolution, transitioning from classic extractive algorithms to sophisticated encoder–decoder neural frameworks and, more recently, the integration of cross-modal reasoning in large-scale vision-language models (VLMs).
1. Evolution of Keyphrase Prediction Paradigms
Keyphrase prediction research historically centered on extractive methods—algorithms that rank and select n-grams occurring in the text (e.g., TF-IDF, TextRank, RAKE). While such approaches achieved utility in information retrieval and scientific indexing, they could not produce keyphrases absent from the source text, nor could they model underlying semantic relationships. The advent of encoder–decoder architectures and sequence-to-sequence (seq2seq) learning, as highlighted in "Deep Keyphrase Generation" (Meng et al., 2017) and surveys (Çano et al., 2019; Xie et al., 2023), enabled truly abstractive systems capable of generating unseen keyphrases from dense semantic representations. Mechanisms including attention, copying/pointer networks, and coverage terms paved the way for robust generation modules, outperforming extraction baselines by up to 20% in F1 for present keyphrase prediction and achieving 8–15% recall for absent keyphrases at top-50 ranking.
More recent deep learning models adopt three main paradigms:
- One2One: Single document–keyphrase pairs, neglecting inter-keyphrase dependencies.
- One2Seq: Concatenation of all document keyphrases into a token sequence, introducing order bias.
- One2Set: Treats the set of keyphrases as unordered; employs parallel decoding and assignment algorithms (e.g., Hungarian algorithm or optimal transport in (Shao et al., 4 Oct 2024)) to align predictions, thus alleviating training difficulties and order bias.
Models such as CopyRNN, CatSeq, and SetTrans exemplify these paradigms, with empirical evidence showing superior optimization and predictive performance for One2Set strategies.
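To make the One2Set alignment concrete, the following is a minimal sketch, under assumed per-pair cost values, of matching parallel decoder slots to ground-truth keyphrases with the Hungarian algorithm via scipy.optimize.linear_sum_assignment; it illustrates the general idea rather than the exact SetTrans or (Shao et al., 4 Oct 2024) formulation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def one2set_match(nll_costs: np.ndarray):
    """Match parallel decoder slots to ground-truth keyphrases.

    nll_costs[i, j] is the negative log-likelihood of ground-truth
    keyphrase j under decoder slot i. The Hungarian algorithm returns
    the minimum-cost one-to-one matching, which defines each slot's
    supervision target without imposing any keyphrase order.
    """
    slot_idx, target_idx = linear_sum_assignment(nll_costs)
    return slot_idx, target_idx

# Toy example: 4 decoder slots, 3 ground-truth keyphrases.
rng = np.random.default_rng(0)
costs = rng.random((4, 3))              # hypothetical per-pair NLL values
slots, targets = one2set_match(costs)
for s, t in zip(slots, targets):
    print(f"slot {s} <- keyphrase {t} (cost {costs[s, t]:.3f})")
# Slots left unmatched would be trained to emit a null keyphrase in a
# full One2Set setup.
```

Because the matching is recomputed at each training step from the model's own scores, no fixed target order is ever imposed, which is what alleviates the order bias of One2Seq.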
2. Integration of Multi-modal Inputs
MMKP systems generalize textual encoder–decoder frameworks by incorporating modality-specific feature extractors. Text is typically encoded with RNNs (GRU, LSTM) or Transformers, while for images:
- CNNs or region-based detectors (e.g., VGG, Faster-RCNN) extract global and region-level visual features.
- Image attributes (semantic tags derived via attribute predictors) and OCR tokens are appended to text encodings to bridge the modality gap (Wang et al., 2020, Dong et al., 2023).
Fusion is achieved by cross-modal attention mechanisms, where encoded representations from text and image are aligned or jointly attended. Multi-head attention constructions such as M³H-Att (Wang et al., 2020) model complex pairwise interactions, enabling simultaneous focus on multiple granularities (e.g., object regions vs. textual tokens). Noise filtering modules further refine visual inputs by scoring image–text matching and region–text correlation, sometimes employing externally sourced visual entities for enriched semantic cues (Dong et al., 2023).
| Modality | Feature Encoder | Fusion Mechanism |
|---|---|---|
| Text | Bi-GRU, Transformer, LLM | Attention, copying, type embeddings |
| Image | VGG, Faster-RCNN, OCR | Multi-modal attention, noise filtering |
| External Entity | API attributes, OCR | Sequence concat, type tokens |
This integration allows MMKP systems to exploit non-textual clues for both present and absent keyphrase prediction.
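To make the fusion step concrete, below is a minimal PyTorch sketch of cross-modal multi-head attention in the spirit of M³H-Att-style designs; the dimensions, module names, and mean-pooling choice are illustrative assumptions, not a reproduction of any specific published architecture.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Text tokens attend over image-region features and vice versa;
    the two pooled views are then projected into a joint representation."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.txt2img = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.img2txt = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.proj = nn.Linear(2 * d_model, d_model)

    def forward(self, text_feats: torch.Tensor, image_feats: torch.Tensor):
        # text_feats:  (B, T, d) encoder outputs for source tokens
        # image_feats: (B, R, d) region-level features (e.g., Faster-RCNN)
        t_attn, _ = self.txt2img(text_feats, image_feats, image_feats)
        i_attn, _ = self.img2txt(image_feats, text_feats, text_feats)
        fused = torch.cat([t_attn.mean(dim=1), i_attn.mean(dim=1)], dim=-1)
        return self.proj(fused)               # (B, d) joint representation

fusion = CrossModalFusion()
joint = fusion(torch.randn(2, 40, 512), torch.randn(2, 36, 512))
print(joint.shape)                             # torch.Size([2, 512])
```

A noise-filtering module would typically sit in front of the image branch, down-weighting regions whose image–text matching score is low.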
3. Algorithms and Training Objectives
Multi-modal encoder–decoder frameworks are trained to optimize generation, copying, and classification losses:
- Generative probability: from decoder softmax or pointer network.
- Copying probability: sums attention over source text or visual OCR tokens.
- Unified objective: a weighted sum of the two, e.g., $P(y_t) = \lambda_t\, P_{\text{gen}}(y_t) + (1-\lambda_t)\, P_{\text{copy}}(y_t)$, with the gate $\lambda_t$ adaptively determined.
- Matching and correlation scores in image–text filtering: used to score region-wise alignment between image regions and text.
- Supervision signal assignment: Optimal transport formulations match ground-truth keyphrases to control codes for parallel decoding (Shao et al., 4 Oct 2024), promoting higher recall and better supervision signal distribution.
Loss functions include token-level cross-entropy for generation, cross-modal classification, and region-text divergence to guide image region selection.
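The unified generation-plus-copy probability listed above can be sketched as follows; the tensor shapes, gating input, and scatter-based copy distribution are simplified assumptions rather than the exact CopyRNN or M³H-Att formulation.

```python
import torch

def unified_probability(p_gen_vocab, attn_weights, src_token_ids, gate):
    """P(y_t) = lambda_t * P_gen(y_t) + (1 - lambda_t) * P_copy(y_t).

    p_gen_vocab:   (B, V) decoder softmax over the vocabulary
    attn_weights:  (B, S) attention over source text / OCR tokens
    src_token_ids: (B, S) vocabulary ids of those source tokens
    gate:          (B, 1) adaptively predicted mixing weight lambda_t
    """
    # Copy distribution: scatter attention mass onto the source tokens' ids.
    p_copy = torch.zeros_like(p_gen_vocab)
    p_copy.scatter_add_(1, src_token_ids, attn_weights)
    return gate * p_gen_vocab + (1.0 - gate) * p_copy

B, V, S = 2, 1000, 12
p_final = unified_probability(
    torch.softmax(torch.randn(B, V), dim=-1),   # generative distribution
    torch.softmax(torch.randn(B, S), dim=-1),   # copy attention
    torch.randint(0, V, (B, S)),                # source token ids
    torch.sigmoid(torch.randn(B, 1)),           # adaptive gate
)
print(p_final.sum(dim=-1))                      # each row sums to 1
```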
| Objective | LaTeX/Formula |
|---|---|
| Precision@k | $P@k = \frac{\lvert \hat{Y}_{1:k} \cap Y \rvert}{k}$, with $\hat{Y}_{1:k}$ the top-$k$ predictions and $Y$ the ground-truth keyphrases |
| Unified generation probability | $P(y_t) = \lambda_t\, P_{\text{gen}}(y_t) + (1-\lambda_t)\, P_{\text{copy}}(y_t)$ |
| Optimal transport for One2Set | $\min_{\pi \ge 0} \sum_{i,j} \pi_{ij} C_{ij}$, s.t. $\pi \mathbf{1} = \mathbf{a}$, $\pi^{\top} \mathbf{1} = \mathbf{b}$ |
| Dynamic CoT loss (Ma et al., 10 Oct 2025) | $\mathcal{L}_i = \mathbb{1}[\mathrm{hard}(i)]\, \mathcal{L}^{\mathrm{CoT}}_i + (1 - \mathbb{1}[\mathrm{hard}(i)])\, \mathcal{L}^{\mathrm{SFT}}_i$, with hardness determined by a loss threshold $\tau$ |
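For the optimal-transport supervision assignment, a transport plan can be approximated with Sinkhorn iterations, as in the minimal NumPy sketch below; the cost matrix, uniform marginals, and regularization strength are illustrative assumptions and not the exact formulation of (Shao et al., 4 Oct 2024).

```python
import numpy as np

def sinkhorn(cost, a, b, eps=0.1, n_iters=200):
    """Approximate the transport plan pi minimizing <pi, cost>
    subject to pi @ 1 = a and pi.T @ 1 = b (entropic regularization eps)."""
    K = np.exp(-cost / eps)                  # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)                    # scale columns to match b
        u = a / (K @ v)                      # scale rows to match a
    return u[:, None] * K * v[None, :]       # transport plan

# Toy example: 4 prediction slots vs. 3 ground-truth keyphrases.
rng = np.random.default_rng(0)
cost = rng.random((4, 3))                    # hypothetical mismatch costs
a = np.full(4, 1 / 4)                        # uniform mass over slots
b = np.full(3, 1 / 3)                        # uniform mass over keyphrases
plan = sinkhorn(cost, a, b)
print(plan.round(3), plan.sum())             # soft assignment, total mass ~1
```

Unlike the hard Hungarian matching sketched in Section 1, the resulting plan distributes supervision softly across slots, the property credited with improving recall.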
4. Vision-Language Models and Chain-of-Thought Reasoning
The latest advances leverage vision-language models (VLMs) pretrained on large multi-modal corpora (Ma et al., 10 Oct 2025). These models bring strong generalization capacity that can be harnessed across seen, absent, and unseen keyphrase scenarios.
- Zero-shot VLMs provide a lower bound, showing limited absent/unseen performance.
- Supervised fine-tuning (SFT) allows direct autoregressive training for MMKP, optimizing cross-entropy over concatenated prompts and keyphrase outputs.
- Fine-tune-CoT introduces chain-of-thought reasoning with teacher-generated rationales, enabling better cross-modal understanding and improved absent/unseen scenario prediction, at the expense of computational and inference efficiency.
- Dynamic CoT adaptively applies chain-of-thought training to hard samples, identified by comparing each sample's loss against a threshold $\tau$, while using standard supervision elsewhere to balance reasoning depth against overthinking (see the sketch after this list).
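A minimal sketch of the dynamic routing idea follows; here hard samples are assumed to be those whose current loss exceeds the threshold $\tau$, and the field names and target format are hypothetical, so the exact criterion should be taken from (Ma et al., 10 Oct 2025).

```python
def route_supervision(samples, losses, tau=2.0):
    """Dynamic CoT routing: hard samples receive a rationale-augmented
    (chain-of-thought) target, easy samples keep the plain keyphrase target.

    samples: dicts with 'keyphrases' and teacher 'rationale' strings
             (hypothetical field names).
    losses:  per-sample cross-entropy of the current model on the plain target.
    """
    targets = []
    for sample, loss in zip(samples, losses):
        if loss > tau:                        # assumed hardness criterion
            targets.append(sample["rationale"] + "\n" + sample["keyphrases"])
        else:
            targets.append(sample["keyphrases"])
    return targets

batch = [
    {"keyphrases": "glacier retreat; climate change",
     "rationale": "The image shows a shrinking ice sheet, so ..."},
    {"keyphrases": "world cup; football",
     "rationale": "The caption names the tournament explicitly, so ..."},
]
print(route_supervision(batch, losses=[3.1, 0.4]))
```

Routing only hard samples through chain-of-thought supervision is what keeps inference-time overthinking and training cost in check.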
These strategies are empirically shown to yield up to 20–30% F1 improvement for unseen keyphrase prediction on the MMKP-360k dataset.
5. Datasets, Evaluation Protocols, and Benchmarking
Key datasets include KP20k, Inspec, NUS, Krapivin, SemEval, TRC, and large-scale Twitter multi-modal collections (Çano et al., 2019; Xie et al., 2023; Dong et al., 2023; Wang et al., 2020). Evaluations use macro-averaged Precision, Recall, and F1@k; when the number of predicted keyphrases varies, the F1@O and F1@M variants are used. The distinction between present and absent keyphrase accuracy is emphasized, with special focus on absent (non-verbatim) prediction and unseen (out-of-training) generalization.
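As a reference point, F1@k for a single document can be computed as in the sketch below, with scores macro-averaged over documents afterwards; the normalization step (lower-casing, stemming, deduplication) varies across papers and is an assumption here.

```python
def f1_at_k(predicted, gold, k=5):
    """P@k, R@k, F1@k for one document.

    predicted: ranked list of predicted keyphrases, already normalized
               (e.g., lower-cased and stemmed).
    gold:      set of ground-truth keyphrases under the same normalization.
    """
    top_k = predicted[:k]
    correct = sum(1 for p in top_k if p in gold)
    precision = correct / k if k else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

preds = ["neural network", "keyphrase generation", "attention", "dataset", "twitter"]
gold = {"keyphrase generation", "attention", "multimodal fusion"}
print(f1_at_k(preds, gold, k=5))   # (0.4, 0.666..., 0.5)
```

In the F1@M variant, k is set to the number of keyphrases the model actually predicts; F1@O analogously adapts the cutoff, typically to the number of ground-truth keyphrases.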
Benchmarking reveals the inadequacy of training and evaluating on datasets with high train–test keyphrase overlap, which inflates perceived model performance and fails to test cross-modal reasoning or novelty. A plausible implication is that true progress in MMKP requires carefully designed splits and resampling to challenge models' ability to predict unseen or semantically derived keyphrases (Ma et al., 10 Oct 2025).
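One way to audit a split for the overlap problem described above is to measure how many test keyphrases already occur verbatim in the training set, as in this sketch; the normalization and the overlap definition are assumptions.

```python
def keyphrase_overlap(train_docs, test_docs):
    """Fraction of test keyphrases (after simple normalization) that
    already appear as keyphrases somewhere in the training split."""
    def norm(kp: str) -> str:
        return kp.lower().strip()

    train_kps = {norm(kp) for doc in train_docs for kp in doc["keyphrases"]}
    test_kps = [norm(kp) for doc in test_docs for kp in doc["keyphrases"]]
    if not test_kps:
        return 0.0
    seen = sum(1 for kp in test_kps if kp in train_kps)
    return seen / len(test_kps)

train = [{"keyphrases": ["world cup", "football"]}]
test = [{"keyphrases": ["world cup", "glacier retreat"]}]
print(keyphrase_overlap(train, test))   # 0.5
```

A high value signals that a model can score well by memorizing training keyphrases rather than by cross-modal reasoning.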
6. Challenges, Innovations, and Emerging Directions
Key challenges for MMKP:
- Alignment: Ensuring cross-modal semantic correspondence, especially with noisy/irrelevant image regions.
- Absent/unseen keyphrases: Achieving robust generation for phrases not explicitly present in the source modalities.
- Overfitting: models latch onto present keyphrases, a risk amplified in benchmarks with high train–test overlap.
- Noise and redundancy: Filtering irrelevant visual regions and reducing semantic repetition in generated keyphrase sets.
Innovations such as multi-granularity image noise filtering (Dong et al., 2023), visual entity enhancement via externally sourced semantic descriptors, and dynamic cross-modal fusion strategies address these issues. Optimal transport and sequence labeling via LLM selectors increase recall and precision (Shao et al., 4 Oct 2024). Chain-of-thought dynamic learning in VLMs further boosts reasoning for challenging cases (Ma et al., 10 Oct 2025).
A plausible implication is that future work will increasingly fuse multi-modal reasoning modules (including machine rationale generation), robust matching/filtering, and large, diverse datasets with low train–test overlap to produce high-quality, contextually rich keyphrases useful for retrieval, recommendation, and automated content organization.
7. Practical Applications and Impact
MMKP systems offer tangible benefits for indexing, retrieval, and recommendation in multi-modal environments—especially on social media platforms where both text and imagery inform user intent. Automated hashtag generation, enriched metadata, and theme extraction aid both end-users and content providers in navigating increasingly complex media landscapes. Precision improvements, particularly for absent and semantic keyphrases, directly translate to higher relevance in search and recommendation, while noise filtering and redundancy reduction enhance user-facing summarization. Public codebases such as https://github.com/bytedance/DynamicCoT (Ma et al., 10 Oct 2025) and https://github.com/DeepLearnXMU/MM-MKP (Dong et al., 2023) exemplify practical resources for further development and deployment of MMKP frameworks.