Multimodal Stance Detection (MSD)

Updated 15 November 2025
  • MSD is the task of inferring an author’s stance from text, images, and contextual cues, enabling analysis of complex multimodal communication.
  • State-of-the-art systems leverage transformer-based encoders and cross-modal fusion to address modality misalignment and reduce noisy signal integration.
  • Recent advances incorporate dual-reasoning, chain-of-thought prompting, and efficient fine-tuning to improve scalability, cross-target generalization, and interpretability.

Multimodal Stance Detection (MSD) is the computational task of inferring an author’s stance (e.g., favor, against, neutral) toward a specific target by jointly interpreting multiple data modalities—most commonly natural language text and images, but potentially including audio, video, social/contextual, or structural cues. MSD has become a central research area in NLP and multimodal machine learning, particularly as social media discourse increasingly involves richly multimodal content such as memes, infographics, and conversational threads. MSD systems go beyond classical text-only stance detection by leveraging cross-modal interactions, enabling disambiguation of implicit cues, sarcasm, and visually-grounded opinion signals that are otherwise opaque to unimodal methods (Pangtey et al., 13 May 2025).

1. Task Formalization and Core Challenges

The fundamental problem in MSD is to model the conditional probability $p_\theta(y \mid T, I, C, V, S)$, where $T$ is textual input (e.g., post or transcript), $I$ is an associated image or frame, $C$ is optional conversation/context history, $V$ represents possible video or audio, and $S$ may include social or structural features. The stance label $y$ is typically categorical, e.g., $y \in \{\text{favor}, \text{against}, \text{neutral}\}$.
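
For concreteness, below is a minimal sketch of how an MSD instance covering these inputs might be represented; the `MSDInstance` dataclass and its field names are illustrative assumptions rather than the schema of any cited dataset.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Illustrative label space; papers also use Support/Oppose/Neutral or Favor/Against/None.
STANCE_LABELS = ["favor", "against", "neutral"]

@dataclass
class MSDInstance:
    """One multimodal stance example, i.e., the inputs of p_theta(y | T, I, C, V, S)."""
    target: str                                        # stance target (policy, entity, claim, ...)
    text: str                                          # T: post text or transcript
    image_path: Optional[str] = None                   # I: associated image or video frame
    context: List[str] = field(default_factory=list)   # C: conversation/context history
    video_path: Optional[str] = None                   # V: optional video or audio
    social_features: Optional[dict] = None              # S: social or structural cues
    label: Optional[str] = None                         # y in STANCE_LABELS (None at inference time)
```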

Key technical and conceptual challenges are:

  • Modality misalignment: Text and image content may carry weakly correlated or contradictory stance cues, frequently observed in memes or sarcastic posts (Pangtey et al., 13 May 2025, Wang et al., 8 Nov 2025).
  • Noisy fusion: Generic early/late fusion architectures risk amplifying irrelevant modal signals, especially when one modality lacks stance information (Wang et al., 26 Sep 2024).
  • Implicit and compositional cues: Stance expressions are often subtle, sarcastic, or context-dependent, requiring multi-hop reasoning or conversational grounding (Niu et al., 1 Sep 2024).
  • Data scarcity and annotation difficulty: Comprehensive, high-quality multimodal stance datasets are costly and domain-limited; inter-annotator reliability varies (e.g., MultiClimate’s $\kappa = 0.703$) (Wang et al., 26 Sep 2024).
  • Scalability and generalization: Models should function robustly “in-target” (same domain) and “cross-target” (domain shift), with minimal labeled data in the latter (Khiabani et al., 2023, Niu et al., 1 Sep 2024).

2. Datasets and Benchmarking in MSD

State-of-the-art progress in MSD is tightly linked to the development of annotated, multimodal corpora:

| Dataset (Paper/Year) | Modalities | Size / Domain | Label Set | Key Features |
|---|---|---|---|---|
| MultiClimate (Wang et al., 26 Sep 2024) | Video frames, text | 4,209 pairs / CC | Support, Oppose, Neutral | High IAA ($\kappa = 0.703$); sentence–frame alignments |
| MmMtCSD (Niu et al., 1 Sep 2024) | Text, image, conversation | 21,340 threads / Reddit | Favor, Against, None | Multi-turn; vision-dependency flags |
| MMSD (Barel et al., 4 Dec 2024, Wang et al., 8 Nov 2025) | Text, image | 17,544 tweets / 5 domains | Targeted stance; multi-granular | Twitter-based; topic diversity |
| Fakeddit (Pangtey et al., 13 May 2025) | Text, image, metadata | 1M+ posts / Reddit | 7 labels | Noisy, web-scale; veracity/stance |

Annotation quality in these datasets is routinely assessed via Cohen’s $\kappa$ and majority-vote agreement; class imbalance and vision/text dependency rates are carefully documented. Datasets such as MultiClimate provide explicit train/dev/test splits and per-class label counts, supporting fair external comparisons (Wang et al., 26 Sep 2024).
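
As a reference point, inter-annotator agreement of the kind reported for MultiClimate can be computed with scikit-learn’s `cohen_kappa_score`; the annotator label lists below are dummy placeholders, not data from any cited corpus.

```python
from sklearn.metrics import cohen_kappa_score

# Placeholder stance labels from two annotators over the same items.
annotator_a = ["support", "oppose", "neutral", "support", "oppose", "support"]
annotator_b = ["support", "oppose", "support", "support", "oppose", "neutral"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.3f}")  # values around 0.7 are generally read as substantial agreement
```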

3. Model Architectures and Fusion Strategies

The dominant MSD pipeline decomposes into unimodal encoders, multimodal fusion, and classifier:

  • Text encoders: Transformer-based models (BERT, RoBERTa, LLaMA2), often prompt- or domain-adapted (Liang et al., 22 Feb 2024).
  • Vision encoders: ResNet-50, ViT-base, or more recent vision-LLMs (CLIP, BLIP, SigLIP) (Wang et al., 26 Sep 2024, Vasilakes et al., 29 Jan 2025).
  • Fusion mechanisms:
    • Early fusion: Concatenation of [CLS] text and penultimate visual features, followed by a shallow MLP, e.g., MultiClimate’s $h_\text{fusion} = \text{MLP}([h_\text{text}; h_\text{vision}])$ (Wang et al., 26 Sep 2024); a minimal sketch follows this list.
    • Cross-modal Transformers: Token-level cross-attention between text and image representations, outperforming simple concatenation by 3–7 F1 points on several datasets (Pangtey et al., 13 May 2025).
    • Prompt tuning: Target-centric textual and visual prompts prepended to input streams, e.g., TMPT introduces per-target prompt tokens in both the BERT and ViT pipelines (Liang et al., 22 Feb 2024).
    • Experience-driven fusion: ReMoD implements dual-reasoning via dynamic weighting of modality contributions, adaptively re-weighting based on prior context and realized utility per sample (Wang et al., 8 Nov 2025).
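
The early-fusion recipe above reduces to a few lines of PyTorch. The sketch below assumes pre-extracted [CLS] text embeddings and pooled vision embeddings, with illustrative dimensions and hidden sizes rather than any paper’s exact hyperparameters; a cross-modal attention variant would replace the concatenation with token-level `torch.nn.MultiheadAttention` between text and image features.

```python
import torch
import torch.nn as nn

class EarlyFusionStanceClassifier(nn.Module):
    """Concatenate text and vision embeddings, then classify stance with a shallow MLP."""
    def __init__(self, text_dim=768, vision_dim=768, hidden_dim=512, num_classes=3):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(text_dim + vision_dim, hidden_dim),  # h_fusion = MLP([h_text; h_vision])
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, h_text, h_vision):
        # h_text: (batch, text_dim) [CLS] embedding; h_vision: (batch, vision_dim) pooled image feature
        return self.mlp(torch.cat([h_text, h_vision], dim=-1))

# Usage with dummy tensors standing in for BERT/ViT outputs:
model = EarlyFusionStanceClassifier()
logits = model(torch.randn(4, 768), torch.randn(4, 768))  # (4, 3) stance logits
```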

Recent architectures increasingly integrate LLMs with LoRA-style parameter-efficient fine-tuning for scalable multimodal adaptation and to facilitate chain-of-thought or reasoning-step decompositions (MLLM-SD (Niu et al., 1 Sep 2024), ReMoD (Wang et al., 8 Nov 2025)).
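
A hedged sketch of LoRA-style parameter-efficient fine-tuning with Hugging Face’s `peft` library is shown below; the base model name and target modules are placeholder assumptions and would be swapped for whichever (multimodal) backbone a given MSD system adapts.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

# Placeholder backbone; cited MSD systems build on LLaMA-2-scale or multimodal LLMs.
base_model_name = "meta-llama/Llama-2-7b-hf"
model = AutoModelForCausalLM.from_pretrained(base_model_name)
tokenizer = AutoTokenizer.from_pretrained(base_model_name)

lora_config = LoraConfig(
    r=8,                                   # low-rank update dimension
    lora_alpha=16,                         # scaling factor for the LoRA updates
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt (model-dependent)
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters are updated during fine-tuning
```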

4. Evaluation Metrics

Standard metrics include accuracy, per-class Precision/Recall/F1, and macro or weighted averages under class imbalance. For a multiclass setting with per-class true positives $TP$, false positives $FP$, and false negatives $FN$:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN}, \quad F_1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

Weighted F1 averages per-class F1 scores proportional to class prevalence (Wang et al., 26 Sep 2024, Pangtey et al., 9 Sep 2025).
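
These metrics map directly onto scikit-learn’s utilities; the gold and predicted label arrays in the sketch below are dummy placeholders.

```python
from sklearn.metrics import precision_recall_fscore_support, f1_score

# Placeholder gold and predicted stance labels.
y_true = ["favor", "against", "neutral", "favor", "against", "favor"]
y_pred = ["favor", "against", "favor",   "favor", "neutral", "favor"]

# Per-class precision / recall / F1 and support counts.
prec, rec, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=["favor", "against", "neutral"], zero_division=0
)

# Macro F1 weights classes equally; weighted F1 scales each class's F1 by its prevalence.
macro_f1 = f1_score(y_true, y_pred, average="macro")
weighted_f1 = f1_score(y_true, y_pred, average="weighted")
print(f"macro F1 = {macro_f1:.3f}, weighted F1 = {weighted_f1:.3f}")
```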

Empirical findings on modality contributions and cross-target generalization are summarized in Section 6.

5. Recent Methodological Advances

Several methodological innovations characterize the state of the art:

  • Dual-reasoning with adaptive memory (ReMoD): Instead of static fusion, ReMoD dynamically queries modality-specific “experience pools” to form a stance hypothesis (System 1), then refines the weights by cross-modal and semantic chain-of-thought inference (System 2). Ablations show substantial performance drops when memory or CoT modules are removed. This paradigm achieves improved in-target and zero-shot performance over both unimodal and traditional multimodal baselines (Wang et al., 8 Nov 2025).
  • Chain-of-thought prompting and captioning: MLLM-SD and related models use a “prompted chain” that includes synthetic image captions (from GPT-4 Vision) or explicit one-shot reasoning examples, boosting the model’s handling of implicit visual cues and complex conversational dynamics; a hedged prompt-construction sketch appears after this list. Removing captions or CoT components reduces F1 by up to 7 points (Niu et al., 1 Sep 2024).
  • PEFT and LoRA-style adaptation: Scaling multimodal MSD to large domains is constrained by computational/environmental costs (“Green NLP” concern). Efficient fine-tuning strategies (LoRA, adapters) facilitate training of large backbone models without full parameter updates (Niu et al., 1 Sep 2024, Pangtey et al., 13 May 2025).
  • Social/structural embedding fusion: Architectural advances such as TASTE inject interaction-graph embeddings into gated fusion with content Transformers, yielding robust performance in both dense conversational and sparse debate domains (Barel et al., 4 Dec 2024).
  • Few-shot/cross-target learning: Explicit modeling of target shifts via meta-learning, prompt-injection, or social graph voting (CT-TN) substantially enhances cross-domain stance generalization, especially under limited target data (Khiabani et al., 2023).
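
To make the chain-of-thought and captioning idea concrete, the sketch below assembles a stance prompt from an automatically generated image caption and conversational context. The prompt wording and the `query_mllm` interface are hypothetical illustrations, not the prompts or APIs used by MLLM-SD or ReMoD.

```python
from typing import List

def build_stance_prompt(target: str, text: str, image_caption: str, context: List[str]) -> str:
    """Compose a chain-of-thought stance prompt from text, an image caption, and conversation history."""
    history = "\n".join(f"- {turn}" for turn in context) or "- (none)"
    return (
        f"Target: {target}\n"
        f"Conversation history:\n{history}\n"
        f"Image caption: {image_caption}\n"
        f"Post: {text}\n\n"
        "Reason step by step about cues in the text, the image, and the context, "
        "then answer with exactly one of: favor, against, neutral."
    )

# Hypothetical usage; query_mllm would wrap whichever multimodal LLM is being evaluated.
prompt = build_stance_prompt(
    target="carbon tax",
    text="Another 'great' policy from the usual suspects...",
    image_caption="A protest sign reading 'Axe the tax'.",
    context=["User A: The new carbon tax bill passed today."],
)
# answer = query_mllm(prompt, image=...)  # assumed interface, not a real API call
```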

6. Modality Contribution, Generalization, and Open Issues

Systematic benchmarking reveals:

  • Text remains the dominant carrier of stance signals, but vision is critical in specific contexts (e.g., when text is brief, vague, or refers to visual evidence via deixis). In MultiClimate, text-only BERT achieves 0.705 F1, vision-only ViT 0.462, joint BERT+ViT 0.749, and state-of-the-art context-augmented fusion 0.762 (Wang et al., 26 Sep 2024, Pangtey et al., 9 Sep 2025).
  • Vision’s value is highest when images directly encode stance (charts, written slogans, expressive memes) or when conversation history leaves text ambiguous (Vasilakes et al., 29 Jan 2025, Niu et al., 1 Sep 2024).
  • Multimodal MSD is challenging in zero-shot and cross-target regimes; memory-based or domain-adaptive models (ReMoD, TMPT+CoT, MLLM-SD) provide more stability than naïve fusion (Wang et al., 8 Nov 2025, Liang et al., 22 Feb 2024, Niu et al., 1 Sep 2024).
  • Large-scale VLMs (e.g., Ovis 1.6, Qwen2-VL) show limited performance gains from vision: text is the primary driver across languages, with most F1 improvements for “in-image text” rather than pictorial aspects. The value of image content is consistently less than that of textual elements, regardless of multilingual context (Vasilakes et al., 29 Jan 2025).
  • Multilingual MSD remains underexplored; performance consistency varies widely by model, language, and vision pipeline (Vasilakes et al., 29 Jan 2025).

Open issues include modality misalignment (e.g., sarcastic memes), explainability of fused decisions, adaptation to low-resource and non-English contexts, and the integration of richer modality signals (audio, video, conversational/discourse structure) (Pangtey et al., 13 May 2025, Wang et al., 26 Sep 2024). Label scarcity and annotation cost for emerging targets further constrain robust deployment.

7. Perspectives and Future Directions

Active research directions motivated by recent advances and evaluation gaps include:

  • Explainable and dynamic fusion: Design of architectures with per-sample interpretability (e.g., CoT traces, attention or gating scores), combined with retrieval-augmented and knowledge-injected LLMs (Wang et al., 8 Nov 2025, Pangtey et al., 13 May 2025).
  • Multimodal reasoning at scale: Expansion to larger, multi-turn, multi-user datasets (e.g., broader Reddit, YouTube, or multi-lingual sources), and incorporation of additional modalities such as audio or contextually-relevant video (Wang et al., 26 Sep 2024, Niu et al., 1 Sep 2024).
  • Efficient and eco-friendly adaptation: Leveraging PEFT, adapter-tuning, and meta-learning to minimize compute and data costs across domains and targets (“Green NLP”) (Niu et al., 1 Sep 2024, Pangtey et al., 13 May 2025).
  • Cross-lingual and cross-cultural robustness: Development of culture-aware fusions and evaluation on truly low-resource languages, addressing annotation and translation artifacts (Vasilakes et al., 29 Jan 2025).
  • Integration with social context: Social graph structure, user-level features, and network embeddings can significantly enhance stance resolution, especially in ambiguous or adversarial settings (Khiabani et al., 2023, Barel et al., 4 Dec 2024).
  • Comprehensive benchmarks and error typologies: Public release and standardization of larger, more diverse MSD datasets, with rich annotation of multimodal reasoning steps, error cases, and modality dependencies (Liang et al., 22 Feb 2024, Wang et al., 26 Sep 2024, Niu et al., 1 Sep 2024).

MSD thus sits at the intersection of multimodal machine learning, social media analysis, and language/vision reasoning, with dynamic advances in dataset design, architectural fusion, and few-shot/extensible learning motivating ongoing work toward more accurate, robust, and interpretable stance understanding systems.
