Multimodal Analysis of State-Funded News Coverage of the Israel-Hamas War on YouTube Shorts

This presentation examines a systematic multimodal pipeline designed to analyze how state-funded news outlets cover the Israel-Hamas war through YouTube Shorts. The study integrates automatic speech transcription, aspect-based sentiment analysis, and visual scene classification across over 2,300 videos from Al Jazeera, BBC, Deutsche Welle, and TRT World. The research reveals striking polarization patterns in both linguistic and visual content, demonstrates that smaller domain-adapted models can outperform large language models in political sentiment analysis, and provides a scalable framework for studying conflict coverage in algorithmic video environments.
Script
How do state-funded news outlets frame the Israel-Hamas war in the age of algorithmic short-form video? Four major broadcasters, over 2,300 YouTube Shorts, nearly 95,000 visual frames, and a year of coverage reveal systematic patterns in both what is said and what is shown.
The researchers built a three-stage pipeline to dissect both linguistic and visual signals. Speech becomes structured text. Text becomes categorized sentiment toward specific political actors and entities. Visual frames become semantic scene types, capturing everything from destruction imagery to formal diplomatic events.
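Below is a minimal Python sketch of what such a three-stage pipeline could look like. The specific components here (Whisper for transcription, an off-the-shelf sentiment checkpoint, CLIP for zero-shot frame labeling) and the scene label set are illustrative assumptions, not the study's actual models.

```python
# Minimal illustrative sketch of a three-stage Shorts pipeline:
# speech -> text -> sentiment, plus sampled frames -> scene labels.
# Whisper, the sentiment checkpoint, CLIP, and SCENE_LABELS are all
# stand-in assumptions, not the study's actual components.
import cv2
import whisper
from PIL import Image
from transformers import pipeline

asr = whisper.load_model("base")  # stage 1: automatic speech transcription
sentiment = pipeline(
    "text-classification",
    model="cardiffnlp/twitter-roberta-base-sentiment-latest",
)  # stage 2: sentiment over the transcript
scene_classifier = pipeline(
    "zero-shot-image-classification",
    model="openai/clip-vit-base-patch32",
)  # stage 3: semantic scene typing of sampled frames

SCENE_LABELS = [  # hypothetical seven-way label set
    "destruction", "humanitarian crisis", "protest",
    "diplomatic event", "military activity", "studio commentary", "other",
]

def analyze_short(video_path: str, frame_stride: int = 30) -> dict:
    transcript = asr.transcribe(video_path)["text"]
    tone = sentiment(transcript[:512])[0]  # crude truncation to fit the model

    # Sample roughly one frame per second for a 30 fps Short.
    frames, cap, i = [], cv2.VideoCapture(video_path), 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if i % frame_stride == 0:
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
        i += 1
    cap.release()

    labels = [
        scene_classifier(f, candidate_labels=SCENE_LABELS)[0]["label"]
        for f in frames
    ]
    return {"transcript": transcript, "sentiment": tone, "scenes": labels}
```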
Here's a finding that challenges current trends in natural language processing. The study found that a smaller, domain-adapted model outperformed both its larger sibling and a 7-billion-parameter language model at sentiment analysis of political transcripts. Task-specific fine-tuning on political discourse data proved more valuable than sheer parameter count, yielding a macro-F1 score of 81.9%.
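For context on the metric: macro-F1 averages per-class F1 scores with equal weight, so a classifier can't score well simply by favoring the dominant sentiment class. A toy illustration with invented labels, not the study's data:

```python
# Toy illustration of macro- vs. micro-averaged F1 on invented
# three-way sentiment labels. Macro averaging weights each class
# equally, which matters when one sentiment class dominates.
from sklearn.metrics import f1_score

y_true = ["neg", "neg", "neg", "neg", "neu", "neu", "pos", "pos"]
y_pred = ["neg", "neg", "neg", "pos", "neu", "neu", "neu", "pos"]

print(f1_score(y_true, y_pred, average="macro"))  # per-class mean, ~0.72
print(f1_score(y_true, y_pred, average="micro"))  # accuracy-like, 0.75
```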
The sentiment patterns are stark and stable. Nearly half of all coverage carries emotional weight, skewing negative. Two outlets maintain consistent polarization throughout the year, while two others stay in neutral territory. But here's the twist: the most engaging videos often pair dramatic visual scenes with minimal spoken commentary, letting the imagery do the emotional work.
The visual layer tells its own story. Seven scene categories capture the semantic distribution of what audiences actually see. Destruction and humanitarian crisis imagery surge at key moments, often detached from voice-over. Protest scenes track real-world mobilization. TRT World uses destruction and protest imagery disproportionately, an editorial choice that amplifies emotional salience and drives engagement.
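A hedged sketch of how such an outlet-level distribution comparison might be computed, assuming per-frame scene labels already exist; the labels below are invented placeholders, not the study's measurements:

```python
# Hypothetical aggregation of per-frame scene labels into per-outlet
# shares, the kind of distribution behind the editorial comparison above.
from collections import Counter

def scene_shares(labels_by_outlet: dict[str, list[str]]) -> dict[str, dict[str, float]]:
    shares = {}
    for outlet, labels in labels_by_outlet.items():
        counts = Counter(labels)
        total = sum(counts.values())
        shares[outlet] = {label: n / total for label, n in counts.items()}
    return shares

# Invented placeholder data for demonstration only.
demo = {
    "TRT World": ["destruction", "protest", "destruction", "diplomatic event"],
    "BBC": ["diplomatic event", "studio commentary", "protest", "destruction"],
}
for outlet, dist in scene_shares(demo).items():
    print(outlet, {k: round(v, 2) for k, v in dist.items()})
```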
These patterns expose how algorithmic short-form video reshapes conflict coverage.
The pipeline offers a scalable, reproducible method for analyzing digital discourse in video environments where text, speech, and visuals interweave to shape perception. The findings reveal that smaller, domain-adapted models can outperform large language models in specialized tasks, and that polarization in conflict coverage is encoded through subtle multimodal interplay rather than overt alignment. To explore more research like this and create your own videos, visit EmergentMind.com.