
Applying Large Language Models to Multimodal Content Analysis

Determine effective approaches for applying large language models (LLMs) to multimodal content analysis that integrates textual and visual inputs, and establish whether and how LLMs can reliably analyze such content.


Background

The paper studies zero-shot character identification and speaker prediction in comics by integrating textual and visual information through an iterative multimodal framework. While LLMs have demonstrated strong capabilities in text understanding and reasoning, their role in multimodal content analysis remains unresolved.

The authors note that existing large multimodal models can process only a small number of images at a time, whereas comics analysis requires longer-range context spanning multiple pages and persistent tracking of character identities. Their framework provides a first baseline, but the broader challenge of effectively applying LLMs to multimodal content analysis remains open.
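The iterative interplay described above, alternating between text-side speaker prediction and persistent character-identity tracking, can be sketched as follows. This is an illustrative outline only: all function names, data shapes, and the stub logic are assumptions for exposition, not the paper's actual framework or API; in practice, the prediction step would call an LLM and the detection step a vision pipeline.

```python
# Hypothetical sketch of an iterative multimodal loop for comics analysis.
# Every function here is an illustrative stub, not the authors' implementation.

def detect_dialogue(pages):
    """Stub: gather dialogue lines with visual metadata (e.g., nearby character region)."""
    return [line for page in pages for line in page["dialogue"]]

def predict_speakers(dialogue, character_ids):
    """Stub: an LLM would assign each dialogue line to a known character identity."""
    return [(line, character_ids.get(line["near_character"], "unknown"))
            for line in dialogue]

def update_identities(assignments, character_ids):
    """Stub: refine the persistent character-identity map from new assignments."""
    for line, name in assignments:
        if name != "unknown":
            character_ids.setdefault(line["near_character"], name)
    return character_ids

def iterative_analysis(pages, character_ids, rounds=2):
    """Alternate speaker prediction and identity refinement over several rounds."""
    dialogue = detect_dialogue(pages)
    assignments = []
    for _ in range(rounds):
        assignments = predict_speakers(dialogue, character_ids)
        character_ids = update_identities(assignments, character_ids)
    return assignments, character_ids
```

The key point the sketch illustrates is the feedback loop: speaker predictions refine the character-identity map, which in turn improves subsequent predictions across pages.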

References

Recent LLMs have shown great capability for text understanding and reasoning, while their application to multimodal content analysis is still an open problem.