Applying Large Language Models to Multimodal Content Analysis
Determine effective approaches for applying large language models (LLMs) to multimodal content analysis that integrates textual and visual inputs, and establish whether and how LLMs can analyze such content reliably.
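As a minimal illustration of one such approach (not the cited paper's iterative multimodal fusion method), the sketch below reduces the visual input to text, via OCR output and an image caption, and asks an LLM to reason jointly over both signals. It assumes an OpenAI-compatible chat-completions client; the model name, helper inputs, and prompt wording are placeholders chosen for illustration.

```python
# Sketch: LLM-based analysis of fused textual + visual signals.
# Assumes the openai Python package and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()  # reads credentials from the environment


def analyze_multimodal_content(ocr_text: str, image_caption: str) -> str:
    """Ask an LLM to reason over text extracted from an image (e.g., speech-bubble
    OCR) together with a textual description of the visual content."""
    prompt = (
        "You are analyzing a page that combines images and text.\n"
        f"Text found in the image (OCR): {ocr_text}\n"
        f"Description of the visual content: {image_caption}\n"
        "Identify who is likely speaking each line and summarize the scene."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model name; any chat-capable model works
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    # Placeholder inputs standing in for real OCR and captioning outputs.
    print(analyze_multimodal_content(
        ocr_text='"We have to leave before sunrise." / "I am not ready yet."',
        image_caption="Two characters argue in a dimly lit room; one points at a map.",
    ))
```

In practice, the OCR text and caption would come from upstream vision models, and the prompt would be iterated with the LLM's own outputs when richer fusion is needed.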
References
Recent LLMs have shown great capability for text understanding and reasoning, while their application to multimodal content analysis is still an open problem.
— Zero-Shot Character Identification and Speaker Prediction in Comics via Iterative Multimodal Fusion (Li et al., arXiv:2404.13993, 22 Apr 2024), Abstract