
Can Large Language Models Help Multimodal Language Analysis? MMLA: A Comprehensive Benchmark (2504.16427v2)

Published 23 Apr 2025 in cs.CL, cs.AI, and cs.MM

Abstract: Multimodal language analysis is a rapidly evolving field that leverages multiple modalities to enhance the understanding of high-level semantics underlying human conversational utterances. Despite its significance, little research has investigated the capability of multimodal LLMs (MLLMs) to comprehend cognitive-level semantics. In this paper, we introduce MMLA, a comprehensive benchmark specifically designed to address this gap. MMLA comprises over 61K multimodal utterances drawn from both staged and real-world scenarios, covering six core dimensions of multimodal semantics: intent, emotion, dialogue act, sentiment, speaking style, and communication behavior. We evaluate eight mainstream branches of LLMs and MLLMs using three methods: zero-shot inference, supervised fine-tuning, and instruction tuning. Extensive experiments reveal that even fine-tuned models achieve only about 60%~70% accuracy, underscoring the limitations of current MLLMs in understanding complex human language. We believe that MMLA will serve as a solid foundation for exploring the potential of LLMs in multimodal language analysis and provide valuable resources to advance this field. The datasets and code are open-sourced at https://github.com/thuiar/MMLA.

A Benchmark Evaluating the Effectiveness of LLMs in Multimodal Language Analysis

The paper "Can LLMs Help Multimodal Language Analysis? MMLA: A Comprehensive Benchmark" presents a significant step forward in understanding the capabilities of LLMs and multimodal LLMs (MLLMs) for multimodal language analysis. The authors introduce the MMLA benchmark, designed to evaluate the capacity of these models to interpret cognitive-level semantics across multiple dimensions of human conversation. The benchmark features six core dimensions: intent, emotion, dialogue act, sentiment, speaking style, and communication behavior, all represented within a substantial dataset of over 61K annotated multimodal utterances.

Core Contributions of MMLA

The MMLA benchmark is a pioneering tool for systematically assessing the competency of foundation models, especially MLLMs, on complex semantic tasks. MMLA provides a detailed evaluation of eight mainstream branches of LLMs and MLLMs using three methods: zero-shot inference, supervised fine-tuning (SFT), and instruction tuning (IT). The authors underline that even fine-tuned models currently reach only 60-70% accuracy, highlighting the limitations of existing models and the room left for improvement.
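
To make the zero-shot setting concrete, the sketch below shows how evaluation on one dimension (intent) could proceed: build a prompt from an utterance, query a model, and score the prediction against the gold label. The label set, file layout, and `query_mllm` function are illustrative assumptions, not the benchmark's released schema or API.

```python
# Minimal sketch of zero-shot evaluation on one MMLA dimension (intent).
# Labels, fields, and query_mllm are hypothetical placeholders for illustration.
from dataclasses import dataclass

INTENT_LABELS = ["complain", "praise", "apologize", "inform", "ask for help"]  # illustrative subset

@dataclass
class Utterance:
    video_path: str   # non-verbal modality (frames/audio handled by the backend)
    transcript: str   # verbal modality
    label: str        # gold annotation

def build_prompt(utt: Utterance, labels: list[str]) -> str:
    """Zero-shot prompt: present the utterance and ask the model to pick one label."""
    return (
        "You are given a speaker's utterance from a conversation.\n"
        f'Transcript: "{utt.transcript}"\n'
        f"Choose the speaker's intent from: {', '.join(labels)}.\n"
        "Answer with the label only."
    )

def query_mllm(prompt: str, video_path: str) -> str:
    """Hypothetical model call; replace with the actual LLM/MLLM inference backend."""
    return INTENT_LABELS[0]  # stub so the sketch runs end to end

def zero_shot_accuracy(dataset: list[Utterance]) -> float:
    correct = 0
    for utt in dataset:
        pred = query_mllm(build_prompt(utt, INTENT_LABELS), utt.video_path).strip().lower()
        correct += int(pred == utt.label.lower())
    return correct / len(dataset)

if __name__ == "__main__":
    demo = [Utterance("clip_001.mp4", "I can't believe you forgot again.", "complain")]
    print(f"zero-shot accuracy: {zero_shot_accuracy(demo):.2%}")
```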

Evaluation Results

Experiments employing SFT reveal that MLLMs significantly outperform LLMs by integrating both verbal and non-verbal modalities, achieving new state-of-the-art performance on most MMLA tasks. However, the models still do not exceed 70% accuracy on average, indicating considerable room for advancement.

Zero-shot inference shows negligible differences between LLMs and MLLMs of the same parameter scale. Notably, smaller models, particularly the 8B MiniCPM-V-2.6, performed competitively against larger ones, underscoring the potential of efficiently designed MLLMs. IT further demonstrates that a single unified model trained across all tasks can achieve competitive results on multimodal language tasks, supporting the viability of smaller models for robust multimodal language analysis.
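
As a rough illustration of how a unified instruction-tuned model could be trained across the six dimensions, the sketch below packs annotated utterances from different tasks into a shared (instruction, input, output) record format. The instruction wording and field names are assumptions for illustration and do not reflect the benchmark's actual released data format.

```python
# Sketch of assembling instruction-tuning records for one unified model across
# the six MMLA dimensions. Field names and instruction text are hypothetical.
import json

TASK_INSTRUCTIONS = {
    "intent": "Identify the speaker's intent.",
    "emotion": "Identify the emotion expressed by the speaker.",
    "dialogue_act": "Identify the dialogue act of the utterance.",
    "sentiment": "Identify the sentiment of the utterance.",
    "speaking_style": "Identify the speaking style of the utterance.",
    "communication_behavior": "Identify the communication behavior of the speaker.",
}

def to_instruction_sample(task: str, transcript: str, video: str, label: str) -> dict:
    """Pack one annotated utterance into an (instruction, input, output) training record."""
    return {
        "instruction": TASK_INSTRUCTIONS[task],
        "input": {"transcript": transcript, "video": video},
        "output": label,
    }

samples = [
    to_instruction_sample("emotion", "That's wonderful news!", "clip_017.mp4", "joy"),
    to_instruction_sample("dialogue_act", "Could you pass the salt?", "clip_042.mp4", "request"),
]
print(json.dumps(samples, indent=2))
```

Training on a mixture of such records from all six dimensions is what lets one model handle every task with a single set of weights, rather than maintaining one fine-tuned model per dimension.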

Implications and Future Directions

The paper positions MMLA as an essential resource for future studies in cognitive-level AI. The authors suggest that although MLLMs have shown promising performance, existing models require improved architectures and deeper semantic understanding to address the intrinsic intricacies of multimodal language analysis. MMLA lays the groundwork for research on cognitive-level semantic understanding and provides a foundation for developing AI systems that more closely emulate human-like interpretation and assistive interaction.

The authors point to avenues such as enhanced model architectures, improved alignment techniques between modalities, and high-quality, large-scale datasets to enrich model training. These efforts could eventually bring MLLMs close to human-level performance on the complex semantics that arise when visual, textual, and acoustic signals are combined.

In conclusion, the paper elucidates the importance and difficulty of the proposed benchmark while establishing concrete directions for improving multimodal language understanding. MMLA thus serves both as a pivotal evaluative framework and as an impetus for future innovation in multimodal language AI.

Authors (8)
  1. Hanlei Zhang (13 papers)
  2. Zhuohang Li (24 papers)
  3. Yeshuang Zhu (8 papers)
  4. Hua Xu (78 papers)
  5. Peiwu Wang (1 paper)
  6. Jinchao Zhang (49 papers)
  7. Jie Zhou (687 papers)
  8. Haige Zhu (1 paper)