Benchmarking Large Multimodal Models against Common Corruptions (2401.11943v1)

Published 22 Jan 2024 in cs.LG, cs.CL, cs.CR, cs.CV, and cs.MM

Abstract: This technical report aims to fill a deficiency in the assessment of large multimodal models (LMMs) by specifically examining the self-consistency of their outputs when subjected to common corruptions. We investigate the cross-modal interactions between text, image, and speech, encompassing four essential generation tasks: text-to-image, image-to-text, text-to-speech, and speech-to-text. We create a comprehensive benchmark, named MMCBench, that covers more than 100 popular LMMs (totally over 150 model checkpoints). A thorough evaluation under common corruptions is critical for practical deployment and facilitates a better understanding of the reliability of cutting-edge LMMs. The benchmarking code is available at https://github.com/sail-sg/MMCBench

Benchmarking Large Multimodal Models against Common Corruptions: A Critical Evaluation

The paper "Benchmarking Large Multimodal Models against Common Corruptions" presents an extensive evaluation framework, MMCBench, designed to scrutinize the robustness and self-consistency of Large Multimodal Models (LMMs). This investigation is primarily focused on examining how these models perform under various common corruptions across multiple modalities—text, image, and speech.

Core Contributions and Methodology

The paper addresses significant gaps in existing multimodal model assessments by introducing the MMCBench benchmark, which evaluates over 100 popular LMMs spanning more than 150 model checkpoints. The evaluation is conducted under the typical input corruptions encountered in practical deployment. The benchmark is built around four principal tasks reflecting cross-modal interactions: text-to-image, image-to-text, text-to-speech, and speech-to-text. This multifaceted approach allows for a comprehensive understanding of how well these models maintain self-consistency when their inputs are corrupted.
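To make the setup concrete, the sketch below shows one way an image-to-text case could be exercised: a common corruption (Gaussian noise here) is applied to the input image, and the same captioning model is queried on both the clean and corrupted versions. This is an illustrative sketch only, not the MMCBench code; `caption_model` is a hypothetical callable standing in for whichever image-to-text LMM is under test.

```python
import numpy as np
from PIL import Image


def gaussian_noise(image: Image.Image, severity: float = 0.1) -> Image.Image:
    """Apply zero-mean Gaussian noise; `severity` is the noise std on a [0, 1] scale."""
    arr = np.asarray(image).astype(np.float32) / 255.0
    noisy = arr + np.random.normal(0.0, severity, arr.shape)
    return Image.fromarray((np.clip(noisy, 0.0, 1.0) * 255).astype(np.uint8))


def caption_pair(image: Image.Image, caption_model) -> tuple[str, str]:
    """Caption the clean and the corrupted image with the same model (hypothetical callable)."""
    clean_caption = caption_model(image)
    corrupted_caption = caption_model(gaussian_noise(image, severity=0.2))
    return clean_caption, corrupted_caption
```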

To select meaningful examples for evaluation, the authors draw on large datasets such as LAION and Common Voice, choosing examples that are likely to exhibit large text-based semantic deviations when corrupted. Evaluation then measures either cross-modality self-consistency or output-only self-consistency, depending on whether reliable cross-modality similarity models exist for a given task.
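As an illustration of output-only self-consistency, the sketch below scores how far a model's textual output drifts when its input is corrupted, using cosine similarity between sentence embeddings. The choice of the sentence-transformers encoder is an assumption made here for illustration; MMCBench defines its own similarity measures.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Assumed encoder for illustration; not necessarily the one used by MMCBench.
encoder = SentenceTransformer("all-MiniLM-L6-v2")


def self_consistency(clean_output: str, corrupted_output: str) -> float:
    """Cosine similarity between outputs produced from the clean and corrupted inputs."""
    a, b = encoder.encode([clean_output, corrupted_output])
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# A low score flags an example whose semantics drift heavily under corruption,
# which is the kind of example the selection strategy favours.
print(self_consistency("a dog playing in the park",
                       "a blurry photo of an animal outdoors"))
```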

Evaluation and Results

The paper delivers extensive numerical analyses, providing a comparative study of various LMMs. Key results indicate that certain models, such as IF XL and Kandinsky 3 for text-to-image generation, demonstrate higher resilience to common corruptions. In the text-to-speech and speech-to-text settings, models built on robust architectures such as Conformer and Whisper remain strong at preserving transcription accuracy under audio corruptions.

Conversely, the benchmark also shows that larger models do not universally guarantee higher resilience to corruptions, particularly when tasked with generating realistic outputs from heavily corrupted inputs. These findings call for a more nuanced view of model design choices, suggesting that architecture and training paradigms can be as pivotal as model size in determining robustness.

Theoretical and Practical Implications

From a theoretical standpoint, the paper underscores the importance of treating corruption robustness as a fundamental aspect of LMM development. MMCBench serves as a tool for systematically identifying and addressing vulnerabilities in these models, making them more reliable for real-world applications.

On the practical side, consistent evaluation against MMCBench's metrics could guide the improvement and deployment readiness of multimodal models. For developers and researchers, the benchmark's extensive test suite offers a solid foundation for hardening models against the variability of real-world inputs.

Future Prospects

As the benchmark evolves, potential expansions could include more modalities, such as video, broadening the horizon for further exploration in LMM robustness. Furthermore, advancements in calculating more sensitive cross-modality similarities, particularly between speech and text, could yield enhanced insights into the subtleties of LMM performance under corruption.

In conclusion, this work constitutes an essential stride towards understanding and enhancing the robustness of LMMs. It establishes a framework that not only aids in the assessment of current models but also charts a path for more resilient multimodal learning systems in the future.

Authors (6)
  1. Jiawei Zhang
  2. Tianyu Pang
  3. Chao Du
  4. Yi Ren
  5. Bo Li
  6. Min Lin