
UniEval: Unified Holistic Evaluation for Unified Multimodal Understanding and Generation (2505.10483v1)

Published 15 May 2025 in cs.CV and cs.AI

Abstract: The emergence of unified multimodal understanding and generation models is rapidly attracting attention because of their ability to enhance instruction-following capabilities while minimizing model redundancy. However, there is a lack of a unified evaluation framework for these models, which would enable an elegant, simplified, and overall evaluation. Current models conduct evaluations on multiple task-specific benchmarks, but there are significant limitations, such as the lack of overall results, errors from extra evaluation models, reliance on extensive labeled images, benchmarks that lack diversity, and metrics with limited capacity for instruction-following evaluation. To tackle these challenges, we introduce UniEval, the first evaluation framework designed for unified multimodal models without extra models, images, or annotations. This facilitates a simplified and unified evaluation process. The UniEval framework contains a holistic benchmark, UniBench (supports both unified and visual generation models), along with the corresponding UniScore metric. UniBench includes 81 fine-grained tags contributing to high diversity. Experimental results indicate that UniBench is more challenging than existing benchmarks, and UniScore aligns closely with human evaluations, surpassing current metrics. Moreover, we extensively evaluated SoTA unified and visual generation models, uncovering new insights into UniEval's unique values.


Summary

UniEval: Unified Evaluation Framework for Multimodal Models

The paper "UniEval: Unified Holistic Evaluation for Unified Multimodal Understanding and Generation" introduces a novel evaluation framework designed specifically for the emerging field of unified multimodal models. These models integrate multiple modalities, such as text and image, to enhance instruction-following capabilities while reducing redundancy in model design. Despite their promise, these models have faced challenges in terms of evaluation, primarily due to the lack of a simplified and comprehensive framework. Current methods rely on several task-specific benchmarks that often complicate cross-task comparisons and make it difficult to obtain standardized results. UniEval addresses these limitations and provides a coherent evaluation framework without the reliance on extra models, labeled images, or annotations.

The core components of UniEval are UniBench, a diverse and challenging benchmark, and the UniScore metric, which aligns closely with human evaluations. UniBench distinguishes itself by offering a wide array of 81 fine-grained tags, significantly exceeding the diversity of existing benchmarks. This diversity is crucial for assessing advanced models capable of complex instruction-following tasks. UniScore, designed for both case-level and tag-level accuracy analysis, shows a strong correlation with human assessment scores, outperforming existing metrics such as CLIPScore and VQAScore.
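The summary does not give UniScore's exact formula, so the following is only a minimal Python sketch of one plausible way to aggregate per-case correctness into the case-level and tag-level accuracies described above. All names here (CaseResult, case_level_scores, tag_level_scores, overall_score) and the equal-weight averaging over tags are illustrative assumptions, not the paper's implementation.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class CaseResult:
    """One benchmark case: its fine-grained tags and per-probe correctness."""
    tags: list[str]      # e.g. ["color", "spatial relation"] (hypothetical tag names)
    answers: list[bool]  # True where the model answered a probe question correctly

def case_level_scores(results: list[CaseResult]) -> list[float]:
    # Case-level accuracy: fraction of probe questions answered correctly per case.
    return [sum(r.answers) / len(r.answers) for r in results]

def tag_level_scores(results: list[CaseResult]) -> dict[str, float]:
    # Tag-level accuracy: average case score over all cases carrying a given tag.
    buckets: dict[str, list[float]] = defaultdict(list)
    for result, score in zip(results, case_level_scores(results)):
        for tag in result.tags:
            buckets[tag].append(score)
    return {tag: sum(scores) / len(scores) for tag, scores in buckets.items()}

def overall_score(results: list[CaseResult]) -> float:
    # One possible overall score: an unweighted mean over tag-level scores,
    # so rare tags count as much as common ones. This weighting is an assumption.
    per_tag = tag_level_scores(results)
    return sum(per_tag.values()) / len(per_tag)
```

Tag-level aggregation of this kind would make it easy to see which of the 81 fine-grained capabilities a model fails on, rather than reporting only a single opaque number.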

Experimental results demonstrate that UniBench is more challenging than existing benchmarks, with its error rates indicating substantial room for model improvement. Furthermore, UniScore's correlation with human evaluations underscores its potential as a reliable metric for instruction-following tasks. Evaluations conducted on state-of-the-art unified models reveal unique insights, such as self-consistency in understanding generated images and the impact of resolution on evaluation scores, paving the way for more nuanced model analyses.
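The summary states that UniScore correlates with human evaluations more strongly than CLIPScore or VQAScore, but does not name the correlation statistic used. Below is a minimal sketch of how such metric-to-human alignment is commonly measured, assuming Pearson and Spearman correlations over per-case scores; the function name human_alignment is hypothetical.

```python
from scipy.stats import pearsonr, spearmanr

def human_alignment(metric_scores: list[float],
                    human_scores: list[float]) -> dict[str, float]:
    """Correlate a metric's per-case scores with human ratings of the same cases."""
    pearson, _ = pearsonr(metric_scores, human_scores)    # linear agreement
    spearman, _ = spearmanr(metric_scores, human_scores)  # rank-order agreement
    return {"pearson": pearson, "spearman": spearman}
```

Given aligned lists of metric scores and human ratings over the same benchmark cases, the metric with the higher correlation is the one that tracks human judgment more closely, which is the sense in which the paper reports UniScore surpassing existing metrics.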

The implications of UniEval are significant for both practical and theoretical AI research. Practically, the framework simplifies evaluation by removing dependencies on external models, images, and annotations, making testing of multimodal models more accessible and integrated. Theoretically, UniEval challenges the research community to develop models that leverage integrated evaluation of both understanding and generation, fostering advances in generative AI. Looking forward, the framework sets a foundational standard that could extend to more complex modalities, including video and audio, positioning UniEval as a pivotal tool for advancing both the capabilities and the assessment of unified multimodal models, helping ensure they fulfil their potential in multimodal comprehension and generation tasks.
