GenAI Arena: An Open Evaluation Platform for Generative Models
The paper "GenAI Arena: An Open Evaluation Platform for Generative Models" addresses a pressing challenge in the domain of generative AI— the reliable and comprehensive evaluation of generative models. Generative models have seen rapid advancements, particularly in text-to-image, image editing, and text-to-video generation tasks. However, assessing their performance remains problematic due to the inadequacies of existing automatic metrics and the subjectivity of human judgments.
Key Contributions
The authors introduce GenAI-Arena, a novel evaluation platform that democratizes the evaluation process by leveraging community votes. The platform lets users directly compare the outputs of different generative models, collecting their pairwise preferences to provide a more accurate measure of model performance. Three tasks are supported on the platform:
- Text-to-Image Generation
- Image Editing
- Text-to-Video Generation
GenAI-Arena covers 27 distinct open-source generative models, the largest set yet evaluated side by side on a single platform. Over its first four months of operation, the platform has collected more than 6,000 votes, underpinning its rankings and evaluations.
Methodology
The methodology relies on a dynamic, interactive interface where users anonymously vote on side-by-side comparisons of generative model outputs. Votes are aggregated using the Elo rating system, adapted to the inherently subjective nature of human preferences for visual content, and model rankings are updated as votes accumulate to estimate relative performance.
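To make the ranking mechanism concrete, here is a minimal sketch of a standard Elo update applied to a single head-to-head vote. The K-factor of 32 and the treatment of ties are assumptions for illustration; the paper's exact variant may differ.

```python
# Standard Elo update for one pairwise vote (sketch; parameters are illustrative).
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(rating_a: float, rating_b: float, outcome: float, k: float = 32.0):
    """outcome: 1.0 if A wins the vote, 0.0 if B wins, 0.5 for a tie."""
    e_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (outcome - e_a)
    new_b = rating_b + k * ((1.0 - outcome) - (1.0 - e_a))
    return new_a, new_b

# Example: two models start at 1000; model A wins one user vote.
ra, rb = elo_update(1000.0, 1000.0, outcome=1.0)
print(ra, rb)  # A gains 16 points, B loses 16
```

Applying this update over the full vote stream produces the leaderboard ordering; how the paper handles ties and vote ordering effects is not detailed here.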
The paper also describes the release of GenAI-Bench, a cleaned dataset derived from the user voting data, intended to promote research into model-based evaluation metrics. By prompting multimodal models such as GPT-4o to mimic human voting, the authors measure how well model judgments correlate with human preferences. Notably, GPT-4o achieves a Pearson correlation of only 0.22 when assessing visual content quality, suggesting that current multimodal models still lag far behind human judgment.
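The correlation itself is straightforward to compute once both vote streams are encoded numerically. The sketch below assumes a hypothetical encoding (1 = left model wins, 0 = tie, -1 = right model wins); the paper's actual encoding and prompting setup are not specified in this summary.

```python
import numpy as np

# Hypothetical per-battle vote encodings: 1 = left model wins, 0 = tie, -1 = right wins.
human_votes = np.array([1, -1, 1, 0, 1, -1, 0, 1])
gpt4o_votes = np.array([1, 1, 0, 0, 1, -1, 1, -1])

# Pearson correlation between human and GPT-4o judgments.
r = np.corrcoef(human_votes, gpt4o_votes)[0, 1]
print(f"Pearson r = {r:.2f}")
```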
Results and Analysis
The leaderboard rankings yielded several insights:
- Text-to-Image Generation: Playground V2.5 and Playground V2 led in performance, significantly outperforming the baseline SDXL model.
- Image Editing: MagicBrush ranked highest, highlighting the efficacy of trained models over zero-shot approaches like Pix2PixZero and SDEdit.
- Text-to-Video Generation: T2V-Turbo emerged as the top performer, showing that approaches balancing computational efficiency with output quality can lead the leaderboard.
The winning-fraction heatmaps and case studies confirmed that user preferences capture subtle differences between model outputs, reinforcing the robustness of the platform's evaluation methodology. The analysis also noted that Elo ratings can be slightly biased by the imbalance between "easy" and "hard" matchups.
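For reference, a winning-fraction matrix of this kind can be derived directly from pairwise vote records. The sketch below assumes a simple list of (winner, loser) tuples with ties omitted; the platform's actual data schema and model names are hypothetical here.

```python
from collections import defaultdict
from itertools import product

# Hypothetical pairwise battle outcomes: (winner, loser) per user vote (ties omitted).
battles = [("PlaygroundV2.5", "SDXL"), ("SDXL", "PlaygroundV2"),
           ("PlaygroundV2.5", "PlaygroundV2"), ("PlaygroundV2.5", "SDXL")]

wins = defaultdict(int)    # wins[(a, b)]   = times a beat b
counts = defaultdict(int)  # counts[(a, b)] = times a and b were compared

for winner, loser in battles:
    wins[(winner, loser)] += 1
    counts[(winner, loser)] += 1
    counts[(loser, winner)] += 1

models = sorted({m for pair in counts for m in pair})
# Winning fraction of the row model over the column model (None if never compared).
matrix = {(a, b): (wins[(a, b)] / counts[(a, b)] if counts[(a, b)] else None)
          for a, b in product(models, models) if a != b}
print(matrix[("PlaygroundV2.5", "SDXL")])  # 1.0 in this toy example
```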
Implications and Future Work
The implications of this work are twofold. Practically, GenAI-Arena simplifies the comparison of multiple generative models, providing a reliable reference for both developers and users of generative AI. Theoretically, it offers a robust dataset, GenAI-Bench, to the research community, encouraging the development of better model-based evaluation metrics. The findings also underscore the limitations of current multimodal models as evaluators of visual content, pointing toward the need for more capable evaluative AI systems.
In future work, continued data collection will improve the leaderboard's accuracy and keep it reflective of the latest generative models. There is also scope for developing more sophisticated multimodal large language models (MLLMs) that better mirror human judgments.
Conclusion
The paper underscores the necessity for effective evaluation strategies in generative AI. By introducing GenAI-Arena, it addresses a critical gap, providing a platform that integrates user feedback into model evaluations and establishes clear performance benchmarks. The results highlight both the capabilities and limitations of current evaluation metrics and models, offering pathways for future advancements.