Introduction
In the field of computational creativity, text-to-3D generation stands out as a vibrant domain where the written word is transformed into three-dimensional visual artifacts. Despite the striking progress in this field, evaluating the output of text-to-3D generative methods has remained a challenge. The common practice involves assessing models on a narrow set of criteria or through labor-intensive user studies. A recent framework, leveraging the capabilities of Large Multimodal Models (LMMs), aims to address these constraints by offering an automated, comprehensive, and human-aligned evaluation metric.
The Advent of GPT-4V
LMMs, especially GPT-4V, are at the heart of this innovative framework. Able to process both text and visual input, GPT-4V can interpret human intent and carry out 3D reasoning tasks. By combining descriptive text prompts with renderings of the generated 3D models, it assesses text-to-3D generative models against customizable criteria, reflecting a human-like sense of quality and aesthetics.
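To make the mechanism concrete, here is a minimal sketch of how such a pairwise query might be assembled. Everything here is an assumption for illustration: `query_lmm` is a hypothetical stand-in for a real multimodal API call, and the prompt wording is not the framework's actual template.

```python
def build_comparison_prompt(text_prompt, criterion):
    """Compose an illustrative instruction asking the LMM to judge two
    3D models (shown via attached renderings) on one criterion."""
    return (
        f"Two 3D models were generated from the prompt: '{text_prompt}'.\n"
        "Renderings of model A and model B are attached.\n"
        f"Judge which model is better on this criterion: {criterion}.\n"
        "Answer 'A', 'B', or 'tie'."
    )

def compare_models(text_prompt, renders_a, renders_b, criterion, query_lmm):
    # `query_lmm` is a hypothetical callable wrapping the multimodal API;
    # it receives the instruction and the combined set of renderings.
    prompt = build_comparison_prompt(text_prompt, criterion)
    return query_lmm(prompt, images=renders_a + renders_b)
```

In practice the framework would issue many such queries, varying the text prompt, the criterion, and the model pair.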
Transformative Evaluation Metrics
This framework introduces a 'meta-prompt' system that generates text prompts tailored to specific evaluation needs, ensuring that the diversity and evolving nature of human preferences are captured. Furthermore, it devises a method for GPT-4V to compare pairs of 3D models against user-defined evaluation criteria, much as an examiner grades answers. The pairwise comparison outcomes are then used to rank models with an Elo rating system adapted for text-to-3D tasks, yielding a scalable and systematic evaluation approach.
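The Elo aggregation step can be sketched in a few lines. This is a minimal, generic Elo update, not the paper's exact variant; the K-factor, base rating, and toy comparison data are all illustrative assumptions.

```python
from collections import defaultdict

def elo_ratings(comparisons, k=32, base=1000.0):
    """Compute Elo ratings from pairwise outcomes.

    `comparisons` is a list of (model_a, model_b, score) tuples, where
    score is 1.0 if model_a wins, 0.0 if model_b wins, 0.5 for a tie.
    """
    ratings = defaultdict(lambda: base)
    for a, b, score in comparisons:
        # Expected score of `a` against `b` under the logistic Elo model.
        expected_a = 1.0 / (1.0 + 10 ** ((ratings[b] - ratings[a]) / 400.0))
        ratings[a] += k * (score - expected_a)
        ratings[b] += k * ((1.0 - score) - (1.0 - expected_a))
    return dict(ratings)

# Toy outcomes: model_x beats model_y twice and ties once.
results = [("model_x", "model_y", 1.0),
           ("model_x", "model_y", 1.0),
           ("model_x", "model_y", 0.5)]
ranked = sorted(elo_ratings(results).items(), key=lambda kv: -kv[1])
```

Because each update is zero-sum, the total rating mass is conserved, and models can be ranked simply by sorting their final scores.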
Empirical Insights
Empirical results have established that this framework aligns closely with human preferences, achieving higher alignment across varied criteria than existing metrics. It not only benefits from the rich language understanding of LMMs but also exemplifies how such models can be steered effectively through meticulously crafted prompts and images, which act as bridges between textual descriptions and three-dimensional interpretations.
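One simple way to quantify such alignment is the fraction of pairwise comparisons on which the automated metric picks the same winner as human annotators; the paper may well use other correlation measures, so the helper and toy data below are illustrative assumptions only.

```python
def pairwise_agreement(metric_choices, human_choices):
    """Fraction of pairwise comparisons on which the automated metric
    agrees with the human-annotated winner ('A', 'B', or 'tie')."""
    assert len(metric_choices) == len(human_choices) and metric_choices
    matches = sum(m == h for m, h in zip(metric_choices, human_choices))
    return matches / len(metric_choices)

# Toy data: four comparisons, the metric disagrees with humans once.
score = pairwise_agreement(["A", "B", "A", "A"], ["A", "B", "B", "A"])
```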
In summary, this framework marks a significant step in evaluating text-to-3D models, offering an assessment that is both wide-ranging and attuned to human judgment. It is a testament to the versatility and potential of using LMMs like GPT-4V in non-traditional applications. The provided open-source code encourages further advancements, ensuring robust evaluation metrics keep pace with the rapid evolution in text-to-3D generative models.