Introduction
In the field of computational creativity, text-to-3D generation stands out as a vibrant domain where the written word is transformed into three-dimensional visual artifacts. Despite the striking progress in this field, evaluating the output of text-to-3D generative methods has remained a challenge. The common practice involves assessing models on a narrow set of criteria or through labor-intensive user studies. A recent framework, leveraging the capabilities of Large Multimodal Models (LMMs), aims to address these constraints by offering an automated, comprehensive, and human-aligned evaluation metric.
The Advent of GPT-4V
LMMs, especially GPT-4V, are at the heart of this innovative framework. Able to process both text and visual input, GPT-4V can interpret human intent and carry out 3D reasoning tasks. By combining descriptive text prompts with renderings of the generated 3D models, it assesses text-to-3D generative models against customizable criteria, reflecting a human-like sense of quality and aesthetics.
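To make the mechanism concrete, here is a minimal sketch of how such a pairwise query might be assembled. Everything here is an assumption for illustration: `query_lmm` is a hypothetical stand-in for a real multimodal API call, and the prompt wording is not the framework's actual template.

```python
def build_comparison_prompt(text_prompt, criterion):
    """Compose an illustrative instruction asking the LMM to judge two
    3D models (shown via attached renderings) on one criterion."""
    return (
        f"Two 3D models were generated from the prompt: '{text_prompt}'.\n"
        "Renderings of model A and model B are attached.\n"
        f"Judge which model is better on this criterion: {criterion}.\n"
        "Answer 'A', 'B', or 'tie'."
    )

def compare_models(text_prompt, renders_a, renders_b, criterion, query_lmm):
    # `query_lmm` is a hypothetical callable wrapping the multimodal API;
    # it receives the instruction and the combined set of renderings.
    prompt = build_comparison_prompt(text_prompt, criterion)
    return query_lmm(prompt, images=renders_a + renders_b)
```

In practice the framework would issue many such queries, varying the text prompt, the criterion, and the model pair.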
Transformative Evaluation Metrics
This framework introduces a 'meta-prompt' system that generates text prompts tailored to specific evaluation needs, ensuring that the diversity and evolving nature of human preferences are captured. Furthermore, it devises a method for GPT-4V to compare pairs of 3D models against user-defined evaluation criteria, much as an examiner grades answers. The pairwise comparison outcomes are then used to rank models with an Elo rating system adapted for text-to-3D tasks, yielding a scalable and systematic evaluation approach.
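The Elo aggregation step can be sketched in a few lines. This is a minimal, generic Elo update, not the paper's exact variant; the K-factor, base rating, and toy comparison data are all illustrative assumptions.

```python
from collections import defaultdict

def elo_ratings(comparisons, k=32, base=1000.0):
    """Compute Elo ratings from pairwise outcomes.

    `comparisons` is a list of (model_a, model_b, score) tuples, where
    score is 1.0 if model_a wins, 0.0 if model_b wins, 0.5 for a tie.
    """
    ratings = defaultdict(lambda: base)
    for a, b, score in comparisons:
        # Expected score of `a` against `b` under the logistic Elo model.
        expected_a = 1.0 / (1.0 + 10 ** ((ratings[b] - ratings[a]) / 400.0))
        ratings[a] += k * (score - expected_a)
        ratings[b] += k * ((1.0 - score) - (1.0 - expected_a))
    return dict(ratings)

# Toy outcomes: model_x beats model_y twice and ties once.
results = [("model_x", "model_y", 1.0),
           ("model_x", "model_y", 1.0),
           ("model_x", "model_y", 0.5)]
ranked = sorted(elo_ratings(results).items(), key=lambda kv: -kv[1])
```

Because each update is zero-sum, the total rating mass is conserved, and models can be ranked simply by sorting their final scores.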
Empirical Insights
Empirical results have established that this framework aligns closely with human preferences, achieving higher alignment across varied criteria than existing metrics. It not only benefits from the rich language understanding of LMMs but also exemplifies how such models can be steered effectively through meticulously crafted prompts and images, which act as bridges between textual descriptions and three-dimensional interpretations.
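One simple way to quantify such alignment is the fraction of pairwise comparisons on which the automated metric picks the same winner as human annotators; the paper may well use other correlation measures, so the helper and toy data below are illustrative assumptions only.

```python
def pairwise_agreement(metric_choices, human_choices):
    """Fraction of pairwise comparisons on which the automated metric
    agrees with the human-annotated winner ('A', 'B', or 'tie')."""
    assert len(metric_choices) == len(human_choices) and metric_choices
    matches = sum(m == h for m, h in zip(metric_choices, human_choices))
    return matches / len(metric_choices)

# Toy data: four comparisons, the metric disagrees with humans once.
score = pairwise_agreement(["A", "B", "A", "A"], ["A", "B", "B", "A"])
```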
In summary, this framework marks a significant step in evaluating text-to-3D models, offering an assessment that is both wide-ranging and attuned to human judgment. It is a testament to the versatility and potential of using LMMs like GPT-4V in non-traditional applications. The provided open-source code encourages further advancements, ensuring robust evaluation metrics keep pace with the rapid evolution in text-to-3D generative models.