- The paper introduces GPT-ImgEval, a comprehensive benchmark for evaluating GPT-4o’s image generation, editing, and semantic synthesis capabilities.
- It employs quantitative metrics like FID, IS, and CLIP Score alongside human evaluations to assess image quality and editing precision.
- The study uncovers insights into GPT-4o’s hybrid AR-diffusion architecture and outlines limitations in current forensic detection techniques.
This technical report introduces GPT-ImgEval (arXiv:2504.02782), a benchmark designed for the quantitative and qualitative assessment of OpenAI's GPT-4o model on image generation tasks. The evaluation focuses on three primary dimensions: the quality of generated images, proficiency in image editing, and the ability to synthesize images informed by world knowledge.
Benchmark Design and Evaluation Dimensions
The GPT-ImgEval benchmark systematically evaluates GPT-4o across distinct facets of image generation and manipulation.
- Generation Quality: This dimension assesses the fundamental image synthesis capabilities. It likely involves evaluating aspects such as photorealism, coherence, aesthetic appeal, and adherence to prompt specifications for de novo image creation. Standard image quality metrics (e.g., FID, IS, CLIP Score) might be employed alongside human evaluation for subjective qualities (a minimal CLIP Score sketch follows this list).
- Editing Proficiency: This dimension probes GPT-4o's ability to modify existing images based on instructions. Tasks could range from simple object additions/removals or style transfers to complex compositional changes. Evaluation would focus on the accuracy of the edit, the preservation of unchanged regions (a sketch of one possible preservation check also follows this list), and the overall quality of the resulting image. The report specifically mentions a comparative study of multi-round image editing against Gemini 2.0 Flash, suggesting that iterative refinement capabilities are tested.
- World Knowledge-Informed Semantic Synthesis: This dimension evaluates the model's capacity to integrate factual or conceptual knowledge into the image generation process. Examples might include generating images depicting specific historical events, scientific concepts, or culturally specific scenarios, requiring the model to access and accurately represent underlying knowledge. Success is measured by the semantic correctness and plausibility of the generated image relative to the knowledge embedded in the prompt.
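To make the metric concrete, here is a minimal CLIP Score sketch using Hugging Face transformers. This is an illustrative assumption; the report does not specify its metric implementation or model choice.

```python
# Hedged sketch: CLIP Score as the cosine similarity between CLIP image and
# text embeddings. The checkpoint choice is an assumption, not from the paper.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, prompt: str) -> float:
    inputs = processor(text=[prompt], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        img = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    # Cosine similarity in [-1, 1]; the commonly reported CLIP Score
    # scales this by 100.
    return (img @ txt.T).item()
```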
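Similarly, one plausible preservation check for editing (an assumed formulation, not taken from the paper) is to measure pixel change restricted to regions the edit was not supposed to touch:

```python
# Hedged sketch of an edit-locality check: mean absolute pixel difference
# outside the editable region. `mask` is an assumed per-pixel map
# (1 = editable, 0 = must be preserved).
import numpy as np

def preservation_error(original: np.ndarray, edited: np.ndarray,
                       mask: np.ndarray) -> float:
    keep = mask == 0  # boolean index selecting pixels that must not change
    diff = original[keep].astype(np.float32) - edited[keep].astype(np.float32)
    return float(np.abs(diff).mean())
```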
Across the evaluated dimensions, GPT-4o demonstrates strong performance, reportedly surpassing existing methods in both the control it exerts over the generation process and the quality of its final image outputs. The model also shows notable knowledge-reasoning ability in semantic synthesis. The benchmark results quantify these advantages, providing a baseline for future model comparisons.
The report also identifies and visualizes the limitations and common failure modes of GPT-4o's image generation, including an analysis of the specific types of synthetic artifacts present in its outputs. This is crucial for understanding the model's current shortcomings and for guiding future improvements in diffusion and generative model training.
Architecture Investigation and Speculation
A significant contribution of the work is the methodology proposed for inferring aspects of GPT-4o's internal architecture. Based on the characteristics of the generated data, the authors employ a classification-model-based approach to probe the underlying generation mechanism.
Their empirical findings suggest that GPT-4o likely utilizes a hybrid architecture: an auto-regressive (AR) component combined with a diffusion-based head for image decoding. This contrasts with purely token-based alternatives such as VQ-VAE decoders or Visual AutoRegressive (VAR)-style models. The AR component might handle sequence modeling or high-level planning, while the diffusion head refines the output into a high-resolution image.
A runnable rendering of this probing procedure is sketched below; scikit-learn stands in for the unspecified classifier, and `extract_features` plus the image lists `I_ar`, `I_diff`, `I_var`, and `I_gpt4o` are assumptions for illustration:

```python
# Classification-based architecture probing: train a classifier to separate
# images from known AR, Diffusion, and VAR models, then inspect which family
# GPT-4o outputs are assigned to. Per-family binary (one-vs-rest) classifiers
# can be trained analogously for a finer-grained analysis.
import numpy as np
from sklearn.linear_model import LogisticRegression

def extract_features(images):
    # Assumed helper: one fixed-length feature vector per image (e.g. from a
    # frozen vision backbone); flattened pixels are used here for illustration.
    return np.asarray([np.asarray(img, dtype=np.float32).ravel() for img in images])

# Inputs: I_ar, I_diff, I_var are images from known AR / Diffusion / VAR
# models; I_gpt4o are images generated by GPT-4o.
X = np.vstack([extract_features(I_ar),
               extract_features(I_diff),
               extract_features(I_var)])
y = ["AR"] * len(I_ar) + ["Diffusion"] * len(I_diff) + ["VAR"] * len(I_var)

clf = LogisticRegression(max_iter=1000).fit(X, y)

# Which known family do GPT-4o outputs most resemble?
predictions = clf.predict(extract_features(I_gpt4o))
labels, counts = np.unique(predictions, return_counts=True)
print(dict(zip(labels, counts.tolist())))
```
Based on these findings, the report provides a complete, albeit speculative, overview of the potential GPT-4o architecture, integrating multi-modal inputs, the AR component, and the diffusion decoder head.
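A deliberately toy PyTorch sketch of that hypothesized layout is shown below; every module name and interface here is illustrative speculation, not the actual GPT-4o implementation:

```python
# Speculative hybrid AR + diffusion-head layout, per the report's hypothesis.
# Both sub-modules are placeholders supplied by the caller.
import torch
import torch.nn as nn

class HybridARDiffusion(nn.Module):
    def __init__(self, ar_backbone: nn.Module, diffusion_head: nn.Module):
        super().__init__()
        # AR transformer: plans the image as a latent token sequence.
        self.ar_backbone = ar_backbone
        # Diffusion head: decodes/refines the token sequence into pixels.
        self.diffusion_head = diffusion_head

    def forward(self, prompt_tokens: torch.Tensor) -> torch.Tensor:
        latent_tokens = self.ar_backbone(prompt_tokens)  # high-level planning
        return self.diffusion_head(latent_tokens)        # pixel-space decoding
```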
Safety and Forensics Implications
The paper addresses the safety aspects of GPT-4o's image generation capabilities. It investigates the detectability of GPT-4o-generated images using existing image forensic models. This analysis is pertinent given the increasing sophistication of synthetic media and the potential for misuse. The findings shed light on whether current forensic techniques are sufficient to identify images produced by state-of-the-art models like GPT-4o, highlighting potential gaps in detection mechanisms.
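As a sketch of how such a check might be scripted (assumed scaffolding; `detector` stands in for any existing forensic model that returns a synthetic-probability score):

```python
# Hedged sketch: flag rate of GPT-4o images under an off-the-shelf
# real-vs-synthetic detector. The detector itself is a placeholder.
from typing import Any, Callable, Iterable

def detection_rate(images: Iterable[Any],
                   detector: Callable[[Any], float],
                   threshold: float = 0.5) -> float:
    # Fraction of images the detector scores as synthetic above `threshold`.
    scores = [detector(img) for img in images]
    return sum(s >= threshold for s in scores) / max(len(scores), 1)
```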
Resources
The authors have made the code and datasets used for the GPT-ImgEval benchmark publicly available, facilitating reproducibility and further research by the community. These resources can be found at the following repository: https://github.com/PicoTrex/GPT-ImgEval
Conclusion
GPT-ImgEval provides a structured benchmark for evaluating the image generation and editing capabilities of GPT-4o. The report offers quantitative performance results, insights into the model's limitations and artifacts, and a novel data-driven approach to inferring its underlying architecture, suggesting a hybrid AR-Diffusion model. Furthermore, it initiates a discussion on the safety and forensic detectability of outputs from advanced generative models.