WorldScore: A Unified Evaluation Benchmark for World Generation (2504.00983v1)

Published 1 Apr 2025 in cs.GR, cs.AI, and cs.CV

Abstract: We introduce the WorldScore benchmark, the first unified benchmark for world generation. We decompose world generation into a sequence of next-scene generation tasks with explicit camera trajectory-based layout specifications, enabling unified evaluation of diverse approaches from 3D and 4D scene generation to video generation models. The WorldScore benchmark encompasses a curated dataset of 3,000 test examples that span diverse worlds: static and dynamic, indoor and outdoor, photorealistic and stylized. The WorldScore metrics evaluate generated worlds through three key aspects: controllability, quality, and dynamics. Through extensive evaluation of 19 representative models, including both open-source and closed-source ones, we reveal key insights and challenges for each category of models. Our dataset, evaluation code, and leaderboard can be found at https://haoyi-duan.github.io/WorldScore/

Summary

Evaluation and Insights from the WorldScore Benchmark for World Generation

The paper introduces WorldScore, a unified benchmark for evaluating world generation across diverse modeling paradigms, notably 3D, 4D, text-to-video (T2V), and image-to-video (I2V) models. It responds to the expanding scope of world generation with a structured evaluation framework that covers both static and dynamic worlds.

Key Contributions

Unified Evaluation Framework: WorldScore decomposes world generation into a sequence of next-scene generation tasks. Each task is specified by a triplet: the current scene, a description of the target next scene, and a camera layout. This design yields a consistent output format, which allows fair comparison across otherwise disparate modeling techniques.
Comprehensive Dataset: WorldScore provides a curated dataset of 3,000 test examples. It covers static and dynamic worlds, indoor and outdoor scenes, and visual styles ranging from photorealistic to stylized. This diversity is needed to assess how well models generate complex and varied worlds.
Evaluation Metrics: WorldScore uses ten metrics grouped under three axes: controllability, quality, and dynamics. Together they measure adherence to control inputs (such as camera directions), fidelity of scene content, and stability of dynamic simulations.
Benchmarking of Existing Models: The authors evaluate 19 world generation models, exposing the strengths and limitations of current methods. A central finding is that 3D scene generation models outperform video generation models at static world generation.
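
The next-scene decomposition and the per-axis grouping of metrics described above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's actual code: the class name NextSceneTask, the functions decompose_world and aggregate_by_axis, the camera labels, and the use of a plain mean within each axis are all assumptions made for the example.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class NextSceneTask:
    """One generation step: (current scene, next-scene description, camera layout)."""
    current_scene: str      # hypothetical ID or path of the current scene
    next_scene_prompt: str  # text description of the scene to generate next
    camera_layout: str      # hypothetical trajectory label, e.g. "pan_left"


def decompose_world(start_scene: str,
                    steps: List[Tuple[str, str]]) -> List[NextSceneTask]:
    """Chain (prompt, camera_layout) steps into a sequence of next-scene tasks."""
    tasks = []
    current = start_scene
    for index, (prompt, camera) in enumerate(steps):
        tasks.append(NextSceneTask(current, prompt, camera))
        # The output of this step becomes the input scene of the next one.
        current = f"generated_scene_{index}"
    return tasks


def aggregate_by_axis(scores: Dict[str, float],
                      axes: Dict[str, List[str]]) -> Dict[str, float]:
    """Average per-metric scores within each axis (plain mean is an assumption)."""
    return {axis: mean(scores[name] for name in names)
            for axis, names in axes.items()}
```

Because every model is asked to solve the same chained tasks, a 3D scene generator and a video model can be scored on the same outputs, which is the property the unified framework relies on.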

Insights and Challenges

A central insight from the WorldScore evaluation is the clear gap between 3D scene generation models and video generation models on static world tasks. Models such as WonderWorld and LucidDreamer stand out for 3D consistency and camera controllability, but this strength does not carry over to dynamic environments, a limitation underscored by the weak performance of models such as 4D-fy on full 4D world tasks.

Conversely, video generation models struggle with camera control; even top performers such as CogVideoX-I2V have difficulty generating coherent, controllable camera movements. Moreover, video models that achieve larger motion magnitude often do so at the cost of motion smoothness, which indicates that balancing the two remains an open problem.
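
To make the magnitude versus smoothness trade-off concrete, the sketch below computes two simple proxies on a one-dimensional track of a point's position per frame. These are illustrative stand-ins, not the metrics defined by WorldScore; the function names and the choice of first and second differences are assumptions.

```python
from typing import List


def motion_magnitude(positions: List[float]) -> float:
    """Mean absolute frame-to-frame displacement (needs at least 2 frames)."""
    steps = [b - a for a, b in zip(positions, positions[1:])]
    return sum(abs(step) for step in steps) / len(steps)


def motion_roughness(positions: List[float]) -> float:
    """Mean absolute change in displacement between frames (needs at least 3 frames).

    Lower values mean smoother motion; 0.0 means constant velocity.
    """
    steps = [b - a for a, b in zip(positions, positions[1:])]
    changes = [b - a for a, b in zip(steps, steps[1:])]
    return sum(abs(change) for change in changes) / len(changes)


smooth_track = [0.0, 1.0, 2.0, 3.0, 4.0]   # steady drift
jittery_track = [0.0, 3.0, 1.0, 5.0, 2.0]  # larger but erratic motion

print(motion_magnitude(smooth_track), motion_roughness(smooth_track))    # 1.0 0.0
print(motion_magnitude(jittery_track), motion_roughness(jittery_track))  # 3.0 6.0
```

The jittery track scores higher on magnitude but far worse on roughness, which is the pattern the evaluation reports for some video models.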

Notably, the evaluation also shows room for improvement in motion placement accuracy: in dynamic scenes, generated motion does not always occur in the intended regions. This points to a need for motion modeling approaches that not only introduce motion but also ensure that it is contextually relevant and accurately placed.
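
One simple way to think about motion placement is the share of total motion that falls inside the region where motion was requested. The sketch below is a hypothetical proxy, not the benchmark's motion accuracy metric; it assumes a per-pixel motion magnitude map and a boolean mask of the intended region, both given as nested lists of the same shape.

```python
from typing import List


def motion_in_region_ratio(motion: List[List[float]],
                           mask: List[List[bool]]) -> float:
    """Fraction of total motion magnitude that lies inside the target mask."""
    total = 0.0
    inside = 0.0
    for motion_row, mask_row in zip(motion, mask):
        for magnitude, is_target in zip(motion_row, mask_row):
            total += magnitude
            if is_target:
                inside += magnitude
    # A clip with no motion at all gets 0.0 rather than a division error.
    return inside / total if total > 0 else 0.0


# Motion appears in the right-hand column, but only the top-right pixel
# belongs to the intended region.
motion_map = [[0.0, 2.0],
              [0.0, 2.0]]
target_mask = [[False, True],
               [False, False]]
print(motion_in_region_ratio(motion_map, target_mask))  # 0.5
```

A ratio well below 1.0 means the model produced motion, but much of it landed outside the intended region.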

Implications and Future Directions

The WorldScore benchmark does more than assess current performance; it provides a foundation for iterative model improvement. By pinpointing specific limitations, such as camera control in video generation models and motion dynamics in 3D scene models, WorldScore gives researchers a focused roadmap for enhancing model capabilities.

The benchmark's dataset and diagnostics could motivate work that bridges 3D and 4D scene representations and that gives video models greater controllability and overall coherence. By laying out model strengths and weaknesses clearly, WorldScore also encourages research into hybrid models that combine the strengths of different modeling paradigms.

In conclusion, WorldScore is a notable step toward realizing the potential of world generation systems across diverse applications. By providing a common platform for comparative evaluation, the benchmark is positioned to drive progress toward more capable and versatile world generation models.

Authors (5)
