- The paper introduces a dual-mode pre-training strategy that unifies the multi-view-oriented (MV-oriented) and 3D-oriented paradigms to produce high-quality 3D scenes within seconds.
- It leverages cross-mode distillation with a consistency loss to enhance visual fidelity and ensure semantic coherence while drastically reducing inference time.
- Robust performance is achieved through out-of-distribution co-training on diverse datasets, supporting both text-to-3D and image-to-3D scene generation applications.
FlashWorld: High-Quality 3D Scene Generation within Seconds
FlashWorld introduces a unified, efficient framework for high-fidelity 3D scene generation from a single image or text prompt, achieving significant improvements in both visual quality and inference speed over prior methods. The approach departs from the conventional multi-view-oriented (MV-oriented) paradigm by directly generating 3D Gaussian representations, and further bridges the quality gap between MV- and 3D-oriented pipelines through a novel dual-mode pre-training and cross-mode distillation strategy.
Figure 1: FlashWorld enables fast and high-quality 3D scene generation across diverse scenes.
Background and Motivation
3D scene generation is a central problem for applications in gaming, robotics, and immersive environments. Traditional pipelines either assemble pre-existing 3D assets or reconstruct scenes from multi-view images, but these approaches often lack semantic coherence and multi-view consistency. The dominant MV-oriented paradigm employs a two-stage process: a diffusion model generates multi-view images, followed by 3D reconstruction. However, this leads to geometric and semantic inconsistencies and high computational overhead, with generation times ranging from several minutes to hours.
Recent 3D-oriented methods, which combine differentiable rendering with diffusion models, offer direct 3D scene generation but typically suffer from visual artifacts and require additional refinement, further impacting efficiency. Existing distillation techniques for diffusion models, such as consistency model distillation and distribution matching distillation (DMD), can accelerate inference but tend to amplify the limitations of their respective paradigms.

Figure 2: A brief comparison of different 3D scene generation methods.
Methodology
Dual-Mode Pre-Training
FlashWorld's core innovation is a dual-mode pre-training strategy, leveraging a video diffusion model as the backbone. The model is trained to operate in both MV-oriented and 3D-oriented modes (a minimal training-step sketch follows the list below):
- MV-oriented mode: Optimized for high visual fidelity, generating multi-view images conditioned on camera parameters and input (image or text).
- 3D-oriented mode: Outputs 3D Gaussian Splatting (3DGS) parameters, ensuring inherent 3D consistency by decoding multi-view features into 3D Gaussians and rendering novel views.
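The following is a minimal, hypothetical sketch of how a single dual-mode training step might branch between the two modes. The class and function names (`DualModeDenoiser`, `render_gaussians`, the specific channel counts) are illustrative assumptions, not the paper's implementation; the 3DGS renderer is replaced by a placeholder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualModeDenoiser(nn.Module):
    """Illustrative dual-mode denoiser: one shared backbone, two output heads.

    The MV-oriented head predicts denoised multi-view latents; the 3D-oriented
    head predicts per-pixel Gaussian parameters that are rendered back into the
    supervised views.
    """

    def __init__(self, dim: int = 64, gaussian_channels: int = 12):
        super().__init__()
        self.backbone = nn.Sequential(nn.Conv3d(4, dim, 1), nn.SiLU(), nn.Conv3d(dim, dim, 1))
        self.mv_head = nn.Conv3d(dim, 4, 1)                  # multi-view latent prediction
        self.gs_head = nn.Conv3d(dim, gaussian_channels, 1)  # 3DGS parameters (pos/scale/rot/opacity/color)

    def forward(self, noisy_latents, cam_embed, mode: str):
        # cam_embed is assumed to be already projected to the latent channel count.
        feats = self.backbone(noisy_latents + cam_embed)
        if mode == "mv":
            return self.mv_head(feats)   # images only; 3D consistency not guaranteed
        return self.gs_head(feats)       # Gaussians; 3D-consistent by construction


def render_gaussians(gaussian_params, cam_embed):
    """Stand-in for a differentiable 3DGS rasterizer; real code would splat per camera."""
    return gaussian_params[:, :4]


def training_step(model, noisy_latents, cam_embed, clean_latents, target_views, mode: str):
    """One pre-training step; the mode is sampled per batch."""
    pred = model(noisy_latents, cam_embed, mode)
    if mode == "mv":
        return F.mse_loss(pred, clean_latents)        # simplified denoising objective
    rendered = render_gaussians(pred, cam_embed)      # render Gaussians into supervised views
    return F.mse_loss(rendered, target_views)         # rendering loss
```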
The denoising network is a Diffusion Transformer (DiT) augmented with 3D attention blocks. Camera parameters are encoded using Reference-Point Plücker Coordinates. The 3DGS decoder is initialized from the latent decoder, with modifications to support the additional Gaussian parameters.
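For context, standard Plücker ray coordinates assign each pixel the pair (d, o × d) built from its ray origin and direction; the "reference-point" variant presumably re-centers origins around a chosen reference point. The sketch below implements the standard embedding with an optional origin shift as a guess at that variant; the ordering of the two 3-vectors and the re-centering are assumptions.

```python
import torch

def plucker_ray_embedding(K, c2w, H, W, ref_point=None):
    """Per-pixel Plücker coordinates (d, o x d) for a pinhole camera.

    K: (3, 3) intrinsics, c2w: (4, 4) camera-to-world pose.
    ref_point: optional (3,) world-space point used to re-center ray origins,
    an assumed reading of the "reference-point" variant.
    Returns an (H, W, 6) embedding.
    """
    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                            torch.arange(W, dtype=torch.float32), indexing="ij")
    pix = torch.stack([xs + 0.5, ys + 0.5, torch.ones_like(xs)], dim=-1)  # (H, W, 3)
    dirs_cam = pix @ torch.linalg.inv(K).T                                # back-project to camera rays
    dirs = dirs_cam @ c2w[:3, :3].T                                       # rotate rays into world space
    dirs = dirs / dirs.norm(dim=-1, keepdim=True)
    origin = c2w[:3, 3]
    if ref_point is not None:
        origin = origin - ref_point                                       # re-center around the reference point
    origins = origin.expand_as(dirs)
    moment = torch.cross(origins, dirs, dim=-1)                           # o x d
    return torch.cat([dirs, moment], dim=-1)                              # (H, W, 6)
```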
Figure 3: Method overview.
Cross-Mode Post-Training Distillation
To bridge the quality gap, FlashWorld employs a cross-mode post-training distillation strategy:
- The MV-oriented mode (teacher) provides high-quality visual supervision.
- The 3D-oriented mode (student) learns to match the teacher's distribution via DMD2, inheriting visual fidelity while maintaining 3D consistency.
- The student model is distilled into a few-step generator, drastically reducing inference time.
A cross-mode consistency loss regularizes the 3D-oriented mode using predictions from the MV-oriented mode, mitigating artifacts such as floating or duplicated elements.
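A hedged sketch of what that consistency term could look like follows: the 3D-oriented student's rendered views are pulled toward the MV-oriented prediction for the same noisy input. The function names, the mode flags, and the choice of an MSE penalty are illustrative assumptions; the DMD2 distribution-matching objective is applied alongside it as described above.

```python
import torch
import torch.nn.functional as F

def cross_mode_consistency_loss(model, noisy_latents, cam_embed, render_fn):
    """Regularize the 3D-oriented (student) output with the MV-oriented prediction.

    render_fn: differentiable 3DGS renderer mapping Gaussian parameters to views.
    """
    with torch.no_grad():
        # MV-oriented prediction: visually sharp but not guaranteed 3D-consistent.
        mv_views = model(noisy_latents, cam_embed, mode="mv")
    # 3D-oriented prediction: Gaussians rendered back into the same views.
    gaussians = model(noisy_latents, cam_embed, mode="3d")
    rendered = render_fn(gaussians, cam_embed)
    # Pull rendered views toward the MV prediction (MSE is a stand-in;
    # a perceptual loss would also be a reasonable choice).
    return F.mse_loss(rendered, mv_views)
```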
Out-of-Distribution Data Co-Training
To enhance generalization, FlashWorld incorporates a large corpus of single-view images and text prompts, paired with simulated camera trajectories, during post-training. This strategy broadens the input distribution, improving robustness to out-of-distribution scenarios and diverse camera paths.
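The exact simulated trajectories are not specified here; the snippet below is a purely illustrative example of the kind of synthetic camera path that could be paired with a single image (a gentle look-at arc with forward motion). All parameters and the look-at target are assumptions.

```python
import numpy as np

def simulated_trajectory(num_views=24, radius=0.3, forward=0.5):
    """Toy camera path for a single-view image: a small arc that moves forward
    while keeping an assumed scene center in view (illustrative only)."""
    poses = []
    for t in np.linspace(0.0, 1.0, num_views):
        angle = (t - 0.5) * np.pi / 6                        # +/- 15 degree arc
        position = np.array([radius * np.sin(angle), 0.0, forward * t])
        target = np.array([0.0, 0.0, 2.0])                   # assumed look-at point
        z = target - position; z = z / np.linalg.norm(z)     # camera forward axis
        x = np.cross(np.array([0.0, 1.0, 0.0]), z); x = x / np.linalg.norm(x)
        y = np.cross(z, x)
        c2w = np.eye(4)
        c2w[:3, :3] = np.stack([x, y, z], axis=1)
        c2w[:3, 3] = position
        poses.append(c2w)
    return np.stack(poses)                                   # (num_views, 4, 4)
```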
Experimental Results
Image-to-3D Scene Generation
FlashWorld demonstrates superior qualitative results compared to state-of-the-art MV-oriented baselines (CAT3D, Bolt3D, Wonderland), producing scenes with higher fidelity, more accurate geometry, and fewer artifacts. Baselines suffer from blurry textures, missing details, and geometric inconsistencies, while FlashWorld recovers intricate structures and maintains semantic coherence.

Figure 4: Image-to-3D scene generation results of different methods.
Text-to-3D Scene Generation
Against leading text-to-3D methods (Director3D, Prometheus, SplatFlow, VideoRFSplat), FlashWorld achieves more realistic, detailed, and semantically aligned scenes. Baselines exhibit blurry artifacts, incorrect geometries, and poor background realism, whereas FlashWorld generates fine-grained details and robust object structures.

Figure 5: Text-to-3D scene generation results of different methods.
Quantitatively, FlashWorld outperforms all baselines on most quality metrics (Q-Align, CLIP IQA+, CLIP Score) across T3Bench, DL3DV, and WorldScore datasets, while achieving an order-of-magnitude reduction in inference time (9 seconds per scene on a single H20 GPU).
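For reference, CLIP Score measures image-text agreement as the cosine similarity between CLIP embeddings. Below is a minimal way to compute it with Hugging Face `transformers`; the specific checkpoint and the ×100 scaling convention are assumptions, and the paper's exact evaluation setup may differ.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, prompt: str) -> float:
    """Cosine similarity between CLIP image and text embeddings, scaled by 100."""
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float(100.0 * (img * txt).sum(dim=-1))
```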
WorldScore Benchmark
On the WorldScore benchmark, FlashWorld attains the highest average score and fastest inference among all compared methods, excelling in style consistency and subjective quality. While slightly trailing in 3D consistency and content alignment (attributable to the lack of explicit depth supervision), FlashWorld's qualitative results reveal fewer unnatural transitions and more faithful scene generation.

Figure 6: 3D scene generation results of different methods on WorldScore benchmark.
Ablation Studies
Ablation experiments confirm the necessity of each component:
- MV-oriented models (with or without distillation) yield noisy, inconsistent reconstructions.
- 3D-oriented models without distillation produce blurry results.
- Removing cross-mode consistency loss leads to floating/duplicated artifacts.
- Excluding out-of-distribution data reduces semantic alignment and generalization, especially on benchmarks with distribution shifts.

Figure 7: Qualitative ablation studies.
RGBD Rendering and Depth Generalization
Despite the absence of explicit depth supervision, FlashWorld's 3DGS outputs enable the export of depth maps, demonstrating the model's capacity to learn meaningful geometric information from image-only supervision.
Figure 8: RGBD rendering results.
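Per-pixel depth can be read out of the same alpha-compositing pass that produces color: the expected depth is the transmittance-weighted sum of Gaussian depths along each ray. The sketch below shows that accumulation under the assumption that front-to-back sorted per-sample alphas and depths are already available; production 3DGS renderers expose this directly.

```python
import torch

def expected_depth(alphas: torch.Tensor, depths: torch.Tensor) -> torch.Tensor:
    """Alpha-composite per-sample depths into an expected depth map.

    alphas, depths: (..., N) per-pixel samples sorted front to back.
    Returns (...,) expected depth: sum_i T_i * alpha_i * z_i with
    T_i = prod_{j<i} (1 - alpha_j).
    """
    one_minus = 1.0 - alphas
    # Transmittance before each sample (exclusive cumulative product).
    T = torch.cumprod(
        torch.cat([torch.ones_like(alphas[..., :1]), one_minus[..., :-1]], dim=-1),
        dim=-1,
    )
    weights = T * alphas
    return (weights * depths).sum(dim=-1)
```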
Implementation Details
- Architecture: Dual-mode DiT initialized from WAN2.2-5B-IT2V, with 24 input views and a spatial downsampling factor of 16.
- Training: Pre-training (20k steps, 3 days) and post-training (10k steps, 2 days) on 64 NVIDIA H20 GPUs using FSDP and activation checkpointing (a minimal setup sketch follows this list). Batch size 64, bf16 precision.
- Datasets: MVImgNet, RealEstate10K, DL3DV10K for multi-view; proprietary and public datasets for out-of-distribution co-training.
- Inference: 9 seconds per scene (H20 GPU), supporting both image-to-3D and text-to-3D tasks in a unified model.
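The following is a hedged sketch of the kind of FSDP + bf16 + activation-checkpointing setup this recipe implies. The block class passed in (e.g. a DiT block) is hypothetical; the paper does not publish its exact wrapping policy, so this is a generic PyTorch pattern rather than FlashWorld's code.

```python
import functools
import torch
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    apply_activation_checkpointing, checkpoint_wrapper,
)

def shard_model(model: nn.Module, block_cls) -> nn.Module:
    """Wrap transformer blocks with FSDP in bf16 and enable activation checkpointing.

    block_cls: the transformer block class to shard and checkpoint (hypothetical).
    Assumes torch.distributed has already been initialized with a CUDA backend.
    """
    bf16 = MixedPrecision(param_dtype=torch.bfloat16,
                          reduce_dtype=torch.bfloat16,
                          buffer_dtype=torch.bfloat16)
    wrap_policy = functools.partial(transformer_auto_wrap_policy,
                                    transformer_layer_cls={block_cls})
    model = FSDP(model,
                 auto_wrap_policy=wrap_policy,
                 mixed_precision=bf16,
                 device_id=torch.cuda.current_device())
    # Re-compute each block's activations in the backward pass to save memory.
    apply_activation_checkpointing(model,
                                   checkpoint_wrapper_fn=checkpoint_wrapper,
                                   check_fn=lambda m: isinstance(m, block_cls))
    return model
```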
Limitations and Future Directions
FlashWorld's performance is still bounded by the diversity and coverage of available multi-view datasets. The model struggles with fine-grained geometry, mirror reflections, and articulated objects. Incorporating depth priors and more 3D-aware structural information may address these issues. Future work includes extending the framework to autoregressive and dynamic 4D scene generation.
Conclusion
FlashWorld establishes a new state-of-the-art in efficient, high-quality 3D scene generation by unifying MV- and 3D-oriented paradigms through dual-mode pre-training and cross-mode distillation. The approach achieves strong numerical results across multiple benchmarks, with significant improvements in inference speed and generalization. The framework is well-suited for real-world applications requiring rapid, high-fidelity 3D scene synthesis and provides a foundation for further advances in 3D and 4D generative modeling.