Think Before You Diffuse: LLMs-Guided Physics-Aware Video Generation
The paper "Think Before You Diffuse: LLMs-Guided Physics-Aware Video Generation" introduces DiffPhy, an innovative framework that incorporates LLMs and multimodal LLMs (MLLMs) into the video generation process, specifically targeting the synthesis of physically accurate and visually coherent videos from text prompts. Video generation models have demonstrated prowess in creating visually appealing content, but they commonly neglect the incorporation of real-world physics, which is a significant limitation that DiffPhy aims to address.
Key Contributions and Methodology
DiffPhy enhances physics awareness in video generation through a sequence of steps:
- LLM-Guided Reasoning: Leveraging LLMs such as GPT-4, DiffPhy begins by analyzing the input text prompt to extract the implicit physical context and phenomena, identifying the key physical principles and entities involved in the scenario. The framework uses Chain-of-Thought (CoT) prompting to deconstruct the input into components such as forces and kinematic relationships, which are then synthesized into enhanced prompts that guide the video generation model (see the sketch after this list).
- Enhanced Prompt Usage: The extracted physical context is integrated into the prompts to provide explicit physical cues during video synthesis. This enrichment makes the diffusion model aware of the desired physical interactions, bridging the gap between textual description and physical behavior.
- Fine-Tuning with MLLM Supervision: DiffPhy fine-tunes a pre-trained video diffusion model on the enhanced prompts while employing MLLMs as supervisory agents that assess the physical plausibility and semantic alignment of the generated content with the input prompt. The MLLM supervision is implemented through a set of loss functions - a physical phenomena loss, a commonsense loss, and a semantic consistency loss - which together push the model to respect both physical constraints and contextual semantics (a sketch of how such terms could be combined follows this list).
- Creation of High-Quality Video Dataset: Recognizing the limitations of synthetic datasets, the authors introduce a curated dataset of real-world videos covering diverse physical scenarios. This data-driven approach supports the effective fine-tuning of the video generation model, enhancing its capability to tackle complex physical phenomena in realistic settings.
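To make the LLM-guided reasoning step concrete, the following is a minimal sketch of how an input prompt could be enriched with explicit physical cues before conditioning the video model. It assumes an OpenAI-style chat API; the prompt template and the `enhance_prompt` helper are illustrative stand-ins, not the authors' code.

```python
# Illustrative sketch of LLM-guided physical reasoning via Chain-of-Thought prompting.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

REASONING_TEMPLATE = """You are a physics assistant for a text-to-video generator.
Think step by step about the scene described below:
1. List the objects and materials involved.
2. Identify the forces, kinematic relationships, and physical phenomena at play.
3. Rewrite the description as an enhanced prompt that states these physical cues explicitly.

Scene: {prompt}
Return only the enhanced prompt on the final line."""


def enhance_prompt(user_prompt: str, model: str = "gpt-4") -> str:
    """Ask the LLM to reason about implicit physics and return an enriched prompt."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": REASONING_TEMPLATE.format(prompt=user_prompt)}],
    )
    reasoning = response.choices[0].message.content
    # Keep only the final line: the enhanced prompt that conditions the video diffusion model.
    return reasoning.strip().splitlines()[-1]


if __name__ == "__main__":
    print(enhance_prompt("A glass of water tips over on a wooden table"))
```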
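Likewise, the MLLM-based supervision can be pictured as extra penalty terms added to the standard diffusion training objective. The snippet below is a hypothetical PyTorch sketch: the three loss names come from the paper, but the assumption that each MLLM judgment arrives as a score in [0, 1] and the simple weighted-sum combination are illustrative choices, not the authors' formulation.

```python
# Hypothetical combination of MLLM-derived objectives with the diffusion loss.
import torch


def total_loss(
    diffusion_loss: torch.Tensor,     # standard denoising objective of the video diffusion model
    phys_score: torch.Tensor,         # assumed MLLM rating of physical-phenomena correctness, in [0, 1]
    commonsense_score: torch.Tensor,  # assumed MLLM rating of physical commonsense, in [0, 1]
    semantic_score: torch.Tensor,     # assumed MLLM rating of prompt-video alignment, in [0, 1]
    w_phys: float = 1.0,
    w_common: float = 1.0,
    w_sem: float = 1.0,
) -> torch.Tensor:
    """Penalize low MLLM scores alongside the usual diffusion objective (illustrative weighting)."""
    physical_phenomena_loss = w_phys * (1.0 - phys_score)
    commonsense_loss = w_common * (1.0 - commonsense_score)
    semantic_consistency_loss = w_sem * (1.0 - semantic_score)
    return diffusion_loss + physical_phenomena_loss + commonsense_loss + semantic_consistency_loss


# Example: a sample whose physics the MLLM judges plausible but whose semantics are weak.
loss = total_loss(
    diffusion_loss=torch.tensor(0.12),
    phys_score=torch.tensor(0.9),
    commonsense_score=torch.tensor(0.8),
    semantic_score=torch.tensor(0.4),
)
```

Note that scores produced by an MLLM judge are not directly differentiable with respect to the generator, so a real implementation would need a reward-style or distillation-style training scheme; the sketch only illustrates how the three objectives could be weighed against one another.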
Empirical Evaluation
The framework was evaluated on public benchmarks such as VideoPhy2 and PhyGenBench, where it outperforms existing models. DiffPhy shows substantial improvements in physical coherence and semantic consistency over mainstream diffusion-based video generators such as Wan 2.1-14B. Notably, it aligns well with physical commonsense across domains including mechanics, optics, and thermal processes.
Implications and Future Work
The integration of LLMs for physical context extraction and the strategic use of MLLMs for fine-tuning mark a significant stride toward achieving physics-aware video generation. Beyond immediate applications in content creation and virtual simulations, these advancements bear implications for fields like robotics, where realistic physical environment modeling is crucial.
Future research directions suggested by the authors include scaling up the dataset to diversify the training scenarios and refining the multimodal evaluation to strengthen video-based physical reasoning. Enhanced datasets and more sophisticated training paradigms could enable DiffPhy to generate even more nuanced and contextually rich video content.
In conclusion, DiffPhy represents a meaningful step toward bridging the gap between text-based prompts and physically coherent video generation, offering a framework adaptable to a variety of complex real-world simulations where visual authenticity is paramount.