- The paper introduces DreamRunner, which integrates LLM-based hierarchical planning and retrieval-augmented test-time adaptation to generate coherent multi-scene storytelling videos.
- It employs a novel SR3AI module with region-based 3D attention to ensure precise object-motion binding and frame-by-frame semantic continuity.
- Experimental results demonstrate significant gains in character consistency and transition smoothness, outperforming existing state-of-the-art storytelling video generation (SVG) methods.
The paper "DreamRunner: Fine-Grained Storytelling Video Generation with Retrieval-Augmented Motion Adaptation" presents an innovative approach to storytelling video generation (SVG) through the use of LLMs for planning and retrieval-augmented test-time adaptation for motion customization. The proposed method, DreamRunner, addresses the complexities of generating multi-scene, multi-object, and motion-based videos from text scripts by incorporating a structured three-stage framework.
Overview of DreamRunner:
- Plan Generation: The framework begins by generating a hierarchical video plan from the story narration with an LLM: first a high-level plan of character-driven events, then a fine-grained plan for precise entity motions, ensuring narrative coherence across scenes (a minimal sketch of this step follows this list).
- Motion Retrieval and Prior Learning: Videos exhibiting the desired motions are automatically retrieved from a large-scale database and used to fine-tune motion priors through test-time adaptation (a retrieval sketch also follows this list). In parallel, reference images are used to learn subject priors that keep each character's appearance consistent across scenes.
- Video Generation with Region-Based Diffusion: This stage involves a region-based 3D attention and prior injection module (SR3AI) integrated within a diffusion model. SR3AI provides fine-grained control over object-motion binding and frame-by-frame semantic continuity, thus achieving precise and consistent video generation.
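To make the planning stage concrete, here is a minimal sketch of two-level plan generation. It assumes an OpenAI-style chat client and a simplified JSON schema; the prompts, the model name (`gpt-4o`), and the plan fields are illustrative placeholders, not the paper's actual prompt templates.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

HIGH_LEVEL = (
    "Split this story into scenes. Return a JSON list where each item has "
    "'scene_id', 'characters', and a one-sentence 'event'.\n\nStory: {story}"
)
FINE_GRAINED = (
    "For this scene, return a JSON list of entities, each with an "
    "'appearance' description and a short 'motion' phrase.\n\nScene: {event}"
)

def ask(prompt: str) -> str:
    # Real code would validate the JSON and retry on parse failures.
    resp = client.chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

def generate_plan(story: str) -> list[dict]:
    """High-level scene plan first, then a fine-grained per-entity plan."""
    scenes = json.loads(ask(HIGH_LEVEL.format(story=story)))
    for scene in scenes:
        scene["entities"] = json.loads(ask(FINE_GRAINED.format(event=scene["event"])))
    return scenes
```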
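The retrieval stage can likewise be sketched with an off-the-shelf BM25 index over video captions plus simple attribute filters. The toy database, attribute names, and filtering rule below are assumptions for illustration, not DreamRunner's actual database schema.

```python
from rank_bm25 import BM25Okapi

# Hypothetical database: each record has a caption and coarse attributes.
video_db = [
    {"path": "vid_001.mp4", "caption": "a dog running across a grassy field",
     "attributes": {"single_subject": True, "camera_static": True}},
    {"path": "vid_002.mp4", "caption": "two people dancing in a crowded hall",
     "attributes": {"single_subject": False, "camera_static": False}},
]

corpus = [rec["caption"].lower().split() for rec in video_db]
bm25 = BM25Okapi(corpus)

def retrieve(motion_query: str, k: int = 5, require_single_subject: bool = True):
    """Rank by BM25 on captions, then drop clips failing attribute filters."""
    scores = bm25.get_scores(motion_query.lower().split())
    ranked = sorted(zip(scores, video_db), key=lambda p: p[0], reverse=True)
    hits = [rec for score, rec in ranked
            if not require_single_subject or rec["attributes"]["single_subject"]]
    return hits[:k]

print([r["path"] for r in retrieve("dog running")])
```

The retrieved clips then serve as training data for the test-time motion adaptation described under Technical Contributions below.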
Technical Contributions:
- Hierarchical LLM Planning: The use of LLMs for dual-level planning, involving a broad narrative scope followed by detailed entity-specific plans, enhances the ability to generate coherent long-form videos aligned with story scripts.
- Retrieval-Augmented Adaptation: The pipeline retrieves videos automatically via BM25 ranking combined with attribute-based filtering, so that the clips used to train motion priors are both relevant and motion-rich. Combined with parameter-efficient tuning (e.g., LoRA), this test-time adaptation improves the fidelity of depicted motions (a minimal LoRA sketch follows this list).
- SR3AI Module: A spatio-temporal, region-based 3D attention mechanism inside the diffusion model gives fine-grained, object-and-motion-level semantic control and prevents interference between multiple objects and motions, yielding seamless transitions and consistent character depiction (see the masking sketch after this list).
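Below is a minimal sketch of LoRA-based test-time adaptation, assuming Hugging Face `peft` over a generic torch denoiser. The target module names, the `backbone` call signature, and the simplified training objective (the diffusion noise schedule is omitted) are all assumptions, not DreamRunner's actual implementation.

```python
import torch
from peft import LoraConfig, get_peft_model

def adapt_motion_prior(backbone: torch.nn.Module, clips: list[dict], steps: int = 200):
    """Fine-tune low-rank adapters on retrieved clips; base weights stay frozen."""
    config = LoraConfig(
        r=8, lora_alpha=16,
        target_modules=["to_q", "to_k", "to_v"],  # assumed attention projections
    )
    model = get_peft_model(backbone, config)
    trainable = [p for p in model.parameters() if p.requires_grad]
    optim = torch.optim.AdamW(trainable, lr=1e-4)
    for step in range(steps):
        batch = clips[step % len(clips)]  # {'latents': ..., 'text_emb': ...}
        noise = torch.randn_like(batch["latents"])
        t = torch.randint(0, 1000, (batch["latents"].shape[0],))
        # Simplified denoising objective: predict the added noise.
        pred = model(batch["latents"] + noise, t, batch["text_emb"])
        loss = torch.nn.functional.mse_loss(pred, noise)
        loss.backward()
        optim.step()
        optim.zero_grad()
    return model
```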
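The object-motion binding idea behind SR3AI can be illustrated with a cross-attention mask that restricts each spatio-temporal region to its own entity's text tokens. The flattened-token layout and the background rule below are simplifying assumptions, not the module's exact design.

```python
import torch

def region_cross_attention_mask(
    region_ids: torch.Tensor,                  # (T*H*W,) entity id per latent token
    entity_spans: dict[int, tuple[int, int]],  # entity id -> [start, end) text span
    num_text_tokens: int,
) -> torch.Tensor:
    """Boolean mask of shape (T*H*W, num_text_tokens); True = may attend."""
    mask = torch.zeros(region_ids.numel(), num_text_tokens, dtype=torch.bool)
    for entity, (start, end) in entity_spans.items():
        mask[region_ids == entity, start:end] = True
    mask[region_ids == 0] = True  # background tokens attend to the full prompt
    return mask

# Toy layout: one background token, two tokens of entity 1, one of entity 2.
mask = region_cross_attention_mask(
    region_ids=torch.tensor([0, 1, 1, 2]),
    entity_spans={1: (0, 4), 2: (4, 8)},
    num_text_tokens=8,
)
# Additive bias on attention logits before the softmax: 0 where allowed, -inf elsewhere.
bias = torch.where(mask, 0.0, float("-inf"))
```

Applying such a bias to the attention logits zeroes out cross-region attention, which is what binds each motion to its intended object consistently across frames.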
Experimental Results:
The paper demonstrates DreamRunner's superior performance over state-of-the-art SVG methods such as VideoDirectorGPT and VLogger across several benchmarks, including the newly introduced DreamStorySet dataset. Specifically:
- Character Consistency: A 13.1% relative improvement in CLIP scores and a 33.4% improvement in DINO scores, indicating successful subject prior learning (a metric sketch follows these results).
- Text Alignment and Transitions: Marked gains in both fine-grained and full-prompt adherence, as measured by CLIP and ViCLIP scores, plus a 27.2% improvement in transition smoothness, highlighting the SR3AI module's contribution to narrative coherence.
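For reference, character consistency of this kind is typically computed as embedding similarity between a reference image and generated frames. Below is a minimal CLIP-based sketch using Hugging Face transformers; the model choice and frame sampling are assumptions, and the paper's DINO score would follow the same pattern with a DINO backbone.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def consistency_score(reference: Image.Image, frames: list[Image.Image]) -> float:
    """Mean cosine similarity between a reference image and generated frames."""
    inputs = processor(images=[reference, *frames], return_tensors="pt")
    with torch.no_grad():
        emb = model.get_image_features(**inputs)
    emb = emb / emb.norm(dim=-1, keepdim=True)
    return (emb[1:] @ emb[0]).mean().item()
```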
DreamRunner also generalizes to compositional text-to-video generation, posting strong results across the compositional metrics of T2V-CompBench. The paper argues that this moves open-source models toward closed-source performance levels on dynamic attributes and spatial relationships.
In conclusion, DreamRunner is a comprehensive, well-structured approach that leverages LLMs and retrieval-augmented techniques to craft detailed, immersive storytelling videos, addressing key challenges in SVG with strong empirical support.