- The paper introduces DreamRunner, which integrates LLM-based hierarchical planning and retrieval-augmented test-time adaptation to generate coherent multi-scene storytelling videos.
- It employs a novel SR3AI module with region-based 3D attention to ensure precise object-motion binding and frame-by-frame semantic continuity.
- Experimental results demonstrate significant gains in character consistency and transition smoothness, outperforming existing state-of-the-art storytelling video generation (SVG) methods.
The paper "DreamRunner: Fine-Grained Storytelling Video Generation with Retrieval-Augmented Motion Adaptation" presents an innovative approach to storytelling video generation (SVG) through the use of LLMs for planning and retrieval-augmented test-time adaptation for motion customization. The proposed method, DreamRunner, addresses the complexities of generating multi-scene, multi-object, and motion-based videos from text scripts by incorporating a structured three-stage framework.
Overview of DreamRunner:
- Plan Generation: The framework begins by generating a hierarchical video plan from the story narration with an LLM: first a high-level plan of character-driven events, then a fine-grained plan for precise entity motions, ensuring narrative coherence across scenes (a minimal sketch of this step follows this list).
- Motion Retrieval and Prior Learning: Videos exhibiting the desired motions are automatically retrieved from a large-scale database and used to fine-tune motion priors through test-time adaptation (a retrieval sketch also follows this list). In parallel, reference images are used to learn subject priors that keep each character's appearance consistent across scenes.
- Video Generation with Region-Based Diffusion: This stage involves a region-based 3D attention and prior injection module (SR3AI) integrated within a diffusion model. SR3AI provides fine-grained control over object-motion binding and frame-by-frame semantic continuity, thus achieving precise and consistent video generation.
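To make the planning stage concrete, here is a minimal sketch of two-level plan generation. It assumes an OpenAI-style chat client and a simplified JSON schema; the prompts, the model name (`gpt-4o`), and the plan fields are illustrative placeholders, not the paper's actual prompt templates.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

HIGH_LEVEL = (
    "Split this story into scenes. Return a JSON list where each item has "
    "'scene_id', 'characters', and a one-sentence 'event'.\n\nStory: {story}"
)
FINE_GRAINED = (
    "For this scene, return a JSON list of entities, each with an "
    "'appearance' description and a short 'motion' phrase.\n\nScene: {event}"
)

def ask(prompt: str) -> str:
    # Real code would validate the JSON and retry on parse failures.
    resp = client.chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

def generate_plan(story: str) -> list[dict]:
    """High-level scene plan first, then a fine-grained per-entity plan."""
    scenes = json.loads(ask(HIGH_LEVEL.format(story=story)))
    for scene in scenes:
        scene["entities"] = json.loads(ask(FINE_GRAINED.format(event=scene["event"])))
    return scenes
```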
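The retrieval stage can likewise be sketched with an off-the-shelf BM25 index over video captions plus simple attribute filters. The toy database, attribute names, and filtering rule below are assumptions for illustration, not DreamRunner's actual database schema.

```python
from rank_bm25 import BM25Okapi

# Hypothetical database: each record has a caption and coarse attributes.
video_db = [
    {"path": "vid_001.mp4", "caption": "a dog running across a grassy field",
     "attributes": {"single_subject": True, "camera_static": True}},
    {"path": "vid_002.mp4", "caption": "two people dancing in a crowded hall",
     "attributes": {"single_subject": False, "camera_static": False}},
]

corpus = [rec["caption"].lower().split() for rec in video_db]
bm25 = BM25Okapi(corpus)

def retrieve(motion_query: str, k: int = 5, require_single_subject: bool = True):
    """Rank by BM25 on captions, then drop clips failing attribute filters."""
    scores = bm25.get_scores(motion_query.lower().split())
    ranked = sorted(zip(scores, video_db), key=lambda p: p[0], reverse=True)
    hits = [rec for score, rec in ranked
            if not require_single_subject or rec["attributes"]["single_subject"]]
    return hits[:k]

print([r["path"] for r in retrieve("dog running")])
```

The retrieved clips then serve as training data for the test-time motion adaptation described under Technical Contributions below.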
Technical Contributions:
- Hierarchical LLM Planning: The use of LLMs for dual-level planning, involving a broad narrative scope followed by detailed entity-specific plans, enhances the ability to generate coherent long-form videos aligned with story scripts.
- Retrieval-Augmented Adaptation: The pipeline retrieves videos automatically via BM25 ranking combined with attribute-based filtering, so that the clips used to train motion priors are both relevant and motion-rich. Combined with parameter-efficient tuning (e.g., LoRA), this test-time adaptation improves the fidelity of depicted motions (a minimal LoRA sketch follows this list).
- SR3AI Module: A spatio-temporal, region-based 3D attention mechanism inside the diffusion model gives fine-grained, object-and-motion-level semantic control and prevents interference between multiple objects and motions, yielding seamless transitions and consistent character depiction (see the masking sketch after this list).
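Below is a minimal sketch of LoRA-based test-time adaptation, assuming Hugging Face `peft` over a generic torch denoiser. The target module names, the `backbone` call signature, and the simplified training objective (the diffusion noise schedule is omitted) are all assumptions, not DreamRunner's actual implementation.

```python
import torch
from peft import LoraConfig, get_peft_model

def adapt_motion_prior(backbone: torch.nn.Module, clips: list[dict], steps: int = 200):
    """Fine-tune low-rank adapters on retrieved clips; base weights stay frozen."""
    config = LoraConfig(
        r=8, lora_alpha=16,
        target_modules=["to_q", "to_k", "to_v"],  # assumed attention projections
    )
    model = get_peft_model(backbone, config)
    trainable = [p for p in model.parameters() if p.requires_grad]
    optim = torch.optim.AdamW(trainable, lr=1e-4)
    for step in range(steps):
        batch = clips[step % len(clips)]  # {'latents': ..., 'text_emb': ...}
        noise = torch.randn_like(batch["latents"])
        t = torch.randint(0, 1000, (batch["latents"].shape[0],))
        # Simplified denoising objective: predict the added noise.
        pred = model(batch["latents"] + noise, t, batch["text_emb"])
        loss = torch.nn.functional.mse_loss(pred, noise)
        loss.backward()
        optim.step()
        optim.zero_grad()
    return model
```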
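The object-motion binding idea behind SR3AI can be illustrated with a cross-attention mask that restricts each spatio-temporal region to its own entity's text tokens. The flattened-token layout and the background rule below are simplifying assumptions, not the module's exact design.

```python
import torch

def region_cross_attention_mask(
    region_ids: torch.Tensor,                  # (T*H*W,) entity id per latent token
    entity_spans: dict[int, tuple[int, int]],  # entity id -> [start, end) text span
    num_text_tokens: int,
) -> torch.Tensor:
    """Boolean mask of shape (T*H*W, num_text_tokens); True = may attend."""
    mask = torch.zeros(region_ids.numel(), num_text_tokens, dtype=torch.bool)
    for entity, (start, end) in entity_spans.items():
        mask[region_ids == entity, start:end] = True
    mask[region_ids == 0] = True  # background tokens attend to the full prompt
    return mask

# Toy layout: one background token, two tokens of entity 1, one of entity 2.
mask = region_cross_attention_mask(
    region_ids=torch.tensor([0, 1, 1, 2]),
    entity_spans={1: (0, 4), 2: (4, 8)},
    num_text_tokens=8,
)
# Additive bias on attention logits before the softmax: 0 where allowed, -inf elsewhere.
bias = torch.where(mask, 0.0, float("-inf"))
```

Applying such a bias to the attention logits zeroes out cross-region attention, which is what binds each motion to its intended object consistently across frames.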
Experimental Results:
The paper demonstrates DreamRunner's superior performance over state-of-the-art SVG methods such as VideoDirectorGPT and VLogger across several benchmarks, including the newly introduced DreamStorySet dataset. Specifically:
- Character Consistency: A 13.1% relative improvement in CLIP scores and a 33.4% improvement in DINO scores, indicating successful subject prior learning (a metric sketch follows these results).
- Text Alignment and Transitions: Marked gains in both fine-grained and full-prompt adherence, as measured by CLIP and ViCLIP scores, plus a 27.2% improvement in transition smoothness, highlighting the SR3AI module's contribution to narrative coherence.
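For reference, character consistency of this kind is typically computed as embedding similarity between a reference image and generated frames. Below is a minimal CLIP-based sketch using Hugging Face transformers; the model choice and frame sampling are assumptions, and the paper's DINO score would follow the same pattern with a DINO backbone.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def consistency_score(reference: Image.Image, frames: list[Image.Image]) -> float:
    """Mean cosine similarity between a reference image and generated frames."""
    inputs = processor(images=[reference, *frames], return_tensors="pt")
    with torch.no_grad():
        emb = model.get_image_features(**inputs)
    emb = emb / emb.norm(dim=-1, keepdim=True)
    return (emb[1:] @ emb[0]).mean().item()
```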
DreamRunner also generalizes to compositional text-to-video generation, posting strong results across the compositional metrics of T2V-CompBench. The paper argues that this moves open-source models toward closed-source performance levels on dynamic attributes and spatial relationships.
In conclusion, DreamRunner is a comprehensive, well-structured approach that leverages LLMs and retrieval-augmented techniques to craft detailed, immersive storytelling videos, addressing key challenges in SVG with strong empirical support.