- The paper introduces BlobGEN-Vid, a novel text-to-video generation approach that uses unique "blob video representations" to achieve enhanced compositional control over objects.
- BlobGEN-Vid employs a model-agnostic architecture with masked 3D attention and context-interpolation modules to improve spatial-temporal consistency and semantic smoothness.
- Experimental results demonstrate that BlobGEN-Vid significantly improves video quality (FVD) and compositional controllability (mIoU, CLIP-based semantic alignment) compared to previous methods on complex text prompts.
Overview of BlobGEN-Vid: Compositional Text-to-Video Generation with Blob Video Representations
The field of text-to-video generation has seen rapid progress in producing realistic and visually compelling videos. However, existing models often struggle to comprehend complex text prompts or to synthesize videos involving multiple objects, because they offer little compositional control. The paper introduces BlobGEN-Vid, which addresses these challenges by leveraging blob video representations to gain fine-grained control over the generation process, enabling stronger zero-shot video generation and enhanced layout controllability.
BlobGEN-Vid decomposes videos into visual primitives known as blob video representations, an adaptive format that affords control over both object movement and visual appearance. Each blob pairs parameters defining an object's spatial extent as a tilted ellipse (center, size, and orientation) with a free-form text description of its visual properties. This dual structure lets users steer motion and semantics along both the spatial and temporal dimensions of a video composition.
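To make the representation concrete, here is a minimal sketch of how per-frame blob parameters might be organized. The field names, normalization convention, and toy trajectory are illustrative assumptions, not the paper's actual interface.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Blob:
    """One object's blob parameters in a single frame (hypothetical field names)."""
    cx: float         # ellipse center x, normalized to [0, 1]
    cy: float         # ellipse center y, normalized to [0, 1]
    width: float      # semi-axis along the ellipse's major direction
    height: float     # semi-axis along the ellipse's minor direction
    angle: float      # orientation of the tilted ellipse, in radians
    description: str  # free-form text describing the object's appearance

# A "blob video" is then a sequence of per-frame blob sets: one list per frame.
BlobVideo = List[List[Blob]]

def example_blob_video(num_frames: int = 4) -> BlobVideo:
    """Toy example: a single red ball drifting to the right across frames."""
    return [
        [Blob(cx=0.2 + 0.1 * t, cy=0.5, width=0.1, height=0.1,
              angle=0.0, description="a small red rubber ball")]
        for t in range(num_frames)
    ]
```

Editing an object's motion then amounts to changing its ellipse trajectory, while editing its appearance only changes the text attribute.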
Central to BlobGEN-Vid is its model-agnostic architecture, which plugs into both U-Net-based and diffusion transformer (DiT)-based diffusion models. It adds a masked 3D attention module that improves spatial-temporal consistency and a context-interpolation module that keeps semantics smooth across frames. Together, these components let the model achieve state-of-the-art performance on layout-guided video generation tasks.
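The section does not spell out the module's implementation; as a rough illustration of the masking idea only, the sketch below restricts cross-attention so that each visual token attends exclusively to the blobs whose ellipse covers its location, across all frames. The function names, token grid, and masking convention are assumptions.

```python
import math
import torch
import torch.nn.functional as F

def ellipse_mask(cx, cy, w, h, angle, grid=16):
    """Rasterize a tilted ellipse onto a grid x grid map of visual-token positions."""
    ys, xs = torch.meshgrid(
        torch.linspace(0, 1, grid), torch.linspace(0, 1, grid), indexing="ij")
    dx, dy = xs - cx, ys - cy
    xr = dx * math.cos(angle) + dy * math.sin(angle)   # rotate into the ellipse frame
    yr = -dx * math.sin(angle) + dy * math.cos(angle)
    return (xr / w) ** 2 + (yr / h) ** 2 <= 1.0        # bool [grid, grid]

def masked_blob_attention(video_tokens, blob_tokens, keep):
    """
    video_tokens: [T*H*W, d]   visual tokens across all frames (queries)
    blob_tokens:  [T*K, d]     per-frame blob text embeddings (keys/values)
    keep:         [T*H*W, T*K] bool; True where a visual token may attend to a blob,
                  e.g. because it lies inside that blob's ellipse (see ellipse_mask).
    Rows with no True entry would produce NaNs, so in practice a global/background
    entry should stay unmasked for every token.
    """
    q = video_tokens.unsqueeze(0)        # [1, N, d]
    k = v = blob_tokens.unsqueeze(0)     # [1, M, d]
    out = F.scaled_dot_product_attention(q, k, v, attn_mask=keep.unsqueeze(0))
    return out.squeeze(0)                # [N, d]
```

Because the mask spans tokens from all frames jointly (hence "3D"), an object's visual tokens stay tied to the same blob description over time, which is what drives the spatial-temporal consistency.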
Experimental results in the paper show that BlobGEN-Vid not only improves traditional quality metrics such as FVD (Fréchet Video Distance) but also excels on controllability measures such as mean Intersection over Union (mIoU) and CLIP-based semantic alignment scores. These metrics indicate that the model markedly improves how precisely and reliably generated videos conform to complex textual inputs.
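For reference, mIoU simply averages per-object overlap between predicted and target regions; the snippet below shows the metric's definition on boolean masks. The exact evaluation protocol (which detector produces the predicted masks, how frames are sampled) is not reproduced here.

```python
import numpy as np

def iou(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    """Intersection over Union between two boolean masks of the same shape."""
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return float(inter) / float(union) if union > 0 else 0.0

def mean_iou(pred_masks, target_masks) -> float:
    """Average IoU over all (frame, object) pairs of boolean masks."""
    scores = [iou(p, t) for p, t in zip(pred_masks, target_masks)]
    return sum(scores) / len(scores) if scores else 0.0
```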
Moreover, BlobGEN-Vid stands out for its use of LLMs to generate blob layouts directly from text, turning a user's complex scene description into an explicit, editable plan. Combined with the controllable generation backbone, this positions BlobGEN-Vid as a competitive alternative to proprietary systems: it surpasses them on compositional benchmarks such as T2V-CompBench and TC-Bench, which stress aspects like dynamic binding and object relational accuracy.
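As a hedged sketch of what LLM-based layout planning could look like, the snippet below asks a language model to emit per-frame ellipse parameters as JSON and parses them into blob dictionaries. The prompt wording, the JSON schema, and the `llm_complete` callable are placeholders, not the paper's actual pipeline.

```python
import json

LAYOUT_PROMPT = """You are a video layout planner.
Given a scene description, output JSON: a list of frames, where each frame is a
list of objects with fields cx, cy, width, height, angle (normalized tilted-ellipse
parameters) and description (a short phrase for the object's appearance).

Scene: {scene}
Number of frames: {num_frames}
"""

def plan_blob_layout(scene: str, num_frames: int, llm_complete) -> list:
    """llm_complete is a stand-in for whatever chat/completion call is available;
    it takes a prompt string and returns the model's text response."""
    prompt = LAYOUT_PROMPT.format(scene=scene, num_frames=num_frames)
    response = llm_complete(prompt)
    return json.loads(response)  # one list of blob dicts per frame
```

The resulting layout can then be handed to the blob-conditioned generator, or edited by hand before generation.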
The implications of BlobGEN-Vid are broad, with potential benefits for any application that synthesizes detailed video content from descriptions, from creative media production to autonomous systems that depend on advanced scene understanding. The research is a substantial contribution to both theoretical and applied models of video synthesis, and it may set new standards for how AI systems translate textual input into visual narratives.
Future work might build on BlobGEN-Vid by making blob video representation extraction more efficient, refining blob-grounding strategies, or aligning such models more closely with cognitive frameworks of scene understanding for context-dependent visual generation tasks.