- The paper introduces a novel style extraction module and motion adapter that enhance style consistency while preserving video content.
- It employs a dual cross-attention strategy with contrastive learning to achieve superior ArtFID scores and dynamic coherence compared to current methods.
- The authors propose a gray tile ControlNet for precise content control, effectively minimizing content leakage in stylized video translation.
Overview of the Paper "StyleMaster: Stylize Your Video with Artistic Generation and Translation"
This paper presents an approach to video style transfer that emphasizes local texture preservation and effective style extraction. The authors introduce StyleMaster, a framework that generates stylized videos while maintaining temporal coherence and high fidelity to both style and content.
Key Contributions
The authors identify several limitations in existing video stylization methods, such as inadequate style consistency and content leakage. To address these, the paper makes the following key contributions:
- Style Extraction Module: A novel style extraction mechanism is introduced that leverages both local and global image features. Local patch features are selected for their low similarity to the text prompt, preserving texture cues while avoiding content leakage (a minimal sketch of this selection step follows this list). Global features are learned through a contrastive framework on a paired dataset generated via "model illusion," which guarantees style consistency within each pair.
- Motion Adapter: The paper proposes the use of a lightweight motion adapter trained on still videos to bridge the gap in stylization between static images and dynamic videos, enhancing the temporal quality without requiring real video datasets for training.
- Gray Tile ControlNet: To improve content control in stylized video translation, the authors design a gray tile ControlNet that provides a simplified yet precise content-guidance signal, avoiding the drawbacks of earlier methods that relied too heavily on depth information (a grayscale-conditioning sketch also follows this list).
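A minimal sketch of the local style-token selection idea, assuming CLIP-style patch embeddings and a pooled text embedding; the function name, the keep ratio, and the exact cosine-similarity criterion here are illustrative, not the paper's implementation:

```python
import torch
import torch.nn.functional as F

def select_style_tokens(patch_tokens, text_embedding, keep_ratio=0.5):
    """Keep the patch tokens least similar to the text prompt.

    patch_tokens:   (N, D) patch-level image embeddings (e.g., from a CLIP vision encoder)
    text_embedding: (D,)   pooled prompt embedding from the paired text encoder
    keep_ratio:     fraction of patches kept as local style cues (illustrative value)
    """
    patches = F.normalize(patch_tokens, dim=-1)
    text = F.normalize(text_embedding, dim=-1)
    # High similarity to the prompt suggests content described by the text;
    # low similarity suggests texture/style, which is what we want to keep.
    sim = patches @ text                               # (N,)
    k = max(1, int(keep_ratio * patches.shape[0]))
    _, idx = torch.topk(sim, k, largest=False)         # least text-like patches
    return patch_tokens[idx]                           # (k, D) local style tokens
```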
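Similarly, the gray tile conditioning can be pictured as feeding the ControlNet a color-stripped version of each content frame, so structure is preserved while the source palette cannot leak into the result. This is only a plausible preprocessing sketch; the authors' exact conditioning may differ:

```python
import torch

def gray_tile_condition(frames: torch.Tensor) -> torch.Tensor:
    """Build a grayscale content condition from video frames.

    frames: (T, 3, H, W) tensor in [0, 1]. Color is removed so the ControlNet
    conveys layout and structure without carrying the source video's style.
    """
    # ITU-R BT.601 luma weights
    w = torch.tensor([0.299, 0.587, 0.114], device=frames.device).view(1, 3, 1, 1)
    gray = (frames * w).sum(dim=1, keepdim=True)   # (T, 1, H, W)
    return gray.repeat(1, 3, 1, 1)                 # replicate to 3 channels for the ControlNet
```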
Methodology
The framework combines contrastive learning for global style features with careful selection of local texture cues. Some methodological highlights:
- Model Illusion for Dataset Creation: The authors exploit the "illusion property" of pixel-rearranged image pairs: two images built from the same pixels share identical style statistics but different content. This yields training pairs with guaranteed style consistency, making the global style extractor straightforward to train (see the patch-shuffle sketch after this list).
- Dual Cross-Attention Strategy: The framework uses a dual cross-attention strategy in an adapter-based architecture, letting StyleMaster inject the extracted style information while preserving video content (a sketch follows this list).
- Negative Motion Adapter Scale: By setting the motion adapter's scale to a negative value at inference, the model enhances not only motion dynamics but also stylization fidelity, pushing generated results further from the real-world domain for a stronger artistic effect (see the final sketch after this list).
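A minimal sketch of how such style-consistent "illusion" pairs could be constructed by patch rearrangement; the patch size and function name are assumptions for illustration:

```python
import torch

def make_illusion_pair(image: torch.Tensor, patch: int = 32):
    """Create a style-consistent positive pair by rearranging patches of one image.

    image: (C, H, W) tensor, with H and W assumed divisible by `patch`.
    Returns (original, shuffled): both contain exactly the same pixels, so their
    color/texture statistics (style) match, while the spatial content differs.
    """
    c, h, w = image.shape
    gh, gw = h // patch, w // patch
    # Split into a grid of patches: (gh * gw, C, patch, patch)
    patches = image.unfold(1, patch, patch).unfold(2, patch, patch)
    patches = patches.permute(1, 2, 0, 3, 4).reshape(gh * gw, c, patch, patch)
    # Randomly permute patch positions, then stitch the image back together.
    perm = torch.randperm(gh * gw)
    shuffled = patches[perm].reshape(gh, gw, c, patch, patch)
    shuffled = shuffled.permute(2, 0, 3, 1, 4).reshape(c, h, w)
    return image, shuffled
```

Pairs like these can serve as positives in the contrastive objective for the global style extractor, with pairs derived from different source images acting as negatives.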
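The dual cross-attention idea can be sketched as two parallel cross-attention branches, one attending to text tokens and one to the extracted style tokens, whose outputs are added back to the backbone features. The module layout and the style weight below are assumptions, not the authors' exact design:

```python
import torch
import torch.nn as nn

class DualCrossAttentionBlock(nn.Module):
    """Inject text and style through separate cross-attention branches."""

    def __init__(self, dim: int, num_heads: int = 8, style_scale: float = 1.0):
        super().__init__()
        self.text_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.style_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.style_scale = style_scale

    def forward(self, x, text_tokens, style_tokens):
        # x: (B, L, D) video latents; text_tokens, style_tokens: (B, N, D)
        h = self.norm(x)
        text_out, _ = self.text_attn(h, text_tokens, text_tokens)
        style_out, _ = self.style_attn(h, style_tokens, style_tokens)
        # Residual update keeps the original content path intact.
        return x + text_out + self.style_scale * style_out
```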
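Finally, the negative motion-adapter scale amounts to flipping the sign of a residual adapter branch at inference time. A minimal sketch, assuming a simple residual formulation and a placeholder scale value:

```python
def apply_motion_adapter(hidden_states, adapter, scale=-0.5):
    """Blend the motion adapter's output into backbone features.

    With a positive scale (as during training on still videos), the adapter pulls
    results toward the static, realistic domain; a negative scale at inference
    pushes them away from it, which the paper reports improves both motion
    dynamics and stylization strength. The value -0.5 is a placeholder.
    """
    return hidden_states + scale * adapter(hidden_states)
```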
Experimental Results
The paper includes extensive empirical results and comparisons with existing state-of-the-art methods, such as StyleCrafter and VideoComposer, across both image and video tasks, demonstrating clear superiority in style fidelity and content preservation:
- Image Style Transfer: StyleMaster notably surpasses competitors on metrics such as ArtFID (lower is better), indicating a balanced preservation of style resemblance and content accuracy; the metric's standard definition is sketched after this list.
- Video Stylization: The model outperforms baselines in style-video alignment and dynamic quality, showing robustness in maintaining style consistency over time.
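For context, ArtFID (following the metric's original formulation) combines a content-preservation term with a style-distance term, and lower is better; a trivial helper shows the arithmetic:

```python
def artfid(fid: float, lpips: float) -> float:
    """ArtFID = (1 + LPIPS) * (1 + FID); lower is better.

    LPIPS measures content preservation against the source image,
    FID measures distance to the style-reference distribution.
    """
    return (1.0 + lpips) * (1.0 + fid)
```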
Implications and Future Directions
The work has both theoretical and practical implications for artistic style transfer in video, adding tools to the broader domain of computer vision and multimedia applications. Future work could explore the modeling of particle effects and finer motion dynamics, which are not fully addressed in the current framework. Expanding beyond style images to reference video dynamics could also open new directions in immersive, real-time video editing and animation. As the field progresses, incorporating diverse training paradigms and multi-modal datasets could further enhance video generation technologies and their use in creative industries.
Conclusion
The paper presents a comprehensive methodology for video style transfer that preserves style consistency while avoiding content leakage. StyleMaster extends the capabilities of video generation models and could serve as a foundation for future research on dynamic stylization and translation. By addressing longstanding challenges in style transfer, the authors make a significant contribution to the field and pave the way for more coherent video stylization applications.