
ConCLVD: Controllable Chinese Landscape Video Generation via Diffusion Model

Published 19 Apr 2024 in cs.MM | (2404.12903v1)

Abstract: Chinese landscape painting is a gem of Chinese cultural and artistic heritage that showcases the splendor of nature through the deep observations and imaginations of its painters. Limited by traditional techniques, these artworks were confined to static imagery in ancient times, leaving the dynamism of landscapes and the subtleties of artistic sentiment to the viewer's imagination. Recently, emerging text-to-video (T2V) diffusion methods have shown significant promise in video generation, providing hope for the creation of dynamic Chinese landscape paintings. However, challenges such as the lack of specific datasets, the intricacy of artistic styles, and the creation of extensive, high-quality videos pose difficulties for these models in generating Chinese landscape painting videos. In this paper, we propose CLV-HD (Chinese Landscape Video-High Definition), a novel T2V dataset for Chinese landscape painting videos, and ConCLVD (Controllable Chinese Landscape Video Diffusion), a T2V model that utilizes Stable Diffusion. Specifically, we present a motion module featuring a dual attention mechanism to capture the dynamic transformations of landscape imageries, alongside a noise adapter to leverage unsupervised contrastive learning in the latent space. Following the generation of keyframes, we employ optical flow for frame interpolation to enhance video smoothness. Our method not only retains the essence of the landscape painting imageries but also achieves dynamic transitions, significantly advancing the field of artistic video generation. The source code and dataset are available at https://anonymous.4open.science/r/ConCLVD-EFE3.


Summary

  • The paper presents ConCLVD, a diffusion model that transforms static Chinese landscape images into dynamic, high-fidelity videos through trainable motion modules.
  • It integrates a dual-attention mechanism, utilizing versatile and sparse-causal attention to ensure both global coherence and local temporal continuity.
  • The authors developed the CLV-HD dataset and introduced latent-space contrastive learning, along with Sparse Optical Flow Projection (SOFP) interpolation for smoother video transitions.

Introduction

The paper introduces ConCLVD, a novel approach for generating videos in the style of Chinese landscape paintings using text-to-video (T2V) diffusion models. Bridging traditional Chinese art with modern technology, this method employs the Stable Diffusion framework, augmented by a motion module and contrastive learning strategies, to transform static landscape imagery into dynamic video compositions.

Architecture Overview

ConCLVD integrates a trainable motion module and a noise adapter within a frozen Stable Diffusion framework to accommodate dynamic video generation. The design encapsulates a dual-attention mechanism—Versatile Attention and Sparse-Causal Attention—to effectively manage temporal data sequences. Versatile Attention connects every frame in a sequence globally, whereas Sparse-Causal Attention focuses on local causality, ensuring each frame refers only to its predecessor for continuity (Figure 1).

Figure 1: Overview of ConCLVD. Left: the architecture. ConCLVD integrates a trainable motion module based on a frozen Stable Diffusion and introduces a noise adapter to accommodate contrastive learning of noise in latent space. Right: the inference framework.
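The two attention patterns above can be sketched as boolean attention masks. This is a minimal illustration, not the paper's implementation; in particular, whether Sparse-Causal Attention also includes each frame's attention to itself is an assumption here.

```python
import numpy as np

def versatile_mask(num_frames: int) -> np.ndarray:
    # Versatile Attention: every frame may attend to every other frame,
    # giving global temporal context across the whole sequence.
    return np.ones((num_frames, num_frames), dtype=bool)

def sparse_causal_mask(num_frames: int) -> np.ndarray:
    # Sparse-Causal Attention: each frame attends only to itself and its
    # immediate predecessor (self-attention assumed, not stated in the paper).
    mask = np.zeros((num_frames, num_frames), dtype=bool)
    for i in range(num_frames):
        mask[i, i] = True
        if i > 0:
            mask[i, i - 1] = True
    return mask
```

The sparse-causal mask keeps the cost of temporal attention linear in the number of frames, while the versatile mask preserves long-range coherence.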

Motion modules are strategically placed to facilitate the seamless transition between frames, capturing the dynamic elements inherent in Chinese landscape painting videos. This architecture maintains the artistic essence while modeling motion across frames.
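A hypothetical sketch of this placement: a temporal attention block with a residual connection, inserted after each frozen spatial layer so the pre-trained image features pass through unchanged when the module contributes nothing. The class name, shapes, and head count are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

class MotionModule(nn.Module):
    # Illustrative temporal block: self-attention over the frame axis,
    # meant to be inserted after a (frozen) spatial layer of Stable Diffusion.
    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, tokens, channels); attend across frames per token.
        b, f, t, c = x.shape
        h = x.permute(0, 2, 1, 3).reshape(b * t, f, c)
        h_norm = self.norm(h)
        out, _ = self.attn(h_norm, h_norm, h_norm)
        h = h + out  # residual keeps the frozen spatial features intact
        return h.reshape(b, t, f, c).permute(0, 2, 1, 3)
```

The residual form means the frozen backbone's behavior is recovered exactly when the attention output is zero, which is why such modules can be trained without touching the pre-trained weights.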

Dataset Development

The authors introduced CLV-HD, a specialized dataset comprising 1,300 text-video pairs that reflect the stylized diversity of Chinese landscape paintings. This dataset addresses the previous scarcity of resources, enabling models to learn the unique characteristics of this art form. Spanning both traditional and modern aesthetics, it serves as a pivotal resource for advancing artistic video synthesis.

Methodological Innovations

A prominent feature of ConCLVD is its incorporation of contrastive learning within the latent space. Consecutive noise vectors, corresponding to sequential video frames, are treated as positive pairs, while widely spaced frames form negative pairs, enhancing the model's ability to discern temporal nuances through unsupervised learning methods (Figure 2).

Figure 2: Design of Motion Module. The motion module is inserted following each image layer of the pre-trained SD to process video data.
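The positive/negative pairing described above can be illustrated with an InfoNCE-style loss over latent noise vectors. This is a generic contrastive-learning sketch under that pairing assumption, not the authors' exact objective; the function name and temperature value are placeholders.

```python
import numpy as np

def info_nce(anchor, positive, negatives, temperature=0.1):
    # InfoNCE-style loss: the noise of the adjacent frame is the positive,
    # noises of widely spaced frames are the negatives.
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    logits = np.array([cos(anchor, positive)] + [cos(anchor, n) for n in negatives])
    logits /= temperature
    logits -= logits.max()  # numerical stability before softmax
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])  # positive pair sits at index 0
```

Minimizing this loss pulls consecutive-frame noise vectors together in latent space while pushing distant frames apart, which is the temporal signal the paper exploits without supervision.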

Additionally, the introduction of Sparse Optical Flow Projection (SOFP) for frame interpolation effectively improves video smoothness. SOFP uses optical flow projections to interpolate between frames, allowing ConCLVD to produce coherent and fluid video sequences (Figure 3).

Figure 3: Detailed explanation of the attention mechanism. The above represents Versatile Attention, where each frame is related to every other frame; the below represents Sparse-Causal Attention, where each frame only focuses on its previous frame.
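The flow-based interpolation idea can be sketched as warping a keyframe a fraction of the way along its optical flow toward the next keyframe. This is a deliberately simplified nearest-neighbor sketch, not the authors' SOFP algorithm (which operates on sparse flow); the function signature is illustrative.

```python
import numpy as np

def interpolate_frame(frame_a, flow_a_to_b, t=0.5):
    # Warp frame_a by fraction t of the optical flow toward frame_b.
    # flow_a_to_b has shape (H, W, 2) with (dx, dy) per pixel.
    h, w = frame_a.shape[:2]
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Backward sampling: for each target pixel, look up its source in frame_a.
    src_x = np.clip(np.round(xs - t * flow_a_to_b[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.round(ys - t * flow_a_to_b[..., 1]).astype(int), 0, h - 1)
    return frame_a[src_y, src_x]
```

Generating only keyframes with the diffusion model and filling the gaps by flow warping is what lets the method produce longer, smoother videos at modest compute cost.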

Experimental Validation

The experimental framework was deployed on an NVIDIA RTX 3090, showcasing ConCLVD's capability to produce high-definition video content with limited computational resources. The video outputs achieved high fidelity in terms of ink style representation and dynamic scene transitions consistent with traditional Chinese landscapes.

Quantitative comparisons against baseline models highlighted ConCLVD's superior performance on metrics such as temporal consistency and stylistic authenticity. Noteworthy is its ability to replicate subtle artistic nuances across video sequences, outperforming prior models in aesthetic preservation and text alignment (Figure 4).

Figure 4: Illustration of SOFP Interpolation. V_{t→T} denotes the optical flow used for interpolation between frames.

Conclusion

ConCLVD represents a significant advancement in the fusion of Chinese landscape art with digital video synthesis, enhancing cultural preservation through contemporary mediums. Its novel approach via diffusion models, augmented motion modules, and dynamic interpolation strategies extends the boundaries of artistic video generation. The release of the CLV-HD dataset and the open-source nature of ConCLVD promise further exploration and development in the field, reinforcing its role as an invaluable tool for bridging traditional artistry with emerging technologies.
