- The paper introduces a novel video diffusion framework using Sketch-guided ControlNet and Reference Attention to achieve high-quality colorized animations.
- The paper employs a sequential sampling mechanism with Overlapped Blending and Prev-Reference Attention to ensure long-term temporal consistency.
- The paper demonstrates superior performance over existing methods on FID, FVD, PSNR, LPIPS, SSIM, and a newly introduced Temporal Consistency metric.
LVCD: Reference-based Lineart Video Colorization with Diffusion Models
The paper "LVCD: Reference-based Lineart Video Colorization with Diffusion Models" presents a novel framework aimed at addressing the challenges in the automatic colorization of lineart animation videos. Unlike traditional methods, which predominantly rely on generative models to individually colorize frames, this paper leverages a video diffusion model to improve temporal consistency and handle significant motions more effectively.
Overview and Key Contributions
The primary contributions of this work are threefold:
- Video Diffusion Model: The authors introduce the first video diffusion framework for reference-based lineart video colorization. Utilizing a pretrained video diffusion model, the framework generates high-quality, temporally consistent animations that can accommodate large motions.
- Sketch-guided ControlNet and Reference Attention: The Sketch-guided ControlNet extends the existing ControlNet architecture, adding lineart sketch control to the model (an illustrative conditioning sketch follows this list). Reference Attention is introduced to facilitate long-range spatial matching, enabling color propagation across frames with substantial motion.
- Sequential Sampling Mechanism: To tackle the challenge of generating extended animations, a novel sequential sampling method is proposed. This method uses Overlapped Blending and Prev-Reference Attention to ensure that long-term temporal consistency is maintained throughout the animation.
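To make the ControlNet-style conditioning concrete, here is a minimal, illustrative sketch of how a sketch-conditioned branch can be injected as a residual into denoiser features. The module sizes, names, and zero-initialized 1x1 projection follow the general ControlNet recipe and are assumptions for illustration, not the paper's architecture:

```python
import torch
import torch.nn as nn

class ZeroConv(nn.Conv2d):
    """1x1 convolution initialized to zero, so the control branch contributes
    nothing at the start of training and the pretrained model's behavior is kept."""
    def __init__(self, ch):
        super().__init__(ch, ch, kernel_size=1)
        nn.init.zeros_(self.weight)
        nn.init.zeros_(self.bias)

class ToySketchControl(nn.Module):
    """Minimal ControlNet-style branch: encode a lineart sketch and add it as a
    residual to denoiser features. Illustrative only; the paper's Sketch-guided
    ControlNet operates on the SVD video U-Net."""
    def __init__(self, feat_ch=64):
        super().__init__()
        self.sketch_encoder = nn.Sequential(
            nn.Conv2d(1, feat_ch, 3, padding=1), nn.SiLU(),
            nn.Conv2d(feat_ch, feat_ch, 3, padding=1),
        )
        self.zero_conv = ZeroConv(feat_ch)

    def forward(self, unet_features, sketch):
        # Residual conditioning: the sketch supplies layout and structure.
        return unet_features + self.zero_conv(self.sketch_encoder(sketch))

# Toy usage: at initialization the output equals the unconditioned features.
control = ToySketchControl()
feats = torch.randn(2, 64, 32, 32)      # features from some denoiser block
sketch = torch.rand(2, 1, 32, 32)       # lineart frame at matching resolution
conditioned = control(feats, sketch)
```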
Methodology
The framework builds on Stable Video Diffusion (SVD), which is composed of a Variational Autoencoder (VAE) and a U-Net: the VAE encoder maps raw video frames into a latent space, and the U-Net is fine-tuned to denoise sequences of these latents.
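To make that data flow concrete, the following sketch pairs a toy VAE with a placeholder denoiser inside an Euler-style sampling loop. All sizes, the sampler, and the stand-in modules are illustrative assumptions; the real SVD components are large pretrained networks:

```python
import torch
import torch.nn as nn

class ToyVAE(nn.Module):
    """Stand-in for SVD's VAE: maps RGB frames to an 8x-downsampled latent grid
    and back. The real autoencoder is far deeper and uses pretrained weights."""
    def __init__(self, latent_ch=4):
        super().__init__()
        self.encode = nn.Conv2d(3, latent_ch, kernel_size=8, stride=8)
        self.decode = nn.ConvTranspose2d(latent_ch, 3, kernel_size=8, stride=8)

def euler_sample(denoiser, shape, num_steps=25):
    """Simplified Euler sampling loop over a whole sequence of frame latents.
    The denoiser sees all T latents at once, which is what lets a video U-Net
    keep the frames temporally consistent."""
    sigmas = torch.linspace(1.0, 0.0, num_steps + 1)
    x = torch.randn(shape) * sigmas[0]                  # (T, C, h, w) noise
    for i in range(num_steps):
        denoised = denoiser(x, sigmas[i])               # predicted clean latents
        x = x + (x - denoised) / sigmas[i] * (sigmas[i + 1] - sigmas[i])
    return x

# Toy usage: encode a clip, "denoise" its latents, decode back to RGB frames.
vae = ToyVAE()
frames = torch.rand(14, 3, 256, 256)                    # a 14-frame clip
latents = vae.encode(frames)                            # (14, 4, 32, 32)
denoiser = lambda x, sigma: torch.zeros_like(x)         # placeholder for the U-Net
decoded = vae.decode(euler_sample(denoiser, latents.shape))
```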
The authors address several key challenges:
- Incorporating Lineart Control: By extending the ControlNet to operate in a video context, the Sketch-guided ControlNet integrates lineart sketches, providing the layout and structure necessary for animation.
- Handling Large Motions: Reference Attention replaces the original spatial attention layers in SVD to support long-range spatial matching, allowing the model to colorize frames that move significantly relative to the reference frame (a single-head sketch of this attention follows this list).
- Generating Long Animations: SVD's restriction to fixed-length sequences is overcome by the sequential sampling mechanism. Overlapped Blending and Prev-Reference Attention are integrated to maintain temporal consistency across consecutive segments of the animation (a simplified blending sketch also appears after this list).
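A minimal, single-head sketch of the reference-attention idea: queries come from the current frame, while keys and values are drawn from both the current frame and the reference frame, so colors can be matched across large displacements. The shapes, projections, and single-head simplification are assumptions, not the authors' exact layer:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def reference_attention(frame_tokens, ref_tokens, to_q, to_k, to_v):
    """Each frame's spatial tokens attend jointly over themselves and the
    reference frame's tokens, enabling long-range color matching.
    frame_tokens, ref_tokens: (B, N, C) flattened spatial features."""
    q = to_q(frame_tokens)                          # queries from the frame only
    kv = torch.cat([frame_tokens, ref_tokens], dim=1)
    k, v = to_k(kv), to_v(kv)                       # keys/values include the reference
    return F.scaled_dot_product_attention(q, k, v)  # (B, N, C)

# Toy usage with random features and same-size linear projections.
B, N, C = 2, 16 * 16, 64
to_q, to_k, to_v = nn.Linear(C, C), nn.Linear(C, C), nn.Linear(C, C)
out = reference_attention(torch.randn(B, N, C), torch.randn(B, N, C), to_q, to_k, to_v)
```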
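The sequential sampling idea can likewise be sketched in simplified form. Here consecutive fixed-length segments share a few overlapping frames whose outputs are averaged; the segment length, overlap, and averaging rule are illustrative choices, and Prev-Reference Attention (which additionally attends to frames from the previous segment) is not reproduced:

```python
import torch

def sample_long_video(sample_segment, sketches, seg_len=14, overlap=4):
    """Sequentially colorize a long sketch sequence in fixed-length segments.
    Consecutive segments share `overlap` frames; the shared frames are blended
    (here: averaged) so the transition stays temporally consistent.
    sample_segment(seg_sketches) -> (len(seg_sketches), 3, H, W) is an assumed
    callable wrapping one fixed-length diffusion sampling pass."""
    T = sketches.shape[0]
    frames = sample_segment(sketches[:seg_len])
    start = seg_len - overlap
    while start + overlap < T:
        seg = sample_segment(sketches[start:start + seg_len])
        # Overlapped Blending (simplified): fuse frames shared with the previous
        # segment, then append only the genuinely new frames.
        frames[-overlap:] = 0.5 * (frames[-overlap:] + seg[:overlap])
        frames = torch.cat([frames, seg[overlap:]], dim=0)
        start += seg_len - overlap
    return frames[:T]

# Toy usage: a placeholder sampler that just tiles each sketch into RGB.
sketches = torch.rand(40, 1, 64, 64)
colored = sample_long_video(lambda s: s.repeat(1, 3, 1, 1), sketches)
```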
Results
Extensive experiments demonstrate the effectiveness of the proposed method. FID, FVD, PSNR, LPIPS, SSIM, and a newly introduced Temporal Consistency (TC) metric show that the method outperforms state-of-the-art techniques across frame quality, frame similarity, sketch alignment, and temporal consistency. In particular, it excels at generating long, temporally consistent animations, a task that previous frameworks could not accomplish efficiently.
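As a concrete example of how the per-frame reconstruction metrics are scored, here is the standard per-video PSNR. This is the textbook definition rather than the paper's evaluation code; FID, FVD, and LPIPS require pretrained feature extractors and are usually computed with existing libraries, and the TC metric is not reproduced here:

```python
import torch

def video_psnr(pred, target, max_val=1.0):
    """Mean per-frame PSNR over a video. pred, target: (T, C, H, W) in [0, max_val]."""
    mse = ((pred - target) ** 2).flatten(1).mean(dim=1).clamp_min(1e-12)
    return (10.0 * torch.log10(max_val ** 2 / mse)).mean()

# Toy usage on random frames (real evaluation compares generated vs. ground-truth clips).
print(video_psnr(torch.rand(14, 3, 256, 256), torch.rand(14, 3, 256, 256)))
```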
Implications and Future Directions
This work has significant practical and theoretical implications. Practically, it offers a tool that can vastly improve productivity in the animation industry by automating the tedious task of frame-by-frame colorization while maintaining high quality and consistency. Theoretically, it pushes forward the boundaries of what can be achieved using diffusion models in the domain of video generation, showcasing their potential to handle complex tasks involving large motions and long sequences.
The paper also opens up multiple avenues for future research. One potential direction is to generalize the framework to other modalities such as edge detection, depth maps, and normal maps, thereby expanding the range of applications. Additionally, incorporating larger, more diverse datasets could further refine the model’s ability to generate high-quality animations across a broader spectrum of styles and contents.
Conclusion
The paper "LVCD: Reference-based Lineart Video Colorization with Diffusion Models" presents a significant advancement in the field of animation video colorization. By leveraging a video diffusion model and introducing innovations such as Sketch-guided ControlNet, Reference Attention, and a novel sequential sampling mechanism, the method achieves high-quality, temporally consistent colorized animations. This work not only provides practical solutions for current limitations in animation production but also sets the stage for future developments in the application of diffusion models to various video generation tasks.