L-C4: Language-Based Video Colorization for Creative and Consistent Colors (2410.04972v2)
Abstract: Automatic video colorization is inherently an ill-posed problem because each monochrome frame admits multiple plausible color candidates. Previous exemplar-based video colorization methods restrict the user's imagination due to the elaborate exemplar retrieval process. Alternatively, conditional image colorization methods combined with post-processing algorithms still struggle to maintain temporal consistency. To address these issues, we present Language-based video Colorization for Creative and Consistent Colors (L-C4) to guide the colorization process with user-provided language descriptions. Our model is built upon a pre-trained cross-modality generative model, leveraging its comprehensive language understanding and robust color representation abilities. We introduce a cross-modality pre-fusion module to generate instance-aware text embeddings, enabling the application of creative colors. Additionally, we propose temporally deformable attention to prevent flickering or color shifts, and cross-clip fusion to maintain long-term color consistency. Extensive experimental results demonstrate that L-C4 outperforms relevant methods, achieving semantically accurate colors, unrestricted creative correspondence, and temporally robust consistency.
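To make the abstract's cross-modality pre-fusion idea concrete, below is a minimal, hypothetical PyTorch sketch of one way such a module could work: text token embeddings attend to per-frame visual features so that each token becomes "instance-aware" before conditioning the generative backbone. The module name, dimensions, single-block design, and use of standard multi-head cross-attention are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (NOT the authors' code): refine text token embeddings with
# per-frame visual features via cross-attention, yielding instance-aware text
# embeddings. All names and dimensions are assumptions for illustration.
import torch
import torch.nn as nn

class CrossModalityPreFusion(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        # Text tokens act as queries; visual tokens act as keys/values,
        # so each text token gathers evidence about the instance it describes.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_text = nn.LayerNorm(dim)
        self.norm_vis = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, text_tokens: torch.Tensor, visual_tokens: torch.Tensor) -> torch.Tensor:
        # text_tokens:   (B, L_text, dim), e.g., CLIP text encoder outputs
        # visual_tokens: (B, L_vis,  dim), e.g., flattened frame features
        q = self.norm_text(text_tokens)
        kv = self.norm_vis(visual_tokens)
        fused, _ = self.attn(q, kv, kv)
        x = text_tokens + fused          # residual: keep original language semantics
        return x + self.ffn(x)           # instance-aware text embeddings

if __name__ == "__main__":
    fusion = CrossModalityPreFusion()
    text = torch.randn(2, 77, 768)       # 2 captions, 77 tokens each
    vis = torch.randn(2, 256, 768)        # e.g., a 16x16 patch grid per frame
    print(fusion(text, vis).shape)        # torch.Size([2, 77, 768])
```

The refined embeddings would then replace the plain text embeddings wherever the backbone consumes language conditioning, so color words can bind to the specific objects they describe rather than to the scene as a whole.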