- The paper introduces a novel framework that integrates Diffusion Transformers with a High-level Color Extractor (HCE) and a Low-level Color Guider (LCG) to achieve precise animation colorization.
- It employs a sophisticated attention mechanism and a multi-stage training strategy to separately optimize geometric and color controls, resulting in improved PSNR, SSIM, and FVD metrics.
- Experimental validations and ablation studies confirm AnimeColor’s superior performance over existing methods, highlighting its practical benefits for animation production.
Introduction to AnimeColor
"AnimeColor: Reference-based Animation Colorization with Diffusion Transformers" presents a novel framework leveraging Diffusion Transformers (DiT) for animation colorization. Addressing the inherent challenges of color accuracy and temporal consistency in animation production, the paper introduces a methodology that integrates reference images and sketch sequences into a DiT-based model, facilitating controlled animation generation. The work proposes two pivotal components—the High-level Color Extractor (HCE) for semantic color consistency and the Low-level Color Guider (LCG) for fine-grained color precision—to guide the video diffusion process effectively.
Figure 1: Illustration of the difference between our proposed AnimeColor and other diffusion-based animation colorization methods (CN denotes ControlNet).
Methodological Insights
Framework Utilization
AnimeColor employs Diffusion Transformers to enrich video generation capabilities. The architecture, outlined in Figure 2, begins by concatenating sketch latents with noise as the DiT input, yielding sketch-conditioned video generation. High-level and low-level color references, integrated via the HCE and LCG, improve color reproduction in both semantic alignment and detail-level precision.
Figure 2: The framework of our proposed AnimeColor.
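The following is a minimal PyTorch sketch of the conditioning flow just described: sketch latents concatenated with noise along the channel axis, an HCE producing semantic color tokens, and an LCG producing fine-grained reference tokens. All module names, shapes, and dimensions here are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class AnimeColorSkeleton(nn.Module):
    """Illustrative skeleton of the AnimeColor conditioning flow.

    Assumed shapes: noisy video latents and sketch latents are both
    (B, T, C, H, W); they are concatenated on the channel axis before
    entering the DiT backbone (not shown here).
    """

    def __init__(self, latent_dim=4, hidden_dim=1152, ref_feat_dim=768):
        super().__init__()
        # Projects channel-concatenated [noise; sketch] latents to tokens.
        self.proj_in = nn.Linear(2 * latent_dim, hidden_dim)
        # High-level Color Extractor: turns reference-image features
        # (e.g. from a pretrained encoder) into semantic color tokens.
        self.hce = nn.Linear(ref_feat_dim, hidden_dim)
        # Low-level Color Guider: patchifies the reference image into
        # fine-grained color tokens.
        self.lcg = nn.Conv2d(3, hidden_dim, kernel_size=16, stride=16)

    def forward(self, noisy_latents, sketch_latents, ref_features, ref_image):
        # 1) Sketch conditioning: concatenate sketch latents with noise.
        x = torch.cat([noisy_latents, sketch_latents], dim=2)  # (B,T,2C,H,W)
        b, t, c, h, w = x.shape
        vision_tokens = self.proj_in(
            x.permute(0, 1, 3, 4, 2).reshape(b, t * h * w, c)
        )
        # 2) Semantic color tokens, injected via cross-attention in DiT.
        high_tokens = self.hce(ref_features)                    # (B, N_hi, D)
        # 3) Fine-grained color tokens, joined via self-attention in DiT.
        low_tokens = self.lcg(ref_image).flatten(2).transpose(1, 2)
        return vision_tokens, high_tokens, low_tokens
```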
Attention Mechanism
The method integrates color control through attention, as depicted in the upper-left corner of Figure 2: low-level reference tokens are concatenated with the vision and text tokens, and dedicated self-attention and cross-attention blocks deliver fine-grained and semantic color control, respectively, without introducing inconsistencies.
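Below is a minimal sketch of this dual attention pattern using standard PyTorch multi-head attention; the block structure, dimensions, and token ordering are assumptions for illustration rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class ColorControlBlock(nn.Module):
    """One DiT-style block with the dual attention pattern described
    above (illustrative sketch, not the authors' code)."""

    def __init__(self, dim=1152, heads=16):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, vision_tokens, low_tokens, high_tokens, text_tokens):
        # Self-attention over [vision; low-level reference] so fine color
        # details can flow to spatially matching sketch regions.
        n = vision_tokens.size(1)
        joint = self.norm1(torch.cat([vision_tokens, low_tokens], dim=1))
        attn_out, _ = self.self_attn(joint, joint, joint)
        vision_tokens = vision_tokens + attn_out[:, :n]  # keep vision part
        # Cross-attention to [text; high-level color] tokens for semantics.
        ctx = torch.cat([text_tokens, high_tokens], dim=1)
        attn_out, _ = self.cross_attn(self.norm2(vision_tokens), ctx, ctx)
        return vision_tokens + attn_out
```

In a full model, blocks like this would be stacked, with the vision tokens eventually decoded back into video latents.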
Multi-stage Training Strategy
A four-stage training strategy decouples geometric and color controls during optimization. Each module learns its function in turn, avoiding excessive training cost while preserving temporal consistency and color accuracy.
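The exact stage contents are not reproduced here; the sketch below shows the general pattern of such a schedule, freezing everything except the modules each stage owns. Stage assignments and module names (reusing the skeleton above) are hypothetical.

```python
import torch

# Illustrative four-stage schedule: each stage trains one control pathway
# while the rest stay frozen, so geometric (sketch) and color controls do
# not interfere. These assignments are assumptions, not the paper's recipe.
STAGES = [
    {"name": "sketch_control",   "trainable": ["proj_in"]},
    {"name": "high_level_color", "trainable": ["hce"]},
    {"name": "low_level_color",  "trainable": ["lcg"]},
    {"name": "joint_finetune",   "trainable": ["proj_in", "hce", "lcg"]},
]

def configure_stage(model, stage):
    """Freeze all parameters, then unfreeze the modules this stage owns."""
    for p in model.parameters():
        p.requires_grad_(False)
    for module_name in stage["trainable"]:
        for p in getattr(model, module_name).parameters():
            p.requires_grad_(True)
    # The optimizer only sees the parameters trainable in this stage.
    return torch.optim.AdamW(
        (p for p in model.parameters() if p.requires_grad), lr=1e-5
    )
```

Training would then loop over STAGES, calling configure_stage before running each stage's epochs.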
Experimental Validation
Qualitative Comparisons
AnimeColor demonstrates superior performance compared to alternative methods such as CN+IPA, AniDoc, ToonCrafter, and LVCD across a range of challenging scenarios, as illustrated in Figures 3 and 4. The model excels in maintaining color consistency and visual quality in scenes characterized by large motion and varied sketch inputs.

Figure 3: Qualitative comparison of reference-based colorization with CN+IPA, Concat, AniDoc, ToonCrafter, and LVCD.
Figure 4: Temporal profile comparison of different animation colorization methods.
Quantitative Metrics
The paper reports significant gains in PSNR and SSIM and a lower (better) FVD relative to existing methods, indicating enhanced color accuracy and visual quality. Table 1 summarizes these results, reinforcing the efficacy of the proposed framework.
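As a reference point for these metrics, PSNR has a simple closed form and can be computed directly, as sketched below; SSIM and FVD are typically taken from library implementations (e.g. torchmetrics for SSIM, pretrained I3D features for FVD) and are not reproduced here.

```python
import torch

def psnr(pred, target, max_val=1.0):
    """PSNR between predicted and ground-truth frames in [0, max_val].

    PSNR = 10 * log10(max_val^2 / MSE); higher values indicate frames
    closer to the ground-truth colorization.
    """
    mse = torch.mean((pred - target) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)
```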
User Preferences
A user study (Figure 5) corroborated the model's effectiveness, with AnimeColor preferred by a substantial portion of participants, suggesting its practical applicability in animation production settings.
Figure 5: The results of the user study, where AnimeColor is preferred by the majority of participants.
Ablation Studies
Component Efficacy
Ablation experiments affirm the importance of HCE and LCG, showing marked improvements in performance metrics when these components are integrated. Figures 6, 7, and 8 illustrate how each module contributes to the model's overall success.
Figure 6: Qualitative comparison for the ablation study on model architecture.
Figure 7: Qualitative comparison for the ablation study on sketch conditional modeling.
Figure 8: Qualitative comparison for the ablation study on the image encoder in the High-level Color Extractor.
Applications
AnimeColor's flexibility is demonstrated through diverse applications, including varying sketch references, different lineart extraction methods, and natural images as references, as shown in Figures 9, 10, and 11. This adaptability highlights its potential for broader deployment in animation studios and creative projects.
Figure 9: Illustration of flexible application under the "Same Reference Image and Different Sketches" and "Same Sketches and Different Reference Images" settings.
Figure 10: Impact of different lineart extraction methods.
Figure 11: Colorization performance with natural images as reference.
Conclusion
AnimeColor presents a robust framework for animation colorization, leveraging the generative capabilities of DiT to address longstanding challenges of color accuracy and temporal consistency. Its integration of high-level and low-level color references ensures precise control, making it a valuable tool for animation studios seeking efficient, high-quality production. The release of the framework under an open-source license will likely spur further innovation in the domain, fostering advances in both industrial applications and academic research.