AnimeColor Diffusion Framework
- AnimeColor is a reference-based animation colorization framework that uses Diffusion Transformers and dual-level color cues to achieve temporally consistent video generation.
- It separates sketch structure from color transfer via a four-stage training process, employing both a High-level Color Extractor and a Low-level Color Guider for semantic and detailed color control.
- Empirical evaluations show superior color accuracy, sketch alignment, and temporal stability compared to previous methods, making it ideal for advanced animation production.
AnimeColor is a reference-based animation colorization framework that leverages Diffusion Transformers (DiT) and dual-level color guidance to generate temporally consistent, sketch-controlled animation videos. Designed to address the challenges of accurate color transfer and temporal consistency in animation production, AnimeColor incorporates both high-level semantic and low-level detailed color cues from a reference image, applied to a sequence of input sketches. The architecture combines a DiT-based video diffusion backbone with specialized modules for color extraction and a multi-stage training regimen that clearly separates sketch and color control, resulting in a system that outperforms prior state-of-the-art methods in terms of color accuracy, alignment to sketch structure, and temporal stability (Zhang et al., 27 Jul 2025).
1. Framework Architecture and Workflow
AnimeColor accepts as input a single reference image and a sequence of binary sketch images corresponding to the frames of the target animation. The sketches are encoded using a VAE encoder and concatenated with a noise latent, forming the latent input to a Diffusion Transformer. The reference image is independently processed via two modules: the High-level Color Extractor (HCE) and the Low-level Color Guider (LCG), each capturing different levels of chromatic information. Their outputs are injected as cross-attention inputs into designated layers of the DiT backbone.
The overall animation generation pipeline is as follows:
- Encoding: Sketch sequences are encoded and combined with Gaussian noise to prepare input latents for denoising.
- Color Feature Extraction: The reference image is processed by the HCE and LCG to obtain high-level semantic color tokens (Fₕ) and low-level detail tokens (intermediate representations from a copy of the DiT).
- Conditioned Generation: The DiT-based diffusion model, during denoising, receives as conditioning both the sketch structure and the dual-level color guidance, producing each video frame.
- Multi-Stage Training: The system is trained in four stages, alternately optimizing the main DiT, HCE, and LCG modules, before joint fine-tuning for maximal synergy and temporal consistency.
This architecture enables the model to disentangle geometric guidance (from sketches) and color/style control (from the reference), maximizing both color faithfulness and animation fidelity.
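The following Python sketch illustrates this pipeline end to end. It is a minimal sketch under assumed interfaces: `vae`, `dit`, `hce`, `lcg`, and `scheduler` are hypothetical objects standing in for the paper's components, not a released API.

```python
# Minimal sketch of the AnimeColor inference flow described above.
# All module names, signatures, and tensor shapes are illustrative assumptions.
import torch

@torch.no_grad()
def colorize(sketches, reference, vae, dit, hce, lcg, scheduler, text_emb):
    # sketches: (F, 1, H, W) binary sketch frames; reference: (1, 3, H, W) color image
    sketch_lat = vae.encode(sketches)             # per-frame sketch latents (F, C, h, w)
    latents = torch.randn_like(sketch_lat)        # Gaussian noise latents to be denoised

    high_tokens = hce(reference)                  # high-level semantic color tokens F_h
    low_tokens = lcg(reference)                   # low-level detail tokens

    for t in scheduler.timesteps:                 # iterative denoising
        model_in = torch.cat([latents, sketch_lat], dim=1)  # sketch conditioning by concat
        noise_pred = dit(model_in, t, text_emb, high_tokens, low_tokens)
        latents = scheduler.step(noise_pred, t, latents)     # scheduler API assumed

    return vae.decode(latents)                    # decoded RGB animation frames
```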
2. Diffusion Transformer Backbone
The AnimeColor framework replaces traditional U-Net-based backbones with Diffusion Transformers, which have shown superior generative and temporal reasoning capabilities in video diffusion tasks. The DiT model operates on sequences of noised sketch latents, augmented by:
- In-Context Learning: DiT employs long-range attention, effectively propagating both sketch structure and reference color information across all frames, crucial for maintaining visual continuity in large motion scenes.
- Multi-Modal Attention: DiT receives, at each layer, a concatenation of sketch, color (HCE, LCG), and text tokens, allowing it to jointly reason about structure and style throughout the denoising process.
The DiT attention mechanism undergirds the model’s ability to learn correspondence between temporally adjacent frames and to stably integrate complex, variable-length guidance signals.
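As an illustration of this multi-modal attention pattern, the sketch below shows a single attention step in which the noised video/sketch tokens attend over a concatenation of text, HCE, and LCG tokens. The block structure, dimensions, and head count are assumptions rather than the paper's exact layer design.

```python
# Illustrative sketch of the multi-modal attention context: video tokens attend
# over concatenated text, high-level (HCE), and low-level (LCG) color tokens.
import torch
import torch.nn as nn

class MultiModalDiTBlock(nn.Module):
    def __init__(self, dim: int = 1024, heads: int = 16):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x, text_tok, hce_tok, lcg_tok):
        # x: (B, N_frames * N_patches, dim) noised video/sketch tokens
        context = torch.cat([text_tok, hce_tok, lcg_tok], dim=1)  # (B, N_ctx, dim)
        h = self.norm(x)
        out, _ = self.attn(query=h, key=context, value=context)
        return x + out  # residual update; a full block would add self-attention and an MLP
```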
3. Dual-Level Color Guidance: HCE and LCG
High-level Color Extractor (HCE)
HCE extracts semantic color representations from the reference image using a visual foundation model (RADIO), which outputs a summary feature and spatial features. These are concatenated with a set of learnable color query tokens, and a self-attention Q-Former produces the high-level color tokens Fₕ. This semantic control is then injected into designated DiT layers via cross-attention, with the video tokens as queries and Fₕ supplying the keys and values (see Section 7 for the formula).
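A minimal sketch of the HCE as just described, assuming feature dimensions, query count, and Q-Former depth that the text does not specify: learnable query tokens are concatenated with projected RADIO features, refined by self-attention, and the query positions are returned as Fₕ.

```python
# Hypothetical High-level Color Extractor (HCE): RADIO summary/spatial features
# plus learnable queries pass through self-attention; the queries become F_h.
import torch
import torch.nn as nn

class HighLevelColorExtractor(nn.Module):
    def __init__(self, feat_dim: int = 1280, dim: int = 1024,
                 n_queries: int = 32, depth: int = 4):
        super().__init__()
        self.proj = nn.Linear(feat_dim, dim)                       # project RADIO features
        self.queries = nn.Parameter(torch.randn(1, n_queries, dim) * 0.02)
        layer = nn.TransformerEncoderLayer(dim, nhead=16, batch_first=True)
        self.qformer = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, summary_feat, spatial_feat):
        # summary_feat: (B, 1, feat_dim); spatial_feat: (B, N, feat_dim) from RADIO
        feats = self.proj(torch.cat([summary_feat, spatial_feat], dim=1))
        q = self.queries.expand(feats.size(0), -1, -1)
        tokens = torch.cat([q, feats], dim=1)                      # queries + features
        tokens = self.qformer(tokens)                              # joint self-attention
        return tokens[:, : q.size(1)]                              # F_h: refined query tokens
```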
Low-level Color Guider (LCG)
LCG recovers fine-grained color textures lost to semantic abstraction by extracting intermediate DiT representations from the reference. These detail tokens are merged with text and other vision tokens via self-attention, allowing precise transfer of local colors, edges, and variations.
The HCE and LCG are trained in isolation before joint fine-tuning, ensuring that the model’s layers separately master semantic and detailed color transfer.
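The sketch below illustrates one plausible realization of the LCG feature tap described above, assuming a `patchify` helper, a list of DiT blocks, and arbitrarily chosen tap layers: the reference latent runs through a copy of the DiT, and intermediate hidden states are collected as the low-level detail tokens.

```python
# Hypothetical Low-level Color Guider (LCG) feature tap: intermediate hidden
# states of a DiT copy applied to the reference latent become detail tokens.
import torch

def low_level_color_tokens(reference_latent, dit_blocks, patchify, tap_layers=(2, 5, 8)):
    # reference_latent: (B, C, h, w) VAE latent of the reference image
    tokens = patchify(reference_latent)          # (B, N, dim) patch tokens
    collected = []
    for i, block in enumerate(dit_blocks):       # copy of the DiT backbone
        tokens = block(tokens)
        if i in tap_layers:                      # tap intermediate representations
            collected.append(tokens)
    return torch.cat(collected, dim=1)           # detail tokens merged downstream by self-attention
```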
4. Multi-Stage Training Regimen
AnimeColor’s four-stage strategy is designed to prevent conflicts between geometric and chromatic signals and to decouple their optimization paths:
| Stage | Optimized Modules | Loss | Objective |
|---|---|---|---|
| 1 | Main DiT | ℒ₁ | Sketch-guided video generation |
| 2 | HCE (Q-Former, cross-attention) | ℒ₂ | High-level color extraction |
| 3 | LCG (auxiliary DiT) | ℒ₃ | Low-level color guiding |
| 4 | All modules (joint fine-tuning) | ℒ₄ | Joint integration |
Each loss is a squared error between predicted and true diffusion noise, conditioned on the training-stage-specific set of inputs. The staged optimization avoids collapse of representation or interference between sketch and reference signals.
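A compact sketch of this staged schedule, assuming generic module and scheduler interfaces and assuming that each color module conditions the loss only in its own stage and in the joint stage: every stage minimizes the same noise-prediction objective while freezing all modules outside the stage's trainable set.

```python
# Hypothetical four-stage training schedule: same squared-error noise-prediction
# loss in every stage, with stage-specific trainable modules and conditioning.
import torch
import torch.nn.functional as F

STAGES = {
    1: ["dit"],                  # sketch-guided video generation
    2: ["hce"],                  # high-level color extraction
    3: ["lcg"],                  # low-level color guiding
    4: ["dit", "hce", "lcg"],    # joint fine-tuning
}

def configure_stage(modules, stage, lr=1e-5):
    """Freeze everything outside the stage's trainable set and build its optimizer."""
    trainable = []
    for name, module in modules.items():
        requires = name in STAGES[stage]
        for p in module.parameters():
            p.requires_grad_(requires)
        if requires:
            trainable += list(module.parameters())
    return torch.optim.AdamW(trainable, lr=lr)

def training_step(modules, batch, scheduler, stage):
    """Noise-prediction (squared error) loss with stage-specific conditioning."""
    z0 = modules["vae"].encode(batch["frames"])               # clean video latents
    noise = torch.randn_like(z0)
    t = torch.randint(0, scheduler.num_steps, (z0.size(0),), device=z0.device)
    zt = scheduler.add_noise(z0, noise, t)                    # forward diffusion

    sketch_lat = modules["vae"].encode(batch["sketches"])     # structural conditioning
    model_in = torch.cat([zt, sketch_lat], dim=1)

    color_cond = {}
    if stage in (2, 4):
        color_cond["hce_tokens"] = modules["hce"](batch["reference"])
    if stage in (3, 4):
        color_cond["lcg_tokens"] = modules["lcg"](batch["reference"])

    noise_pred = modules["dit"](model_in, t, **color_cond)
    return F.mse_loss(noise_pred, noise)                      # epsilon-prediction loss
```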
5. Quantitative and Qualitative Performance
AnimeColor is empirically validated on animation colorization benchmarks with the following findings:
- Quantitative Superiority: Measured by PSNR, SSIM, LPIPS, FID, and FVD, AnimeColor achieves higher color accuracy, lower perceptual distortion, and improved temporal consistency compared to baselines such as AniDoc, ToonCrafter, LVCD, and various naive concatenation or IPA-based diffusion models.
- Sketch Alignment: The framework yields better sketch alignment (SA), maintaining structural fidelity to the input sketches throughout every frame.
- Robustness: The dual-control mechanism enables robust colorization in single/multi-character scenes, fast motion, and scenes with newly introduced objects or backgrounds.
- User Preference: A user study confirms that outputs are consistently preferred for visual quality, color control, and temporal flow.
Qualitative comparisons illustrate reduced artifacts, better handling of color consistency across frames, and more faithful adherence to both structure and reference palette.
6. Applications and Broader Implications
AnimeColor’s architecture is immediately applicable in industrial animation colorization pipelines:
- Automation: It enables the colorization of entire animation sequences consistent with a reference palette, substantially reducing manual labor.
- Flexibility: The ability to swap sketch/reference pairs, or to use diverse reference images, allows for rapid prototyping and style adjustments without retraining.
- Generalization: The framework supports not only animation, but also related tasks including line art colorization and digital illustration, extending the DiT-diffusion approach to broader creative domains.
- Industrial Potential: The model’s robustness and performance make it a candidate for integration into content production systems for anime studios, supporting consistent, style-driven output.
7. Relevant Mathematical Formulation
Key formulas from the framework include:
- Cross-Attention in HCE Integration:

$$\mathrm{CrossAttn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V, \qquad K = W_K F_h,\quad V = W_V F_h,$$

where $F_h$ denotes the high-level color tokens extracted from the reference image and the queries $Q$ are projected from the video tokens.
- Training Losses (summarized by stage):

$$\mathcal{L}_i = \mathbb{E}_{z_0,\,\epsilon,\,t}\left[\left\lVert \epsilon - \epsilon_\theta\!\left(z_t,\, t,\, c_i\right)\right\rVert_2^2\right], \qquad i \in \{1, 2, 3, 4\},$$

where $c_i$ is the stage-specific conditioning: the sketch latents alone (stage 1), augmented with the output of $\mathcal{E}_H$ (stage 2) or $\mathcal{E}_L$ (stage 3), and with both in the joint stage (stage 4), with $\mathcal{E}_H$ and $\mathcal{E}_L$ denoting the high- and low-level color extractors.
This multi-layered structuring of conditioning and pretraining ensures the integrity of both color and structure throughout denoising.
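For concreteness, the cross-attention above can be realized as follows, with projection sizes chosen arbitrarily for illustration: queries come from the video tokens, while keys and values are projected from Fₕ.

```python
# Sketch of the HCE cross-attention injection: softmax(QK^T / sqrt(d)) V,
# with K and V projected from the high-level color tokens F_h.
import math
import torch
import torch.nn as nn

class ColorCrossAttention(nn.Module):
    def __init__(self, dim: int = 1024, heads: int = 16):
        super().__init__()
        self.heads, self.dk = heads, dim // heads
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.to_out = nn.Linear(dim, dim)

    def forward(self, x, f_h):
        # x: (B, N, dim) video tokens; f_h: (B, M, dim) high-level color tokens
        B, N, _ = x.shape
        q = self.to_q(x).view(B, N, self.heads, self.dk).transpose(1, 2)
        k = self.to_k(f_h).view(B, -1, self.heads, self.dk).transpose(1, 2)
        v = self.to_v(f_h).view(B, -1, self.heads, self.dk).transpose(1, 2)
        attn = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(self.dk), dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, -1)
        return self.to_out(out)                   # color-conditioned update of the video tokens
```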
AnimeColor represents a significant methodological advance in reference-based animation colorization, unifying transformer-based diffusion, semantic/detail color control, and staged optimization to deliver high color fidelity, sketch alignment, and temporal consistency for practical animation production (Zhang et al., 27 Jul 2025).