TimeColor: Flexible Video Colorization Model

Updated 8 January 2026

TimeColor is a multi-reference video colorization model that enables artists to assign explicit color cues to designated regions in sketches and frames.
It employs temporal latent concatenation and spatiotemporal correspondence-masked transformer attention to achieve high color fidelity and identity consistency.
The model integrates modality-disjoint RoPE indexing with a diffusion pipeline to prevent cross-reference color leakage and maintain robust temporal stability.

TimeColor is a sketch-based video colorization model that enables flexible, multi-reference color guidance in 2D animation and related video workflows. Unlike conventional colorization methods that condition only on a single initial reference frame, TimeColor supports arbitrary numbers and heterogeneous types of references—including character sheets, background plates, and arbitrary colorized frames—with explicit region assignments per reference. This is accomplished via a combination of temporal latent frame concatenation, spatiotemporal correspondence-masked transformer attention, and modality-disjoint Rotary Positional Embedding (RoPE) indexing. The model achieves higher color fidelity, identity consistency, and temporal stability relative to state-of-the-art baselines on large-scale animation datasets (Sadihin et al., 1 Jan 2026).

1. Contrast with Single-Reference Colorization

In professional 2D animation, maintaining palette continuity and subject-accurate color association across scenes, poses, and backgrounds often requires referencing multiple sources. Historically, diffusion-based colorization systems have accepted only a single reference (typically the first keyframe), either neglecting additional cues or requiring sequential multi-pass propagation pipelines. These limitations increase manual effort, introduce risks of color bleeding, and often degrade temporal consistency.

TimeColor addresses these constraints by allowing an unbounded number of color references to be injected per inference pass, each with explicit “subject→reference” bindings. Regions in the sketch inherit color only from their designated references, so artists can flexibly specify and update palette cues across large pose/viewpoint changes and complex scene transitions. This approach reduces propagation effort, prevents inter-subject color leakage, and yields more faithful, temporally stable colorized outputs.

2. Temporal Latent Concatenation and Visual Input Encoding

All visual inputs—noisy video frames, sketch frames, and $R$ color reference images—are passed through a shared VAE encoder, producing token grids at latent spatial resolution $(H', W')$ and channel dimension $C$ . The inputs are concatenated along the temporal axis to form a single spatiotemporal sequence for transformer processing: $Z_{\mathrm{concat}} \in \mathbb{R}^{B \times (T + T + R) \times H' \times W' \times C}$ where the first $T$ frames are noisy latents, the next $T$ are corresponding sketches, and the last $R$ are reference-image latents. This temporal concatenation supports concurrent diffusion processing of all inputs, with the model's parameter count remaining constant as $R$ increases. Compute increases linearly with reference count, but additional references do not change model architecture.

3. Spatiotemporal Correspondence-Masked Transformer Attention

To bind each region of the target video exclusively to its intended references and prevent shortcutting or cross-identity palette leakage, TimeColor introduces a hard, per-token attention mask $M \in \{0,1\}^{N \times N}$ . Per-pixel ground-truth masks $\mathcal{M}_{t,r}(x,y)$ associate each spatial location with its designated reference. These masks are downsampled to the patch grid, assigning each token a reference identity $\rho(i)$ .

Attention is then restricted by: $M_{ij} = \begin{cases} 1, & \text{if token %%%%10%%%% is text or token %%%%11%%%% is not a reference, or } \rho(i)=\rho(j),\ 0, & \text{otherwise.} \end{cases}$

$\alpha_{ij} = \mathrm{softmax}_j \left(\frac{q_i^\top k_j}{\sqrt{d}} + (1 - M_{ij}) \cdot (-\infty)\right), \quad o_i = \sum_j \alpha_{ij} v_j$

This enforces that a sketch or noisy region $i$ may attend only to the key/value tokens $j$ belonging to its assigned reference, ensuring exclusive subject-reference binding and mitigating unintended color transfer across subjects.

4. Modality-Disjoint Rotary Positional Embedding (RoPE) Indexing

Simple temporal concatenation of video, sketch, and reference frames risks positional embedding collisions. TimeColor resolves this via modality-disjoint RoPE indexing, partitioning the index space by modality $m\in\{0,1,2\}$ (video, sketch, reference). For token temporal index $l$ and spatial coordinates $(i, j)$ : $\mathrm{RoPE}_m(l, i, j) = \mathrm{RoPE}\big(l,\, i + m \cdot H',\, j + m \cdot W'\big)$ This ensures non-overlapping position encoding for video, sketch, and reference tokens. References are assigned negative frame indices $l = -r$ to achieve temporal separation in the rotary embedding space, further preventing modality entanglement and palette confusion.

5. Diffusion Pipeline and Loss

TimeColor follows a variance-preserving DDPM noise schedule with $N$ timesteps. The latent at timestep $n$ is generated by: $z_n = \sqrt{\bar\alpha_n} z_0 + \sqrt{1-\bar\alpha_n}\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I), \quad n \sim \mathrm{Uniform}(\{1, \dots, N\})$ The transformer predicts the noise $\epsilon_\theta(z_n, n, \mathcal{C})$ , where context $\mathcal{C}$ aggregates noisy video latents, sketches, references, masks, and optional text. The training objective is the single-term noise prediction loss: $\mathcal{L}_{\mathrm{noise}} = \mathbb{E}_{z_0, n, \epsilon} \left[\| \epsilon - \epsilon_\theta(z_n, n, \mathcal{C}) \|_2^2 \right]$ Inference consists of standard DDPM denoising and VAE decoding to reconstruct RGB video frames.

6. Architectural Details and Model Budget

TimeColor utilizes the CogVideoX-5B (DiT) transformer backbone with 5B parameters, operating at latent resolution $480 \times 720$ (patch size $8\times8$ ), yielding $H' = 60$ , $W' = 90$ . The transformer comprises 48 layers with 16-head spatiotemporal self-attention (embedding dimension $d = 4096$ ) and MLP blocks with hidden size $16384$. All inputs are processed with a shared VAE encoder and decoder; the absence of additional adapters or special branches ensures parameter count remains fixed as reference number increases.

7. Experimental Validation

On the SAKUGA-42M dataset, candidate characters are enumerated using InternVL3, per-frame masks generated with GroundingDINO and SAM2, and a minimum frame gap $g=17$ enforced for diverse training sample generation, yielding approximately $120$K single-reference and $96$K multi-reference training samples. Training follows a three-stage curriculum on 6×A40 GPUs (AdamW, learning rate $1 \times 10^{-5}$ , supervision window $f=17$ ).

Evaluation on approximately $1200$ clips includes starting-frame, arbitrary-frame, and multi-reference regimes. Baselines comprise LVCD, AniDoc, ToonCrafter, ToonComposer, LongAnimation, and VACE. For multi-reference scenarios, single-reference baselines are extended by multi-step or tiled-collage workarounds.

Metrics:

Frame-level PSNR, SSIM, LPIPS; distributional FID; video-level FVD.

Results (sample metrics):

Starting-frame: SSIM $0.85$ (vs. $0.75$ next best), PSNR $24.9$ dB (vs. $21.8$ dB next best).
Arbitrary-frame: SSIM $0.81$, PSNR $21.9$ dB.
Multi-reference: SSIM $0.76$, PSNR $18.9$ dB.
FVD and FID reduced by $20–50 \%$ .

Qualitative observations show preserved subject palettes under extreme viewpoint or pose changes and robust cross-reference consistency. Hard attention masking eliminates spurious color mixing prevalent in “soft” mask-as-condition baselines.

Ablations:

Removing modality-disjoint RoPE induces washed-out colors in late frames.
Removing correspondence-masked attention yields severe cross-reference leakage.
Masking only reference-to-reference tokens reduces, but does not eliminate, leakage.

This suggests that TimeColor’s combination of temporal concatenation, modality-disjoint RoPE, and hard attention gating within a DiT backbone enables highly flexible, faithful multi-reference colorization under a fixed model budget, substantially outperforming prior single-reference methods in both accuracy and stability (Sadihin et al., 1 Jan 2026).

Markdown Upgrade to Chat

References (1)

TimeColor: Flexible Reference Colorization via Temporal Concatenation (2026)

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to TimeColor.