
Sakuga-42M: Cartoon Animation Dataset

Updated 8 January 2026
  • Sakuga-42M is a multimodal dataset comprising 42M authentic, hand-drawn keyframes from diverse cartoon animations.
  • It features detailed semantic annotations, including temporally coherent captions and fine-grained anime tags, for robust multimedia analysis.
  • Fine-tuning contemporary video models with Sakuga-42M yields significant improvements in cartoon comprehension and generation tasks.

Sakuga-42M is a large-scale, multimodal dataset of authentic, hand-drawn 2D cartoon animation, constructed to address the domain bias that separates cartoons from natural video in contemporary video foundation models. Comprising 42 million non-repetitive keyframes, the dataset is sourced from more than 150,000 publicly available cartoon videos across diverse regions, years, and artistic styles. Sakuga-42M provides comprehensive semantic annotations, including temporally coherent captions, fine-grained anime tags, rich content taxonomies, and auxiliary metadata. Empirical results demonstrate that contemporary models such as Video CLIP, Video Mamba, and SVD, when fine-tuned with Sakuga-42M, achieve significant gains in cartoon comprehension and generation tasks, evidencing a robust scaling law without saturation.

1. Dataset Construction and Scope

The Sakuga-42M dataset was built via a multi-step data-collection and preprocessing pipeline:

  • Raw Video Crawling: Custom crawlers aggregated approximately 150,000 videos from YouTube, Twitter, and animation communities, strictly adhering to privacy policies. Source formats included mkv, mp4, and webm.
  • Shot/Clip Splitting: PySceneDetect segmented videos into ∼1.4 million clips, with each clip having a minimum duration of 18 frames and extending up to 200+ frames.
  • Keyframe Detection: Leveraging the tendency of cartoons to "animate on twos or threes," SSIM was computed between adjacent frames to discard duplicates, removing roughly 45% of frames as redundant and yielding the dataset's 42 million keyframes (see the sketch after this list).
  • Semantic Captioning: wd14-swin-v2 generated anime tags every 12 frames, BLIP-v2 provided initial captions, and ChatGPT-175B integrated tags and BLIP outputs to formulate coherent, temporally aware descriptions. Raw BLIP-v2 captions were preserved for future comparison.
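
The SSIM-based duplicate removal can be illustrated with a short sketch. This is a minimal approximation rather than the authors' pipeline: it uses OpenCV for decoding and scikit-image's structural similarity, and the threshold `SSIM_DUP_THRESHOLD` is a hypothetical value, not one reported for Sakuga-42M.

```python
# Minimal sketch of SSIM-based duplicate-frame removal (not the authors' exact code).
# Assumes OpenCV for decoding and scikit-image for SSIM; the threshold is hypothetical.
import cv2
from skimage.metrics import structural_similarity as ssim

SSIM_DUP_THRESHOLD = 0.95  # hypothetical: frames at least this similar count as duplicates


def extract_keyframes(clip_path: str) -> list:
    """Keep frames that differ structurally from the immediately preceding frame."""
    cap = cv2.VideoCapture(clip_path)
    keyframes, prev_gray = [], None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Keep the first frame, and any frame whose SSIM to its predecessor is below
        # the duplicate threshold (exploiting "animate on twos/threes" redundancy).
        if prev_gray is None or ssim(prev_gray, gray) < SSIM_DUP_THRESHOLD:
            keyframes.append(frame)
        prev_gray = gray
    cap.release()
    return keyframes
```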

Quantitative summary:

| Attribute | Statistic / proportion |
| --- | --- |
| Total keyframes | 42 M |
| Mean clip length | 35 frames |
| Clip duration buckets | <12: 23.02%; 12–24: 27.51%; 24–48: 27.00%; 48–96: 15.99%; >96: 6.48% |
| Regional distribution | JP 45%, US 20%, CN 20%, EU 15% |
| Artistic style | Raster-based Asian 70%, vector-based Western 30% |
| Temporal coverage | Approximately uniform across the 1950s–2020s; each decade ≈12–18% (≈ 42 M / 8 keyframes per decade) |

This structure ensures broad diversity in region, time, and style, spanning raster-based Asian animation and vector-based Western styles, and provides coverage from the 1950s through the 2020s (Pan, 2024).

2. Semantic Annotations and Taxonomy

Sakuga-42M is annotated across multiple modalities and hierarchical taxonomies:

  • Video–Text Description Pairs: Generated via BLIP-v2 and ChatGPT-175B, with an average length of 40 words; over 85% are longer than 20 words. Descriptions are temporally coherent, leveraging anime-tag priors.
  • Fine-grained Anime Tags: Over 600 tags from wd14-swin-v2, capturing character traits, objects, and scene semantics.
  • Content Taxonomies: Six orthogonal dimensions:
    • Time (per decade)
    • Venue (region: JP, US, CN, EU)
    • Media (Cel-animation, Raster, Vector)
    • Objective (Action, Comedy, Educational, Documentary, etc.)
    • Composition (Static, Dynamic, Pan, Zoom, Camera Move)
    • Character Type (Human, Animal, Creature, Object)
  • Auxiliary Metadata: CAFe-aesthetic score, dynamicity ratio (keyframes/frames), safety ratings (general, sensitive, questionable, explicit), text-presence probability, resolution, aspect ratio, split boundaries, URLs, and hash IDs.

Taxonomy is provided in a nested structure, facilitating multi-attribute queries and aggregated analysis.
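
As an illustration of such multi-attribute queries, the sketch below filters a handful of per-clip records by region, medium, safety rating, and aesthetic score. The field names (`venue`, `media`, `safety`, `aesthetic_score`, `dynamicity`) are illustrative assumptions, not the dataset's exact schema; consult the released annotation files for the real field names.

```python
# Illustrative multi-attribute query over per-clip metadata (hypothetical field names).
import pandas as pd

clips = pd.DataFrame([
    {"clip_id": "a1", "venue": "JP", "media": "Cel-animation", "decade": "1990s",
     "safety": "general", "aesthetic_score": 0.81, "dynamicity": 0.42},
    {"clip_id": "b2", "venue": "US", "media": "Vector", "decade": "2010s",
     "safety": "general", "aesthetic_score": 0.67, "dynamicity": 0.31},
    {"clip_id": "c3", "venue": "JP", "media": "Raster", "decade": "2000s",
     "safety": "questionable", "aesthetic_score": 0.74, "dynamicity": 0.55},
])

# Example: safe Japanese cel/raster clips with above-average aesthetics.
subset = clips[
    (clips["venue"] == "JP")
    & (clips["media"].isin(["Cel-animation", "Raster"]))
    & (clips["safety"] == "general")
    & (clips["aesthetic_score"] >= 0.75)
]
print(subset["clip_id"].tolist())  # -> ['a1']
```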

3. Foundation Model Adaptation and Fine-Tuning

Sakuga-42M was used to fine-tune multiple contemporary architectures for enhanced cartoon comprehension and generation:

  • Video–Language Understanding: ViCLIP-200M and VideoMamba-25M (dual encoder: vision/text) trained with a contrastive loss:

\mathcal{L}_{\mathrm{contrast}} = -\sum_{i=1}^{N} \left[ \log\frac{\exp\bigl(\mathrm{sim}(f_i^V, f_i^T)/\tau\bigr)}{\sum_{j}\exp\bigl(\mathrm{sim}(f_i^V, f_j^T)/\tau\bigr)} + \log\frac{\exp\bigl(\mathrm{sim}(f_i^T, f_i^V)/\tau\bigr)}{\sum_{j}\exp\bigl(\mathrm{sim}(f_i^T, f_j^V)/\tau\bigr)} \right]
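
A minimal PyTorch rendering of this symmetric InfoNCE-style objective is sketched below. It assumes pre-computed video and text embeddings and is not tied to the ViCLIP or VideoMamba codebases; it is mean-reduced over the batch rather than summed.

```python
# Sketch of the symmetric video-text contrastive loss (InfoNCE in both directions).
# Assumes batched embeddings of shape (N, D); not the authors' exact implementation.
import torch
import torch.nn.functional as F


def contrastive_loss(f_v: torch.Tensor, f_t: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    f_v = F.normalize(f_v, dim=-1)             # cosine similarity via normalized dot products
    f_t = F.normalize(f_t, dim=-1)
    logits = f_v @ f_t.t() / tau               # (N, N) matrix of sim(f_i^V, f_j^T) / tau
    targets = torch.arange(f_v.size(0), device=f_v.device)
    # Video-to-text and text-to-video cross-entropy terms, matching the two log-softmax sums
    # in the equation (mean-reduced over the batch rather than summed).
    return F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)
```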

Timesheet predictor objective:

\mathcal{L}_{\mathrm{timesheet}} = -\sum_{c=1}^{C} t_c \,\log\bigl(\mathrm{softmax}(y)_c\bigr)
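
This is a standard multi-class cross-entropy over C timesheet classes. The sketch below uses a hypothetical linear head and class count purely for illustration.

```python
# Minimal sketch of a timesheet classification head and its cross-entropy objective.
# The head architecture and class count are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

num_classes = 8                       # hypothetical number of timesheet classes C
head = nn.Linear(512, num_classes)    # maps a pooled clip embedding to class logits y

clip_embedding = torch.randn(4, 512)  # batch of 4 pooled video features
targets = torch.randint(0, num_classes, (4,))

logits = head(clip_embedding)                       # y in the equation above
loss_timesheet = F.cross_entropy(logits, targets)   # -sum_c t_c log softmax(y)_c
```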

The training regimen comprised one epoch over a 90% training split, batch size 256, and a masking probability of 0.9, executed on 2×A6000 GPUs with DeepSpeed and Accelerate.

  • Video Generation (SVD-based): SVD-MV Stage-III with latent diffusion objective:

\mathcal{L}_{\mathrm{SVD}} = \mathbb{E}_{z_t,\,\epsilon}\bigl\|\epsilon - \epsilon_\theta(z_t, t, c)\bigr\|^2, \qquad c = \{\text{conditioning from keyframes}\}

Conditioning used timesheet class predictions from ViCLIP/VideoMamba together with the timesheet predictor head. Fine-tuning ran for 5,000 iterations at a learning rate of 5×10⁻⁶ with 8 frames at 576×320 resolution on 2×A6000 GPUs, with all other settings left at the SVD defaults.
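
The fine-tuning objective is the usual epsilon-prediction loss of latent diffusion. The sketch below is schematic: `denoiser`, `alphas_cumprod`, and `cond` are placeholders standing in for SVD internals, not the released training code.

```python
# Schematic epsilon-prediction (latent diffusion) training step, matching L_SVD above.
# `denoiser`, `alphas_cumprod`, and `cond` are placeholders for SVD internals.
import torch
import torch.nn.functional as F


def diffusion_loss(denoiser, z0, cond, alphas_cumprod):
    """MSE between sampled noise and predicted noise at a random timestep."""
    t = torch.randint(0, alphas_cumprod.size(0), (z0.size(0),), device=z0.device)
    eps = torch.randn_like(z0)
    # Broadcast the schedule term over all latent dimensions (works for image or video latents).
    a_t = alphas_cumprod[t].view(-1, *([1] * (z0.dim() - 1)))
    z_t = a_t.sqrt() * z0 + (1 - a_t).sqrt() * eps      # forward-noised latent
    eps_pred = denoiser(z_t, t, cond)                   # epsilon_theta(z_t, t, c)
    return F.mse_loss(eps_pred, eps)
```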

These adaptation protocols facilitated measurement of cartoon-specific comprehension and generation performance across scaling regimes.

4. Experimental Evaluation and Scaling Effects

Extensive empirical evaluation demonstrated substantial performance increases on cartoon-specific tasks:

Zero-Shot Retrieval

Models fine-tuned with Sakuga-42M achieved significant increases in Recall@k on text-to-video (T2V) and video-to-text (V2T) retrieval tasks, tested on a 5% split (44,000 clips, 2.01M keyframes).

| Model + domain | T2V R@1 | T2V R@5 | T2V R@10 | V2T R@1 | V2T R@5 | V2T R@10 |
| --- | --- | --- | --- | --- | --- | --- |
| ViCLIP@InternVid-200M | 22.4 | 38.2 | 45.1 | 16.8 | 30.8 | 37.7 |
| + Sakuga-42M | 35.5 | 54.2 | 61.6 | 29.7 | 48.6 | 56.5 |
| VideoMamba@NatVid | 18.3 | 31.2 | 38.2 | 11.7 | 23.7 | 30.0 |
| + Sakuga-42M | 57.1 | 75.3 | 81.2 | 56.4 | 74.4 | 80.5 |
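
Recall@k values like those above can be computed directly from a cross-modal similarity matrix. The sketch below assumes pre-computed, normalized embeddings with ground-truth pairs on the diagonal, as in standard T2V/V2T evaluation.

```python
# Minimal Recall@k over a similarity matrix (rows: queries, cols: candidates).
# Assumes ground-truth pairs lie on the diagonal.
import torch


def recall_at_k(sim: torch.Tensor, k: int) -> float:
    n = sim.size(0)
    topk = sim.topk(k, dim=1).indices                  # top-k candidate indices per query
    targets = torch.arange(n, device=sim.device).unsqueeze(1)
    hits = (topk == targets).any(dim=1).float()        # 1 if the true match is in the top k
    return hits.mean().item()

# Example: text-to-video retrieval with text embeddings as queries.
# sim_t2v = text_emb @ video_emb.t()
# print(recall_at_k(sim_t2v, 1), recall_at_k(sim_t2v, 5), recall_at_k(sim_t2v, 10))
```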

Scaling-law analysis shows monotonic improvements with increased data scale (Sakuga-Small → Sakuga-Aesthetic → full 42M), with no observed performance plateau at 42M keyframes.

Cartoon Generation

Generation quality was compared using IS, FID, FVD, and CLIPSIM; fine-tuning SVD on Sakuga-42M yielded notable gains on all four metrics.

| Model | IS ↑ | FID ↓ | FVD ↓ | CLIPSIM ↑ |
| --- | --- | --- | --- | --- |
| SVD@LVD (natural) | 1.09 | 23.10 | 307.6 | 0.267 |
| SVD + Sakuga-42M | 1.12 | 14.67 | 92.3 | 0.279 |
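
CLIPSIM here is the average CLIP image–text cosine similarity over generated frames. The sketch below uses Hugging Face's CLIP ViT-B/32 as one possible backbone; the exact CLIP variant behind the reported numbers is an assumption, not stated here.

```python
# Sketch of a CLIPSIM-style score: mean cosine similarity between generated frames
# and the prompt under a CLIP backbone. The specific checkpoint is an assumption.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


@torch.no_grad()
def clipsim(frames, prompt: str) -> float:
    """frames: list of PIL images from one generated clip."""
    inputs = processor(text=[prompt], images=frames, return_tensors="pt", padding=True)
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    return (image_emb @ text_emb.t()).mean().item()   # average frame-prompt similarity
```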

Ablations confirm that both comprehension and generation tasks benefit from fine-tuning on larger corpus subsets.

Domain Gap Visualization

FID/FVD analyses indicate pronounced distributional divergence between the natural and cartoon video domains (FID ≈ 80–81, FVD ≈ 475–498). t-SNE projections reveal isolated Sakuga-42M clusters, substantiating the domain-gap hypothesis.
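
A qualitatively similar visualization can be produced by projecting frame or clip embeddings from both domains with t-SNE. The sketch below assumes two pre-computed embedding arrays (here random placeholders) and uses scikit-learn and matplotlib.

```python
# Sketch of a t-SNE projection contrasting natural-video and cartoon embeddings.
# `natural_emb` and `cartoon_emb` stand in for pre-computed features (e.g., CLIP), shape (n, d).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

natural_emb = np.random.randn(500, 512)   # placeholder for natural-video features
cartoon_emb = np.random.randn(500, 512)   # placeholder for Sakuga-42M features

points = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(
    np.concatenate([natural_emb, cartoon_emb], axis=0)
)
plt.scatter(points[:500, 0], points[:500, 1], s=4, label="natural video")
plt.scatter(points[500:, 0], points[500:, 1], s=4, label="Sakuga-42M")
plt.legend()
plt.title("t-SNE of video embeddings: natural vs. cartoon domains")
plt.show()
```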

5. Benefits and Applications

Sakuga-42M supports a broad spectrum of cartoon-specific research and practical applications:

  • Domain Bridging: Data scale dramatically narrows the domain gap for models trained on natural footage, enabling robust adaptation to hand-drawn animation.
  • Generalization: Fine-tuned models generalize well across diverse styles (cel, limited, vector) and decades.
  • Data-Efficiency: Large volumes of high-quality, paired captions eliminate dependence on costly human annotation.

Principal applications include:

  1. Text-to-cartoon and cartoon-to-text retrieval
  2. Automated inbetweening (latent diffusion, frame interpolation)
  3. Line-art colorization (sketch-to-color referencing)
  4. Avatar and style transfer (via Stable Diffusion, AnimateDiff, LoRA/T2I-Adapter)
  5. Cartoon generation at scale (open-source Pika-style models)
  6. Video editing and sprite decomposition (layered editing, toon tracking)

6. Limitations and Future Directions

Several constraints and areas for improvement are identified:

  • Caption Coherence: Some LLM-generated descriptions lack precise spatial relationship modeling. Proposed future work includes reinforcement learning from human feedback.
  • Resolution: 80% of clips are 480p, limiting achievable fidelity; higher-resolution data would improve generation quality.
  • Taxonomy/Tag Bias: Over-representation of Asian cel-animation and gender/human-centric tags. Recommendations include more balanced tagging strategies and increased coverage of underrepresented styles.
  • Safety: While >99.5% of content is general, ≈0.4% is questionable and ≈0.07% is explicit. Users are encouraged to apply filtering according to application requirements.

Sakuga-42M establishes a new standard for cartoon research, enabling scalable methods for robust comprehension and high-fidelity generation, and bridging the gap to enable style-aware animation tools for academic and production settings (Pan, 2024).

