Sakuga-42M: Hand-drawn Animation Dataset
- Sakuga-42M is a vast collection of over 42 million hand-drawn keyframes, featuring diverse artistic styles and extensive semantic annotations.
- The dataset's multi-modal annotations, including video-text descriptions and anime tags, enable precise scene understanding and support robust animation generation.
- Its innovative pipeline—from automated shot segmentation to SSIM-based keyframe selection—drives significant improvements in model training, evaluation, and benchmarking.
The Sakuga-42M Dataset is the first large-scale dataset dedicated to hand-drawn cartoon animation, designed to address the significant domain gap between natural video and artistic 2D animation in both comprehension and generation tasks. Comprising 42 million keyframes drawn from over 1.2 million video clips, Sakuga-42M provides comprehensive semantic annotations, detailed video-text descriptions, and a broad representation of artistic styles, regional origins, and historical eras in animation. Its construction, scale, and multimodal richness establish a new foundation for training and benchmarking advanced machine learning models specific to the animation domain.
1. Scale, Diversity, and Composition
Sakuga-42M encompasses approximately 42 million keyframes sourced from around 150,000 raw cartoon videos, resulting in 1.2 million clips with an average of 35 keyframes each. The dataset captures a spectrum of animation:
- Artistic Styles: Includes cel-animation, rough sketches, tie-downs, trace paints, and fully colored frames, representing both raster-based (Asian) and vector-based (Western) workflows.
- Geographical and Temporal Diversity: Covers Japanese, American, Chinese, and European productions from the 1950s through the 2020s.
- Animation Techniques: Annotates motion timing typologies, such as "on-ones" (24 fps), "on-twos" (12 fps), and "on-threes" (8 fps), with explicit detection and classification (a detection sketch follows this subsection).
- Resolution: The bulk of the content is 480p, with resolutions ranging up to 2160p; both 16:9 and 4:3 aspect ratios are well represented.
The collection is sampled from publicly available sources (YouTube, Twitter, animation communities) and is distributed under CC BY-NC 4.0 licensing with reference-only URLs.
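The timing typologies above can be estimated directly from frame redundancy: a shot animated "on-twos" holds each drawing for two video frames, so roughly half of all consecutive frame pairs are near-duplicates. The snippet below is a minimal sketch of that idea using mean absolute frame difference; the duplicate threshold and the mapping from median hold length to timing label are illustrative assumptions, not the dataset's actual detection method.

```python
import cv2
import numpy as np

def estimate_timing(video_path, diff_threshold=2.0):
    """Rough 'on-Ns' estimate for a 24 fps source: measure how many video
    frames each drawing is held.

    diff_threshold (mean absolute pixel difference) is an illustrative
    value, not taken from the Sakuga-42M pipeline.
    """
    cap = cv2.VideoCapture(video_path)
    prev, holds, run = None, [], 1
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).astype(np.float32)
        if prev is not None:
            if np.abs(gray - prev).mean() < diff_threshold:
                run += 1           # same drawing held for another frame
            else:
                holds.append(run)  # drawing changed: record hold length
                run = 1
        prev = gray
    cap.release()
    if not holds:
        return "unknown"
    median_hold = int(np.median(holds))
    # Hold of 1 frame -> on-ones, 2 -> on-twos, 3 or more -> on-threes.
    return {1: "on-ones", 2: "on-twos"}.get(median_hold, "on-threes")
```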
2. Semantic Annotations and Taxonomies
Each video clip in Sakuga-42M is accompanied by an extensive annotation suite:
- Video-Text Pairs: Descriptions are constructed by anime tagging models (e.g., wd14-swin-v2) and BLIP-v2, then further contextualized and temporally linked by LLMs such as ChatGPT-175B, yielding coherent captions of ~40 words per video.
- Anime Tags: Derived from large-scale datasets (Danbooru2021, Waifu) using dedicated models, providing explicit semantic categories (scene, character, action, etc.).
- Hierarchical Content Taxonomies: Cover attributes like period, venue, media, composition, character archetypes, and more, facilitating structured dataset queries.
- Aesthetic and Dynamic Scores: Models such as cafe-aesthetic quantify visual appeal, while a dynamic score measures motion richness and supports timesheet classification.
- Safety and Text Detection: Assigns safety ratings (>99.5% safe), with explicit flags for NSFW content, subtitles, and instructional overlays.
- Meta Information: Contains shot boundaries, unique hash identifiers, and source URLs.
Annotations are stored in Apache Parquet files for scalability and efficiency.
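Because the annotations ship as Parquet tables, they can be loaded and filtered with standard dataframe tooling. The sketch below assumes hypothetical file and column names (`safety_rating`, `aesthetic_score`, `dynamic_score`, `url`, `caption`); consult the released schema for the actual fields.

```python
import pandas as pd

# Load one annotation shard (file name and column names are assumptions;
# check the official Sakuga-42M release for the real schema).
df = pd.read_parquet("sakuga42m_annotations.parquet")

# Keep only clips flagged as safe, with reasonably high visual appeal
# and noticeable motion (thresholds are illustrative).
subset = df[
    (df["safety_rating"] == "safe")
    & (df["aesthetic_score"] >= 0.5)
    & (df["dynamic_score"] >= 0.5)
]

print(len(subset), "clips selected")
print(subset[["url", "caption"]].head())
```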
3. Technical Construction Pipeline
The Sakuga-42M technical pipeline consists of several automated stages:
- Acquisition: Web crawlers harvest animation content globally, enforcing copyright and license constraints.
- Shot Segmentation: PySceneDetect splits long videos into atomic, narratively consistent units of at least 18 frames.
- Keyframe Selection: The structural similarity index (SSIM) is used to identify and discard repetitive frames (~45% reduction), ensuring unique, information-rich keyframes; a combined sketch of these two stages follows this list.
- Captioning: Sampled frames undergo multi-stage natural-language description, with anime taggers providing class guidance to BLIP-v2 and large LLMs then fusing temporal and scene context; a prompt-guided captioning sketch also follows this list.
- Annotation Aggregation: Resulting multi-level tags and descriptions are coalesced into a navigable taxonomy.
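A minimal sketch of the segmentation and keyframe-filtering stages appears below, pairing PySceneDetect's content detector with an SSIM-based near-duplicate filter from scikit-image. The 18-frame minimum comes from the description above; the SSIM threshold of 0.95 is an illustrative assumption rather than the pipeline's published value.

```python
import cv2
from scenedetect import detect, ContentDetector
from skimage.metrics import structural_similarity as ssim

def segment_and_filter(video_path, ssim_threshold=0.95):
    """Split a video into shots, then drop near-duplicate frames per shot."""
    # 1) Shot segmentation: enforce the 18-frame minimum shot length.
    shots = detect(video_path, ContentDetector(min_scene_len=18))

    cap = cv2.VideoCapture(video_path)
    kept_per_shot = []
    for start, end in shots:
        kept, last_gray = [], None
        for idx in range(start.get_frames(), end.get_frames()):
            cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
            ok, frame = cap.read()
            if not ok:
                break
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            # 2) Keep a frame only if it differs enough from the last kept one.
            if last_gray is None or ssim(last_gray, gray) < ssim_threshold:
                kept.append(idx)
                last_gray = gray
        kept_per_shot.append(kept)
    cap.release()
    return shots, kept_per_shot
```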
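The tag-guided captioning stage can be approximated with off-the-shelf components: an anime tagger supplies tags that are injected into a BLIP-2 prompt, and the resulting per-frame captions are later fused into one clip-level description by an LLM. The sketch below is a minimal approximation under these assumptions; the checkpoint, prompt template, and example tags are illustrative, not the dataset's actual configuration.

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Model choice and prompt template are assumptions for illustration only.
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

def caption_frame(image_path, anime_tags):
    """Caption one keyframe, using anime tags as class guidance in the prompt."""
    image = Image.open(image_path).convert("RGB")
    prompt = (
        "Question: Describe this animation frame, which contains "
        + ", ".join(anime_tags)
        + ". Answer:"
    )
    inputs = processor(images=image, text=prompt, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=40)
    return processor.decode(out[0], skip_special_tokens=True).strip()

# Tags would come from an anime tagger such as wd14-swin-v2; here they are
# hard-coded for illustration. Per-frame captions would then be fused into
# one ~40-word clip description by an LLM.
print(caption_frame("frame_000.png", ["1girl", "running", "outdoors"]))
```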
4. Use in Model Training, Evaluation, and Benchmarking
Sakuga-42M enables comprehensive training and evaluation protocols for both video-language understanding and video generation tasks:
- Video-LLMs: Foundation models such as Video CLIP (ViCLIP) and VideoMamba are finetuned on Sakuga-42M's keyframes and enhanced captions, yielding monotonic improvements in retrieval and understanding tasks; for example, text-to-video retrieval R@1 rises from 18.3% to 57.1% for VideoMamba as the dataset scale grows (R@1 is computed as in the sketch after this list).
- Video Generation: Stable Video Diffusion (SVD), when finetuned on Sakuga-42M, exhibits significant performance gains: IS improves from 1.09 to 1.12, FID drops 36.5% (23.10→14.67), and FVD decreases 70% (307.6→92.3). Generated videos display superior animation-like dynamics and character consistency compared to models trained solely on natural video (Pan, 13 May 2024).
- Automatic Colorization and Sketch-Based Tasks: The SAKUGA dataset, a filtered and preprocessed subset of Sakuga-42M, serves as the training and evaluation ground for models such as SketchColour, which focus on sketch-to-colour animation. Protocols ensure parity with prior benchmarks (AniDoc, LVCD, ToonCrafter) by using fixed-resolution inputs, sequential sampling, and controlled sketch generation via Anime2Sketch with strict binarization.
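Recall@1 for text-to-video retrieval, the metric cited above, is computed from a text-video similarity matrix. The sketch below shows the standard computation on cosine similarities between text and video embeddings; it is a generic illustration, not the evaluation code used in the paper.

```python
import numpy as np

def recall_at_1(text_emb: np.ndarray, video_emb: np.ndarray) -> float:
    """Text-to-video R@1: fraction of captions whose top-ranked video is
    the paired ground-truth video (caption i pairs with video i)."""
    # L2-normalise so the dot product is a cosine similarity.
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    sim = t @ v.T                     # (num_texts, num_videos)
    top1 = sim.argmax(axis=1)         # best-matching video per caption
    return float((top1 == np.arange(len(t))).mean())

# Example with random embeddings (shapes only; real embeddings would come
# from a finetuned ViCLIP or VideoMamba encoder).
rng = np.random.default_rng(0)
print(recall_at_1(rng.normal(size=(100, 512)), rng.normal(size=(100, 512))))
```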
Example Table: Data Partitioning in the SAKUGA Subset
| Partition | Videos (post-filter) | Videos Sampled |
|---|---|---|
| Training | ~150,000 | 80,000 |
| Test | ~60,000 | 1,000 |
Each clip contributes 17 frames, and all frames are resized to a fixed resolution.
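The evaluation samples implied by this partitioning pair one colored reference keyframe with a 17-frame sequence of line-art sketches. The sketch below shows one plausible way to assemble such a sample from pre-computed sketch images, applying strict binarization with a fixed threshold; the threshold value and data layout are assumptions, and the actual protocol derives sketches from Anime2Sketch.

```python
import cv2
import numpy as np

CLIP_LEN = 17  # frames per evaluation sample, as in the protocol above

def build_sample(color_frames, sketch_frames, threshold=200):
    """Assemble (reference keyframe, binary sketches, ground truth).

    color_frames / sketch_frames: lists of HxWx3 / HxW uint8 arrays for one
    clip. The binarization threshold of 200 is an illustrative assumption.
    """
    assert len(color_frames) >= CLIP_LEN and len(sketch_frames) >= CLIP_LEN
    reference = color_frames[0]                 # colored keyframe
    ground_truth = color_frames[:CLIP_LEN]      # frames the model must predict
    sketches = []
    for s in sketch_frames[:CLIP_LEN]:
        # Strict binarization: every pixel becomes pure black or white.
        _, binary = cv2.threshold(s, threshold, 255, cv2.THRESH_BINARY)
        sketches.append(binary)
    return reference, np.stack(sketches), np.stack(ground_truth)
```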
5. Evaluation Metrics and Benchmark Protocols
Quantitative evaluation relies on metrics that assess both colorization correctness and overall video quality:
- Mean Squared Color Error (MSCE): Mean per-pixel squared error between predicted and ground-truth color frames.
- Peak Signal-to-Noise Ratio (PSNR): Log-scaled reconstruction fidelity, $10 \log_{10}(\mathrm{MAX}^2/\mathrm{MSE})$.
- Structural Similarity (SSIM): Luminance, contrast, and structure awareness.
- LPIPS: Deep-learned perceptual similarity.
- Fréchet Video Distance (FVD) (Unterthiner et al., 2019): Temporal coherence and realism assessment.
Metrics are reported as mean ± standard deviation, with frame counts matched to legacy baselines for direct comparison. The task protocol requires generating a sequence (typically 17 frames) from a colored keyframe and corresponding sketches.
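The per-frame metrics above can be computed with standard libraries; a minimal sketch follows, using scikit-image for PSNR/SSIM and the lpips package for perceptual distance (FVD requires a pretrained video feature extractor and is omitted here). Averaging over the generated frames and reporting mean ± standard deviation follows the protocol described above.

```python
import numpy as np
import torch
import lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_model = lpips.LPIPS(net="alex")  # perceptual metric (AlexNet backbone)

def frame_metrics(pred, gt):
    """pred, gt: HxWx3 uint8 frames. Returns (PSNR, SSIM, LPIPS)."""
    psnr = peak_signal_noise_ratio(gt, pred, data_range=255)
    ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=255)
    # lpips expects NCHW tensors scaled to [-1, 1].
    to_tensor = lambda x: torch.from_numpy(x).permute(2, 0, 1)[None].float() / 127.5 - 1.0
    with torch.no_grad():
        lp = lpips_model(to_tensor(pred), to_tensor(gt)).item()
    return psnr, ssim, lp

def video_metrics(pred_frames, gt_frames):
    """Average metrics over a generated clip (e.g. 17 frames)."""
    scores = np.array([frame_metrics(p, g) for p, g in zip(pred_frames, gt_frames)])
    return scores.mean(axis=0), scores.std(axis=0)  # report mean ± std
```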
6. Implications, Challenges, and Future Directions
Sakuga-42M addresses several longstanding challenges in animation AI research:
- Domain Gap: The dataset provides unique distributional properties—visual structuring, temporal dynamics, and style—that are absent from natural video, mitigating the ineffectiveness of natural video-trained models.
- Data Scarcity in Animation: Prior animation datasets offered only tens of thousands of keyframes, lacked geographic and stylistic breadth, and often relied on non-hand-drawn content. Sakuga-42M offers >40x increase in scale with authentic, context-rich data.
- Semantic Context for Multimodal Learning: Rich multi-granular annotations make Sakuga-42M suitable for training and evaluating vision-LLMs, content editing tools, inbetweening, retrieval, and bias/safety studies.
- Empirical Scaling Laws: Performance in both comprehension (retrieval, captioning) and generation tasks increases monotonically with Sakuga-42M scale, with non-converged scaling curves—suggesting further scale-up could yield additional advances.
A plausible implication is that foundational cartoon models trained or adapted on Sakuga-42M will continually benefit from increased dataset size, in line with trajectories observed in natural video research.
7. Access, Maintenance, and Community Impact
Sakuga-42M is open-source and available for academic and research purposes, with all annotation files, data pipelines, model weights, and curation scripts provided via the official repository. Annual updates ensure removal of dead links and addition of newly collected material. Content licensing respects creators’ rights and all data is referenced via URLs for legal compliance.
The dataset supports a broad range of animation-specific research threads—including supervised colorization, sketch-to-video, retrieval, style transfer, safety, and evaluation—establishing itself as a central resource for 2D animation foundation model advancement and benchmarking (Pan, 13 May 2024, Sadihin et al., 2 Jul 2025).