MMLottie-2M Dataset for Multi-modal Animation
- MMLottie-2M is a large-scale vector animation dataset comprising 2M samples with detailed multi-modal annotations across text, image, and video modalities.
- It employs a principled tokenizer and normalization pipeline to convert raw Lottie JSON files into structured, model-friendly representations.
- The dataset integrates diverse sources, including web-crawled and synthetic SVG-derived animations, with standardized splits for robust training and evaluation.
MMLottie-2M is a large-scale, unified dataset of vector animations, created to advance research in multi-modal generative modeling for Lottie animations. Comprising two million samples paired with rich textual and visual annotations, MMLottie-2M emphasizes both professional design diversity and procedural motion coverage, supporting modeling tasks conditioned on text, image, and video modalities. Its curation incorporates a principled tokenizer and normalization pipeline to transform raw Lottie JSON files into structured, model-friendly representations, thereby facilitating the training and evaluation of next-generation vector animation models (Yang et al., 2 Mar 2026).
1. Dataset Composition and Source Breakdown
The MMLottie-2M dataset totals 2 million vector animation samples. The source distribution comprises two primary classes:
- Web-crawled Lottie files: $1.2$ million instances sampled from five distinct platforms: LottieFiles (42.3%), IconScout (23.7%), Flaticon (18.1%), Iconfont (9.8%), Icons8 (6.1%).
- Synthetic SVG-derived Lottie files: $0.8$ million instances produced by animating static SVGs (from the OmniSVG collection) using $1$–$3$ randomly assigned procedural motion templates.
Data splits follow a standard protocol: 96% train, 2% validation, 2% test. For benchmarking, the dataset defines MMLottie-Bench, a hold-out evaluation subset comprising $900$ cases (split equally by real/synthetic content and by modeling task), all disjoint from the training set.
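The counts implied by these percentages can be checked with a few lines of arithmetic. The snippet below is only an illustrative sketch: the function and constant names are made up for this example, and the per-split sizes are derived from the quoted percentages rather than taken from the dataset release.

```python
# Totals quoted in the text above.
TOTAL = 2_000_000
WEB_CRAWLED = 1_200_000
SYNTHETIC = 800_000

# Platform shares of the web-crawled subset, in tenths of a percent,
# so that all arithmetic stays in integers.
PLATFORM_SHARE_PER_MILLE = {
    "LottieFiles": 423,
    "IconScout": 237,
    "Flaticon": 181,
    "Iconfont": 98,
    "Icons8": 61,
}


def platform_counts(web_total=WEB_CRAWLED):
    """Number of web-crawled files per platform implied by the shares."""
    return {
        name: web_total * share // 1000
        for name, share in PLATFORM_SHARE_PER_MILLE.items()
    }


def split_sizes(total=TOTAL):
    """Sizes of a 96/2/2 train/validation/test split (derived, not reported)."""
    train = total * 96 // 100
    validation = total * 2 // 100
    test = total - train - validation
    return {"train": train, "validation": validation, "test": test}


counts = platform_counts()
splits = split_sizes()
```

Under these assumptions the platform counts sum back to the 1.2 million web-crawled files, and the three splits sum to the full 2 million samples.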
2. Modalities, Annotation Schema, and Data Pairing
Each animation is annotated with multiple modalities:
- Textual descriptions
  - Coarse caption: A global summary describing high-level identity, colors, and visual style.
  - Fine-grained temporal caption: Sequential details structured temporally, emphasizing object identities, spatial layout, and explicit motion verbs (e.g., "fading in," "rotating clockwise").
- Visual frames
  - Rendered video: 512 × 512 px resolution, $30$ fps, capturing the full animation sequence.
  - Keyframe image: Single frame from the rendered video, supporting the Text-Image-to-Lottie (TI₂Lottie) modality.
- Multi-modal labels
  - (image, text) pairs for TI₂Lottie.
  - Raw video instructions for Video-to-Lottie tasks.
- Additional metadata
  - Semantic tags: Fifteen high-level categories (e.g., "UI elements": 50%, "abstract patterns": 20%, others).
  - Color vocabulary: Dominant color tokens (e.g., "blue," "red," "white" among top-5).
  - Motion types: Distribution primarily favoring translation, then rotation, scaling, opacity changes, and path morphing.
The annotation pipeline leverages Qwen2.5-VL for dual-stage captioning, sequentially producing a global overview and temporally organized per-frame details with explicit use of geometry and motion keywords.
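One way to picture the annotation schema is as a single record per animation, from which the conditioning inputs for each task are selected. The sketch below is hypothetical: the class name, field names, task labels, and the choice of the fine-grained caption as the text condition are all assumptions made for illustration, not the dataset's actual on-disk format.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AnimationSample:
    """Hypothetical per-animation record mirroring the modalities listed above."""
    lottie_path: str
    coarse_caption: str
    fine_caption: str
    video_path: str        # rendered video of the full animation
    keyframe_path: str     # single frame used for image conditioning
    semantic_tags: List[str] = field(default_factory=list)
    dominant_colors: List[str] = field(default_factory=list)
    motion_types: List[str] = field(default_factory=list)


def conditioning_inputs(sample: AnimationSample, task: str) -> Dict[str, str]:
    """Pick the inputs for one task; task names here are illustrative labels."""
    if task == "text":
        return {"text": sample.fine_caption}
    if task == "text_image":
        return {"text": sample.fine_caption, "image": sample.keyframe_path}
    if task == "video":
        return {"video": sample.video_path}
    raise ValueError(f"unknown task: {task}")


example = AnimationSample(
    lottie_path="sample.json",
    coarse_caption="A blue loading spinner.",
    fine_caption="A blue ring fades in, then rotates clockwise.",
    video_path="sample.mp4",
    keyframe_path="sample.png",
    semantic_tags=["UI elements"],
    dominant_colors=["blue", "white"],
    motion_types=["rotation", "opacity"],
)
```

Keeping all modalities on one record makes it straightforward to train the three conditioning modes from the same sample without duplicating the target Lottie file.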
3. Data Pipeline: Collection, Curation, and Normalization
The dataset construction follows a structured pipeline:
- A. Source aggregation: Web crawling yields the raw Lottie JSON files. Additionally, static SVGs (OmniSVG) are converted to Lottie format, with procedural animation yielding $0.8$ million synthetic clips.
- B. Cleaning and filtering: Lottie files containing base64 image layers, audio/camera, After Effects expressions, or non-parameterizable (e.g., 3D/data) layers are discarded.
- C. Spatio-temporal normalization:
  - Spatial: All animations are centered and scaled to a canvas using scaling factor .
  - Temporal: All animation key times, including , , and intermediate keyframes, are normalized to (approx. $0$–$60$ frames at 30 fps), so .
- D. Rendering: Each Lottie is rendered as MP4 (512 × 512 px, 30 fps), overlaid on a random pastel background (20-color palette), with a keyframe extracted for image annotation.
- E. Multi-modal annotation: Automated, two-stage captioning augmented with geometry/motion terms for alignment with downstream generative tasks.
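Steps B and C can be sketched roughly as follows. This is not the authors' implementation. The layer-type codes (`ty` values 2, 6, 13, 15), the `ddd` 3D flag, and the expression key `x` follow the general Lottie JSON conventions. The scaling rule (target size over the longer side), the linear time mapping, and the default window of 16 units are assumptions introduced here, and all function names are hypothetical.

```python
# Lottie layer "ty" codes for the layer kinds the cleaning step discards.
EXCLUDED_LAYER_TYPES = {2: "image", 6: "audio", 13: "camera", 15: "data"}


def iter_layers(doc):
    """Yield top-level layers and layers nested inside precomposition assets."""
    yield from doc.get("layers", [])
    for asset in doc.get("assets", []):
        yield from asset.get("layers", [])


def has_expression(node):
    """Heuristic: an expression is a string stored under a property's "x" key."""
    if isinstance(node, dict):
        if isinstance(node.get("x"), str):
            return True
        return any(has_expression(value) for value in node.values())
    if isinstance(node, list):
        return any(has_expression(value) for value in node)
    return False


def has_embedded_image(doc):
    """Embedded (base64) images appear as data URIs in the asset path field."""
    return any(
        str(asset.get("p", "")).startswith("data:")
        for asset in doc.get("assets", [])
    )


def keep_animation(doc):
    """Return True if the file passes the cleaning rules sketched in step B."""
    layers = list(iter_layers(doc))
    if doc.get("ddd") == 1 or any(layer.get("ddd") == 1 for layer in layers):
        return False  # 3D content
    if any(layer.get("ty") in EXCLUDED_LAYER_TYPES for layer in layers):
        return False
    if has_embedded_image(doc):
        return False
    return not has_expression(doc)


def scale_factor(width, height, canvas_size):
    """Assumed spatial rule: fit the longer side to the target canvas."""
    return canvas_size / max(width, height)


def rescale_time(t, in_point, out_point, window=16.0):
    """Assumed temporal rule: map [in_point, out_point] linearly to [0, window]."""
    return (t - in_point) / (out_point - in_point) * window


clean = {
    "v": "5.7.0", "fr": 30, "ip": 0, "op": 60, "w": 1024, "h": 512,
    "nm": "demo", "ddd": 0,
    "layers": [{"ty": 4, "ddd": 0, "ks": {"o": {"a": 0, "k": 100}}}],
}
with_audio = dict(clean, layers=clean["layers"] + [{"ty": 6}])
with_expression = dict(
    clean,
    layers=[{"ty": 4, "ks": {"r": {"a": 0, "k": 0, "x": "var $bm_rt = time * 90;"}}}],
)
```

Filtering before normalization keeps the later steps simple: once expressions and non-parameterizable layers are gone, every remaining time value and coordinate is a plain number that can be rescaled directly.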
4. Statistical and Structural Properties
The MMLottie-2M dataset displays extensive variation in both temporal and structural dimensions:
- Temporal characteristics
  - Web-crawled: Mean duration $3.2$ s ( s); 67% in s, 21% in s, 12% exceeding $8$ s.
  - Synthetic: Mean duration $2.8$ s ( s).
  - After normalization, all animations occupy a uniform 16-unit time window (approx. $90$ frames at $30$ fps).
- Structural features
  - Layer count: Web-crawled mean $8.6$ (max $324$), synthetic mean $3.2$ (max $45$).
  - Layer-type distribution: Shape (86.8%), Precomp (8.2%), Null (2.9%), Solid (1.5%), Text (0.6%).
  - Nesting depth: 78.3% of samples have depth , 16.7% depth , 5.0% depth .
- Control parameter statistics
  - Position: clustered in .
  - Scale: Usually , with outliers up to .
  - Rotation: Range (looped) or oscillatory .
  - Opacity: %.
  - Keyframe times: ; color channels: integer .
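Structural statistics such as the layer-type distribution and nesting depth can be computed directly from the Lottie JSON. The sketch below is illustrative and rests on assumptions. It counts only top-level layers in the histogram, and it defines nesting depth as 1 for a flat composition plus one level per chain of precomposition references. The source's exact definitions may differ. The `ty` codes and the `refId`/`assets` linkage follow general Lottie conventions, and the function names are hypothetical.

```python
from collections import Counter

# Lottie "ty" codes for the layer types named in the distribution above.
LAYER_TYPE_NAMES = {0: "precomp", 1: "solid", 3: "null", 4: "shape", 5: "text"}


def layer_type_histogram(doc):
    """Count top-level layers by type name; unknown codes fall under 'other'."""
    return Counter(
        LAYER_TYPE_NAMES.get(layer.get("ty"), "other")
        for layer in doc.get("layers", [])
    )


def nesting_depth(doc):
    """Depth of precomposition nesting (assumed: a flat file has depth 1)."""
    comps = {
        asset.get("id"): asset["layers"]
        for asset in doc.get("assets", [])
        if "layers" in asset
    }

    def depth(layers, seen):
        best = 1
        for layer in layers:
            ref = layer.get("refId")
            # Follow precomp references, guarding against reference cycles.
            if layer.get("ty") == 0 and ref in comps and ref not in seen:
                best = max(best, 1 + depth(comps[ref], seen | {ref}))
        return best

    return depth(doc.get("layers", []), frozenset())


nested = {
    "layers": [{"ty": 4}, {"ty": 0, "refId": "comp_1"}],
    "assets": [
        {"id": "comp_1", "layers": [{"ty": 4}, {"ty": 0, "refId": "comp_2"}]},
        {"id": "comp_2", "layers": [{"ty": 3}]},
    ],
}
```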
5. File Format Structure and Tokenization
To facilitate learning and ensure lossless representation, MMLottie-2M employs a structured Lottie tokenizer:
- JSON schema
  - Core fields: , , , , , , , , layers;
  - Conditional: , , , (present only as used).
- Tokenizer mapping
  - Each Lottie file is decomposed to metadata and a set of layers , where each layer for type .
  - Continuous parameters quantized: , where is parameter-type-specific scale and is the vocabulary offset.
  - The token sequence: .
  - Text fields (e.g., font names, captions) are subword-tokenized using a pretrained VLM tokenizer, prefixed with a length token.
  - Special marker and <END> tokens ensure full, structure-preserving encoding.
  - Reconstruction is lossless via inversion: and .
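The quantization step can be illustrated with a minimal scale-and-offset scheme. This is a sketch under stated assumptions, not the tokenizer's actual formula: it assumes a token is the rounded product of the value and a per-parameter scale, shifted by a vocabulary offset, and that inversion subtracts the offset and divides by the scale. The function names and the example scale and offset are hypothetical.

```python
def quantize(value, scale, offset):
    """Map a continuous parameter to an integer token id (assumed scheme)."""
    return int(round(value * scale)) + offset


def dequantize(token, scale, offset):
    """Invert quantize(): recover the parameter value from a token id."""
    return (token - offset) / scale


# Hypothetical settings for one parameter type: one decimal place of
# precision, with this parameter's tokens starting at vocabulary id 1000.
SCALE = 10
OFFSET = 1000

token = quantize(3.2, SCALE, OFFSET)
restored = dequantize(token, SCALE, OFFSET)
```

In this sketch the round trip is exact for values that lie on the grid of multiples of `1 / scale`. For other values the reconstruction error is at most `1 / (2 * scale)`, so the scale chosen per parameter type determines how faithfully each parameter is recovered.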
6. Comparison with Prior Datasets and Benchmarks
MMLottie-2M represents a substantial advance relative to previous publicly available Lottie datasets:
- Scale: 2 million samples, substantially larger than prior real-Lottie corpora, and uniquely offering comprehensive multi-modal annotation.
- Modal coverage: First dataset to provide unified text, image, and video conditioning for vector animation tasks.
- Annotation depth: Incorporates dual-stage, temporally fine-grained captions. Prior datasets offer only single-sentence or coarse-grained descriptions.
- Content diversity: Integrates professionally designed assets from five distinct sources and procedurally generated SVG-based animations, spanning broad stylistic and motion categories.
- Benchmarking: Establishes MMLottie-Bench, a standardized evaluation suite comprising both real and synthetic subsets, with $900$ holdout samples and LLM-judged alignment metrics (e.g., object/motion alignment).
A plausible implication is that the size, diversity, and annotation richness of MMLottie-2M enable robust model training and facilitate rigorous benchmark comparisons for generative systems targeting vector animation, such as OmniLottie (Yang et al., 2 Mar 2026).
Table: Source Composition of MMLottie-2M
| Source | Share of Web-Crawled Files (%) | Number of Files |
|---|---|---|
| LottieFiles | 42.3 | 507,600 |
| IconScout | 23.7 | 284,400 |
| Flaticon | 18.1 | 217,200 |
| Iconfont | 9.8 | 117,600 |
| Icons8 | 6.1 | 73,200 |
| Synthetic SVGs | — | 800,000 |
MMLottie-2M establishes a new standard for vector animation datasets, with design decisions tailored for supporting multi-modal generative modeling, evaluation, and analysis within the domain of Lottie JSON-encoded content (Yang et al., 2 Mar 2026).