MMLottie-2M Dataset for Multi-modal Animation

Updated 4 March 2026

MMLottie-2M is a large-scale vector animation dataset comprising 2M samples with detailed multi-modal annotations across text, image, and video modalities.
It employs a principled tokenizer and normalization pipeline to convert raw Lottie JSON files into structured, model-friendly representations.
The dataset integrates diverse sources—including web-crawled and synthetic SVG-derived animations—with standardized splits for robust training and evaluation.

MMLottie-2M is a large-scale, unified dataset of vector animations, created to advance research in multi-modal generative modeling for Lottie animations. Comprising two million samples paired with rich textual and visual annotations, MMLottie-2M emphasizes both professional design diversity and procedural motion coverage, supporting modeling tasks conditioned on text, image, and video modalities. Its curation incorporates a principled tokenizer and normalization pipeline to transform raw Lottie JSON files into structured, model-friendly representations, thereby facilitating the training and evaluation of next-generation vector animation models (Yang et al., 2 Mar 2026).

1. Dataset Composition and Source Breakdown

The MMLottie-2M dataset, denoted $\mathcal{D}$, totals $|\mathcal{D}| = 2\,000\,000$ vector animation samples. The sources fall into two primary classes:

Web-crawled Lottie files: $1.2$ million instances sampled from five distinct platforms: LottieFiles (42.3%), IconScout (23.7%), Flaticon (18.1%), Iconfont (9.8%), Icons8 (6.1%).
Synthetic SVG-derived Lottie files: $0.8$ million instances produced by animating static SVGs (from the OmniSVG collection) using $1$–$3$ randomly assigned procedural motion templates.

Data splits follow a standard protocol: 96% train ($|\mathcal{D}_{\text{train}}| = 1\,920\,000$), 2% validation ($|\mathcal{D}_{\text{val}}| = 40\,000$), and 2% test ($|\mathcal{D}_{\text{test}}| = 40\,000$). For benchmarking, the dataset defines MMLottie-Bench, a hold-out evaluation subset of $900$ cases (split equally by real/synthetic content and by modeling task), all disjoint from the training set.
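The per-platform and per-split counts follow directly from the stated percentages. The short check below reproduces them; the variable names are chosen here for illustration and are not part of the dataset release.

```python
# Totals stated in the text.
TOTAL = 2_000_000
WEB_CRAWLED = 1_200_000

# Platform shares of the web-crawled portion, in tenths of a percent
# (integers avoid floating-point rounding).
platform_share_permille = {
    "LottieFiles": 423,
    "IconScout": 237,
    "Flaticon": 181,
    "Iconfont": 98,
    "Icons8": 61,
}
platform_counts = {
    name: WEB_CRAWLED * share // 1000
    for name, share in platform_share_permille.items()
}

# Train/validation/test shares of the full dataset, in percent.
split_percent = {"train": 96, "val": 2, "test": 2}
split_counts = {name: TOTAL * pct // 100 for name, pct in split_percent.items()}

print(platform_counts)
print(split_counts)
```

The platform counts sum to the 1.2 million web-crawled files, and the split counts sum to the full two million samples.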

2. Modalities, Annotation Schema, and Data Pairing

Each animation is annotated with multiple modalities:

Textual descriptions
- Coarse caption: A global summary (mean $86 \pm 21$ words) describing high-level identity, colors, and visual style.
- Fine-grained temporal caption: Sequential details (mean $114 \pm 25$ words) structured temporally, emphasizing object identities, spatial layout, and explicit motion verbs (e.g., "fading in," "rotating clockwise").
Visual frames
- Rendered video: $512\times512$ px resolution, $30$ fps, capturing the full animation sequence.
- Keyframe image: Single frame from $t \in [0.2, 0.8] \cdot T$ to support the Text-Image-to-Lottie (TI₂Lottie) modality.
Multi-modal labels
- (image, text) pairs for TI₂Lottie.
- Raw video instructions for Video-to-Lottie tasks.
Additional metadata
- Semantic tags: Fifteen high-level categories (e.g., “UI elements”: 50%, “abstract patterns”: 20%, others).
- Color vocabulary: Dominant color tokens (e.g., "blue," "red," "white" among top-5).
- Motion types: Distribution primarily favoring translation, then rotation, scaling, opacity changes, and path morphing.

The annotation pipeline leverages Qwen2.5-VL for dual-stage captioning, sequentially producing a global overview and temporally organized per-frame details with explicit use of geometry and motion keywords.
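To make the annotation schema concrete, the sketch below groups the fields listed above into one record and shows one way to pick a keyframe time inside $[0.2, 0.8] \cdot T$. The class name, field names, and the uniform sampling rule are all assumptions made for illustration; the source does not specify the on-disk layout or how the keyframe time is chosen within the interval.

```python
import random
from dataclasses import dataclass, field


@dataclass
class MMLottieSample:
    """Hypothetical container for one annotated animation (not an official schema)."""
    lottie_json: dict
    coarse_caption: str          # global summary
    fine_caption: str            # temporally ordered description
    video_path: str              # rendered 512x512, 30 fps clip
    keyframe_path: str           # single frame used for image conditioning
    semantic_tags: list[str] = field(default_factory=list)
    dominant_colors: list[str] = field(default_factory=list)
    motion_types: list[str] = field(default_factory=list)


def sample_keyframe_time(duration: float, rng: random.Random) -> float:
    """Draw a time from [0.2*T, 0.8*T]; uniform sampling is an assumption."""
    return rng.uniform(0.2 * duration, 0.8 * duration)


def frame_index(t_seconds: float, fps: int = 30) -> int:
    """Index of the rendered frame at or just before time t."""
    return int(t_seconds * fps)


rng = random.Random(0)
t = sample_keyframe_time(3.0, rng)
print(t, frame_index(t))
```

Restricting the keyframe to the middle of the clip avoids the first and last frames, where fade-ins and fade-outs often leave the canvas nearly empty.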

3. Data Pipeline: Collection, Curation, and Normalization

The dataset construction follows a structured pipeline:

A. Source aggregation: Web crawling yields $\sim1.2$ million raw Lottie JSON files. Additionally, $\sim2$ million static SVGs (OmniSVG) are converted to Lottie format, with procedural animation yielding $0.8$ million synthetic clips.
B. Cleaning and filtering: Lottie files containing base64 image layers, audio/camera, After Effects expressions, or non-parameterizable (e.g., 3D/data) layers are discarded.
C. Spatio-temporal normalization:
- Spatial: All animations are centered and scaled to a $512 \times 512$ canvas using scaling factor $r = \min(512/w_{\text{orig}}, 512/h_{\text{orig}})$.
- Temporal: All animation key times, including $ip$, $op$, and intermediate keyframes, are normalized to $[0, 16]$ (approx. $0$–$60$ frames at 30 fps), so $t_{\text{norm}} = 16\cdot (t_{\text{orig}} - ip)/(op - ip)$.
D. Rendering: Each Lottie is rendered as an MP4 ($512 \times 512$ px, 30 fps), overlaid on a random pastel background (20-color palette), with a keyframe extracted for image annotation.
E. Multi-modal annotation: Automated, two-stage captioning augmented with geometry/motion terms for alignment with downstream generative tasks.
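Steps B and C can be sketched in a few functions. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names are hypothetical, the layer-type codes (6 audio, 13 camera, 15 data), the `ddd` flag for 3D layers, the data-URI test for embedded images, and the "string under an `x` key" test for expressions are assumptions about how these features appear in Lottie JSON, and the centering offsets are derived here from the stated scale factor.

```python
CANVAS = 512   # target canvas edge in pixels (from the text)
T_MAX = 16.0   # length of the normalized time window (from the text)

# Assumed Lottie layer-type codes for discarded layers: audio, camera, data.
UNSUPPORTED_LAYER_TYPES = {6, 13, 15}


def has_expression(node) -> bool:
    """Heuristic (assumption): a string stored under an "x" key is an expression."""
    if isinstance(node, dict):
        if isinstance(node.get("x"), str):
            return True
        return any(has_expression(v) for v in node.values())
    if isinstance(node, list):
        return any(has_expression(v) for v in node)
    return False


def all_layers(anim: dict) -> list:
    """Top-level layers plus layers nested inside precomposition assets."""
    layers = list(anim.get("layers", []))
    for asset in anim.get("assets", []):
        layers.extend(asset.get("layers", []))
    return layers


def keep_animation(anim: dict) -> bool:
    """Return False for files that step B would discard (under the assumptions above)."""
    for asset in anim.get("assets", []):
        path = asset.get("p")
        if isinstance(path, str) and path.startswith("data:"):
            return False  # embedded (base64) image
    for layer in all_layers(anim):
        if layer.get("ty") in UNSUPPORTED_LAYER_TYPES or layer.get("ddd") == 1:
            return False
    return not has_expression(anim)


def spatial_transform(w_orig: float, h_orig: float) -> tuple:
    """Scale factor r = min(512/w, 512/h) and offsets that center the result."""
    r = min(CANVAS / w_orig, CANVAS / h_orig)
    dx = (CANVAS - r * w_orig) / 2
    dy = (CANVAS - r * h_orig) / 2
    return r, dx, dy


def normalize_time(t_orig: float, ip: float, op: float) -> float:
    """t_norm = 16 * (t_orig - ip) / (op - ip)."""
    if op == ip:
        raise ValueError("animation has zero duration")
    return T_MAX * (t_orig - ip) / (op - ip)


print(spatial_transform(1024, 512))   # wide canvas: scaled by 0.5, padded vertically
print(normalize_time(30, 0, 60))      # midpoint of the clip
```

Using the minimum of the two ratios preserves the aspect ratio, so a non-square animation is padded on one axis rather than stretched.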

4. Statistical and Structural Properties

The MMLottie-2M dataset displays extensive variation in both temporal and structural dimensions:

Temporal characteristics
- Web-crawled: Mean duration $3.2$ s ($\sigma=2.1$ s); 67% in $[1,5]$ s, 21% in $[5,8]$ s, 12% exceeding $8$ s.
- Synthetic: Mean duration $2.8$ s ($\sigma=0.9$ s).
- After normalization, all animations occupy a uniform 16-unit time window (approx. $90$ frames at $30$ fps).
Structural features
- Layer count: Web-crawled mean $8.6$ (max $324$), synthetic mean $3.2$ (max $45$).
- Layer-type distribution: Shape (86.8%), Precomp (8.2%), Null (2.9%), Solid (1.5%), Text (0.6%).
- Nesting depth: 78.3% of samples have depth $=1$, 16.7% depth $=2$, 5.0% depth $\geq3$.
Control parameter statistics
- Position: $(x, y)$ clustered in $[-256, +256]$.
- Scale: Usually $[0.5, 2.0]$, with outliers up to $4\times$.
- Rotation: Range $[0^\circ, 360^\circ]$ (looped) or oscillatory $\pm180^\circ$.
- Opacity: $[0, 100]$%.
- Keyframe times: $[0, 16]$; color channels: integer $[0, 255]$.
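Structural statistics of this kind can be computed directly from a parsed Lottie file. The sketch below is one possible way to do so and rests on assumptions the source does not confirm: that layer counts refer to top-level layers, that nesting depth is 1 for a file with no precomposition and grows by one for each level of precomposition reference, and that precomposition layers point to assets through a `refId` key. The function names are hypothetical.

```python
from collections import Counter

# Layer-type codes paired with the type names used in the statistics above.
LAYER_TYPE_NAMES = {0: "Precomp", 1: "Solid", 3: "Null", 4: "Shape", 5: "Text"}


def layer_type_counts(anim: dict) -> Counter:
    """Count top-level layers by type name."""
    return Counter(
        LAYER_TYPE_NAMES.get(layer.get("ty"), "Other")
        for layer in anim.get("layers", [])
    )


def nesting_depth(anim: dict) -> int:
    """Depth 1 means no precomposition; each resolved precomp reference adds 1."""
    assets = {
        a["id"]: a
        for a in anim.get("assets", [])
        if "id" in a and "layers" in a
    }

    def depth(layers, seen) -> int:
        best = 1
        for layer in layers:
            ref = layer.get("refId")
            # `seen` guards against cyclic references between assets.
            if layer.get("ty") == 0 and ref in assets and ref not in seen:
                best = max(best, 1 + depth(assets[ref]["layers"], seen | {ref}))
        return best

    return depth(anim.get("layers", []), frozenset())


example = {
    "layers": [{"ty": 4}, {"ty": 0, "refId": "comp_0"}],
    "assets": [{"id": "comp_0", "layers": [{"ty": 4}, {"ty": 3}]}],
}
print(layer_type_counts(example), nesting_depth(example))
```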

5. File Format Structure and Tokenization

To facilitate learning and ensure lossless representation, MMLottie-2M employs a structured Lottie tokenizer:

JSON schema
- Core fields: $v$, $fr$, $ip$, $op$, $w$, $h$, $nm$, $ddd$, $layers$;
- Conditional: $assets$, $markers$, $fonts$, $chars$ (present only as used).
Tokenizer mapping
- Each Lottie file is decomposed into metadata $\mathcal{M}$ and a set of layers $\{\mathcal{L}_i\}$, where each layer $\mathcal{L}_i = (\tau_i, \mathcal{A}_{\tau_i}, \mathcal{T}_i, \mathcal{E}_i)$ for type $\tau_i \in \{0,1,3,4,5\}$.
- Continuous parameters are quantized: $\mathrm{token}(x, t) = \lfloor x \cdot s_t \rfloor + o_t$, where $s_t$ is the parameter-type-specific scale and $o_t$ is the vocabulary offset.
- The token sequence: $\mathcal{T} = [\text{CMD}_1, p_{1,1}, \ldots, p_{1,k_1}, \ldots, \text{CMD}_M, p_{M,1}, \ldots, p_{M,k_M}]$.
- Text fields (e.g., font names, captions) are subword-tokenized using a pretrained VLM tokenizer, prefixed with a length token.
- Special marker and <END> tokens ensure full, structure-preserving encoding.
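- Reconstruction is lossless at the structural level via inversion: $p = (\mathrm{token} - o_t)/s_t$ and $\text{text} = V_{\text{text}}^{-1}(\ldots)$. Because quantization uses a floor, each numeric parameter is recovered to within one quantization step $1/s_t$, and exactly when $x \cdot s_t$ is an integer.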
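The quantization rule and its inverse can be written in a few lines. The sketch below implements the two formulas as stated; the parameter-type table (scales and offsets), the command token value, and the function names are hypothetical placeholders, since the source does not give the actual vocabulary layout.

```python
import math


def quantize(x: float, scale: float, offset: int) -> int:
    """token(x, t) = floor(x * s_t) + o_t."""
    return math.floor(x * scale) + offset


def dequantize(token: int, scale: float, offset: int) -> float:
    """p = (token - o_t) / s_t."""
    return (token - offset) / scale


# Hypothetical parameter-type table: name -> (scale s_t, vocabulary offset o_t).
PARAM_TYPES = {
    "time": (4.0, 1000),     # quarter-unit resolution on the [0, 16] time axis
    "opacity": (1.0, 2000),  # integer percent
}


def encode_command(cmd_token: int, params: list) -> list:
    """Emit [CMD, p_1, ..., p_k] for one command; params is a list of (type, value)."""
    return [cmd_token] + [quantize(x, *PARAM_TYPES[t]) for t, x in params]


tokens = encode_command(5, [("time", 7.3), ("opacity", 50.0)])
recovered_time = dequantize(tokens[1], *PARAM_TYPES["time"])
print(tokens, recovered_time)
```

With a scale of 4, the time value 7.3 maps to the grid point 7.25, so the round-trip error is 0.05, below the step size of 0.25. Values that already lie on the grid, such as the opacity of 50, are recovered exactly.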

6. Comparison with Prior Datasets and Benchmarks

MMLottie-2M represents a substantial advance relative to previous publicly available Lottie datasets:

Scale: $2\,000\,000$ samples, substantially larger than prior real-Lottie corpora (typically $<100\,000$) and uniquely offering comprehensive multi-modal annotation.
Modal coverage: First dataset to provide unified text, image, and video conditioning for vector animation tasks.
Annotation depth: Incorporates dual-stage, temporally fine-grained captions. Prior datasets offer only single-sentence or coarse-grained descriptions.
Content diversity: Integrates professionally designed assets from five distinct sources and procedurally generated SVG-based animations, spanning broad stylistic and motion categories.
Benchmarking: Establishes MMLottie-Bench, a standardized evaluation suite comprising both real and synthetic subsets, with $900$ holdout samples and LLM-judged alignment metrics (e.g., object/motion alignment).

A plausible implication is that the size, diversity, and annotation richness of MMLottie-2M enable robust model training and facilitate rigorous benchmark comparisons for generative systems targeting vector animation, such as OmniLottie (Yang et al., 2 Mar 2026).

Table: Source Composition of MMLottie-2M

Source	Share of web-crawled files (%)	Number of files
LottieFiles	42.3	$\approx$ 507,600
IconScout	23.7	$\approx$ 284,400
Flaticon	18.1	$\approx$ 217,200
Iconfont	9.8	$\approx$ 117,600
Icons8	6.1	$\approx$ 73,200
Synthetic SVGs	—	800,000

MMLottie-2M establishes a new standard for vector animation datasets, with design decisions tailored for supporting multi-modal generative modeling, evaluation, and analysis within the domain of Lottie JSON-encoded content (Yang et al., 2 Mar 2026).

References

Yang et al. OmniLottie: Generating Vector Animations via Parameterized Lottie Tokens (2 Mar 2026)
