
MMLottie-2M Dataset for Multi-modal Animation

Updated 4 March 2026
  • MMLottie-2M is a large-scale vector animation dataset comprising 2M samples with detailed multi-modal annotations across text, image, and video modalities.
  • It employs a principled tokenizer and normalization pipeline to convert raw Lottie JSON files into structured, model-friendly representations.
  • The dataset integrates diverse sources—including web-crawled and synthetic SVG-derived animations—with standardized splits for robust training and evaluation.

MMLottie-2M is a large-scale, unified dataset of vector animations, created to advance research in multi-modal generative modeling for Lottie animations. Comprising two million samples paired with rich textual and visual annotations, MMLottie-2M emphasizes both professional design diversity and procedural motion coverage, supporting modeling tasks conditioned on text, image, and video modalities. Its curation incorporates a principled tokenizer and normalization pipeline to transform raw Lottie JSON files into structured, model-friendly representations, thereby facilitating the training and evaluation of next-generation vector animation models (Yang et al., 2 Mar 2026).

1. Dataset Composition and Source Breakdown

The MMLottie-2M dataset, denoted $\mathcal{D}$, totals $|\mathcal{D}| = 2\,000\,000$ vector animation samples. Source distribution comprises two primary classes:

  • Web-crawled Lottie files: $1.2$ million instances sampled from five distinct platforms: LottieFiles (42.3%), IconScout (23.7%), Flaticon (18.1%), Iconfont (9.8%), Icons8 (6.1%).
  • Synthetic SVG-derived Lottie files: $0.8$ million instances produced by animating static SVGs (from the OmniSVG collection) using $1$–$3$ randomly assigned procedural motion templates.

Data splits follow a standard protocol: 96% train ($|\mathcal{D}_{\text{train}}| = 1\,920\,000$), 2% validation ($|\mathcal{D}_{\text{val}}| = 40\,000$), 2% test ($|\mathcal{D}_{\text{test}}| = 40\,000$). For benchmarking, the dataset defines MMLottie-Bench, a hold-out evaluation subset comprising $900$ cases (split equally by real/synthetic content and by modeling task), all disjoint from the training set.
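The split sizes follow directly from the percentages. The sketch below recomputes them and also shows one possible way to assign samples to splits deterministically. The hash-based assignment, the function names, and the sample ID format are assumptions made for illustration; the source does not describe how samples were actually assigned.

```python
import hashlib

TOTAL_SAMPLES = 2_000_000
# Integer percentages avoid floating-point rounding in the size arithmetic.
SPLIT_PERCENT = {"train": 96, "val": 2, "test": 2}


def split_sizes(total=TOTAL_SAMPLES, percent=SPLIT_PERCENT):
    """Return the number of samples per split."""
    return {name: total * p // 100 for name, p in percent.items()}


def assign_split(sample_id, percent=SPLIT_PERCENT):
    """Hypothetical deterministic assignment: hash the ID into 100 buckets."""
    digest = hashlib.sha256(sample_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    upper = 0
    for name, p in percent.items():
        upper += p
        if bucket < upper:
            return name
    raise ValueError("split percentages must sum to 100")


print(split_sizes())  # {'train': 1920000, 'val': 40000, 'test': 40000}
```

A hash-based rule like this keeps a sample in the same split across reruns, which is one common way to keep a hold-out set disjoint from training data.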

2. Modalities, Annotation Schema, and Data Pairing

Each animation is annotated with multiple modalities:

  • Textual descriptions
    • Coarse caption: A global summary (mean $86 \pm 21$ words) describing high-level identity, colors, and visual style.
    • Fine-grained temporal caption: Sequential details (mean $114 \pm 25$ words) structured temporally, emphasizing object identities, spatial layout, and explicit motion verbs (e.g., "fading in," "rotating clockwise").
  • Visual frames
    • Rendered video: $512\times512$ px resolution, $30$ fps, capturing the full animation sequence.
    • Keyframe image: Single frame from $t \in [0.2, 0.8] \cdot T$ to support the Text-Image-to-Lottie (TI₂Lottie) modality.
  • Multi-modal labels
    • (image, text) pairs for TI₂Lottie.
    • Raw video instructions for Video-to-Lottie tasks.
  • Additional metadata
    • Semantic tags: Fifteen high-level categories (e.g., “UI elements”: 50%, “abstract patterns”: 20%, others).
    • Color vocabulary: Dominant color tokens (e.g., "blue," "red," "white" among top-5).
    • Motion types: Distribution primarily favoring translation, then rotation, scaling, opacity changes, and path morphing.
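The keyframe image is taken from the middle portion of the animation, $t \in [0.2, 0.8] \cdot T$. A minimal sketch of such a selection is given here, assuming the time is drawn uniformly from that window; the source states only the window, so the uniform draw and the function names are illustrative assumptions.

```python
import random


def sample_keyframe_time(duration_s, rng=None, lo=0.2, hi=0.8):
    """Draw a time in [lo*T, hi*T] seconds (uniform draw is an assumption)."""
    rng = rng or random.Random()
    return rng.uniform(lo * duration_s, hi * duration_s)


def keyframe_frame_index(duration_s, fps=30, rng=None):
    """Convert the sampled time to a frame index at the rendered frame rate."""
    return int(sample_keyframe_time(duration_s, rng) * fps)


rng = random.Random(0)          # seeded for reproducibility
t = sample_keyframe_time(3.2, rng)
print(0.64 <= t <= 2.56)        # True: t lies inside [0.2*T, 0.8*T]
```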

The annotation pipeline leverages Qwen2.5-VL for dual-stage captioning, sequentially producing a global overview and temporally organized per-frame details with explicit use of geometry and motion keywords.

3. Data Pipeline: Collection, Curation, and Normalization

The dataset construction follows a structured pipeline:

  • A. Source aggregation: Web crawling yields $\sim 1.2$ million raw Lottie JSON files. Additionally, $\sim 2$ million static SVGs (OmniSVG) are converted to Lottie format, with procedural animation yielding $0.8$ million synthetic clips.
  • B. Cleaning and filtering: Lottie files containing base64 image layers, audio/camera, After Effects expressions, or non-parameterizable (e.g., 3D/data) layers are discarded.
  • C. Spatio-temporal normalization:
    • Spatial: All animations are centered and scaled to a $512 \times 512$ canvas using scaling factor $r = \min(512/w_{\text{orig}}, 512/h_{\text{orig}})$.
    • Temporal: All animation key times, including $ip$, $op$, and intermediate keyframes, are normalized to $[0, 16]$ (approx. $0$–$60$ frames at 30 fps), so $t_{\text{norm}} = 16 \cdot (t_{\text{orig}} - ip)/(op - ip)$.
  • D. Rendering: Each Lottie is rendered as MP4 (512 × 512 px, 30 fps), overlaid on a random pastel background (20-color palette), with a keyframe extracted for image annotation.
  • E. Multi-modal annotation: Automated, two-stage captioning augmented with geometry/motion terms for alignment with downstream generative tasks.
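Steps B and C can be sketched in a few lines. The code below is a minimal illustration under stated assumptions, not the authors' pipeline: the layer-type whitelist reuses the type set $\{0,1,3,4,5\}$ from the tokenizer description, the `ddd` check stands in for 3D filtering, expression detection is omitted, and the centering offset is one plausible reading of "centered and scaled". All function names are hypothetical.

```python
SUPPORTED_LAYER_TYPES = {0, 1, 3, 4, 5}  # precomp, solid, null, shape, text


def iter_layers(lottie):
    """Yield top-level layers and layers stored inside precomp assets."""
    yield from lottie.get("layers", [])
    for asset in lottie.get("assets", []):
        yield from asset.get("layers", [])


def passes_layer_filter(lottie):
    """Reject files with 3D flags or layer types outside the whitelist."""
    if lottie.get("ddd", 0) == 1:
        return False
    return all(
        layer.get("ty") in SUPPORTED_LAYER_TYPES and layer.get("ddd", 0) != 1
        for layer in iter_layers(lottie)
    )


def spatial_scale(w_orig, h_orig, target=512):
    """r = min(512 / w_orig, 512 / h_orig)."""
    return min(target / w_orig, target / h_orig)


def normalize_point(x, y, w_orig, h_orig, target=512):
    """Scale a point by r, then center the scaled canvas (assumed offset)."""
    r = spatial_scale(w_orig, h_orig, target)
    off_x = (target - r * w_orig) / 2
    off_y = (target - r * h_orig) / 2
    return r * x + off_x, r * y + off_y


def normalize_time(t_orig, ip, op, span=16.0):
    """t_norm = 16 * (t_orig - ip) / (op - ip)."""
    if op == ip:
        raise ValueError("op and ip must differ")
    return span * (t_orig - ip) / (op - ip)


print(spatial_scale(1024, 512))             # 0.5
print(normalize_point(0, 0, 1024, 512))     # (0.0, 128.0)
print(normalize_time(45, ip=0, op=90))      # 8.0
```

Taking the minimum of the two ratios preserves the aspect ratio, so a wide animation is letterboxed vertically instead of being stretched.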

4. Statistical and Structural Properties

The MMLottie-2M dataset displays extensive variation in both temporal and structural dimensions:

  • Temporal characteristics
    • Web-crawled: Mean duration $3.2$ s ($\sigma = 2.1$ s); 67% in $[1,5]$ s, 21% in $[5,8]$ s, 12% exceeding $8$ s.
    • Synthetic: Mean duration $2.8$ s ($\sigma = 0.9$ s).
    • After normalization, all animations occupy a uniform 16-unit time window (approx. $90$ frames at $30$ fps).
  • Structural features
    • Layer count: Web-crawled mean $8.6$ (max $324$), synthetic mean $3.2$ (max $45$).
    • Layer-type distribution: Shape (86.8%), Precomp (8.2%), Null (2.9%), Solid (1.5%), Text (0.6%).
    • Nesting depth: 78.3% of samples have depth $= 1$, 16.7% depth $= 2$, 5.0% depth $\geq 3$.
  • Control parameter statistics
    • Position: $(x, y)$ clustered in $[-256, +256]$.
    • Scale: Usually $[0.5, 2.0]$, with outliers up to $4\times$.
    • Rotation: Range $[0^\circ, 360^\circ]$ (looped) or oscillatory $\pm 180^\circ$.
    • Opacity: $[0, 100]$%.
    • Keyframe times: $[0, 16]$; color channels: integer $[0, 255]$.
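Structural statistics such as the layer-type distribution and nesting depth can be read from a Lottie file's `ty` and `refId` fields. The sketch below is one way to compute them, assuming that nesting depth counts the top-level composition as depth 1 and adds one level per referenced precomposition; the source does not define depth precisely, so that convention and the function names are assumptions.

```python
from collections import Counter

LAYER_TYPE_NAMES = {0: "Precomp", 1: "Solid", 3: "Null", 4: "Shape", 5: "Text"}


def layer_type_counts(lottie):
    """Count layers by type across the main composition and precomp assets."""
    counts = Counter()
    groups = [lottie.get("layers", [])]
    groups += [a.get("layers", []) for a in lottie.get("assets", [])]
    for layers in groups:
        for layer in layers:
            ty = layer.get("ty")
            counts[LAYER_TYPE_NAMES.get(ty, "other({})".format(ty))] += 1
    return counts


def nesting_depth(lottie):
    """Depth 1 = no precomps; each level of precomp reference adds 1."""
    assets = {
        a["id"]: a["layers"]
        for a in lottie.get("assets", [])
        if "id" in a and "layers" in a
    }

    def depth(layers, seen):
        best = 1
        for layer in layers:
            ref = layer.get("refId")
            # Skip unknown references and guard against reference cycles.
            if layer.get("ty") == 0 and ref in assets and ref not in seen:
                best = max(best, 1 + depth(assets[ref], seen | {ref}))
        return best

    return depth(lottie.get("layers", []), frozenset())


example = {
    "layers": [{"ty": 4}, {"ty": 0, "refId": "comp_0"}],
    "assets": [{"id": "comp_0", "layers": [{"ty": 4}, {"ty": 4}]}],
}
print(layer_type_counts(example))  # Counter({'Shape': 3, 'Precomp': 1})
print(nesting_depth(example))      # 2
```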

5. File Format Structure and Tokenization

To facilitate learning and ensure lossless representation, MMLottie-2M employs a structured Lottie tokenizer:

  • JSON schema
    • Core fields: $v$, $fr$, $ip$, $op$, $w$, $h$, $nm$, $ddd$, layers;
    • Conditional: assets, markers, fonts, chars (present only as used).
  • Tokenizer mapping
    • Each Lottie file is decomposed to metadata $\mathcal{M}$ and a set of layers $\{\mathcal{L}_i\}$, where each layer $\mathcal{L}_i = (\tau_i, \mathcal{A}_{\tau_i}, \mathcal{T}_i, \mathcal{E}_i)$ for type $\tau_i \in \{0,1,3,4,5\}$.
    • Continuous parameters quantized: $\mathrm{token}(x, t) = \lfloor x \cdot s_t \rfloor + o_t$, where $s_t$ is parameter-type-specific scale and $o_t$ is the vocabulary offset.
    • The token sequence: $\mathcal{T} = [\text{CMD}_1, p_{1,1}, \ldots, p_{1,k_1}, \ldots, \text{CMD}_M, p_{M,k_M}]$.
    • Text fields (e.g., font names, captions) are subword-tokenized using a pretrained VLM tokenizer, prefixed with a length token.
    • Special marker and <END> tokens ensure full, structure-preserving encoding.
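    • Reconstruction inverts the encoding: $p = (\mathrm{token} - o_t)/s_t$ and $\text{text} = V_{\text{text}}^{-1}(\ldots)$. Because of the floor operation, continuous parameters are recovered to within one quantization step $1/s_t$, so reconstruction is lossless at the token level.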
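The quantization rule and its inverse can be illustrated with a short sketch. The scale and offset values, the command ID, and the function names below are hypothetical placeholders chosen for the example; the source gives the formulas but not the actual per-type constants or vocabulary layout.

```python
import math

# Hypothetical per-parameter-type (scale s_t, offset o_t) pairs.
PARAM_SPECS = {
    "time": (4, 1000),      # quarter-unit resolution
    "position": (1, 2000),  # whole-pixel resolution
}


def quantize(x, param_type):
    """token(x, t) = floor(x * s_t) + o_t"""
    scale, offset = PARAM_SPECS[param_type]
    return math.floor(x * scale) + offset


def dequantize(token, param_type):
    """p = (token - o_t) / s_t"""
    scale, offset = PARAM_SPECS[param_type]
    return (token - offset) / scale


def encode_command(cmd_id, params):
    """Emit [CMD, p_1, ..., p_k] for a list of (value, param_type) pairs."""
    return [cmd_id] + [quantize(value, ptype) for value, ptype in params]


tok = quantize(-3.25, "time")
print(tok)                         # 987
print(dequantize(tok, "time"))     # -3.25
print(encode_command(7, [(8.0, "time"), (-120.6, "position")]))  # [7, 1032, 1879]
```

Values that are exact multiples of $1/s_t$ round-trip unchanged; any other value comes back rounded down to the nearest step, as with $-120.6$ becoming $-121$ at whole-pixel resolution.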

6. Comparison with Prior Datasets and Benchmarks

MMLottie-2M represents a substantial advance relative to previous publicly available Lottie datasets:

  • Scale: $2\,000\,000$ samples, substantially larger than prior real-Lottie corpora (typically $<100\,000$) and uniquely offering comprehensive multi-modal annotation.
  • Modal coverage: First dataset to provide unified text, image, and video conditioning for vector animation tasks.
  • Annotation depth: Incorporates dual-stage, temporally fine-grained captions. Prior datasets offer only single-sentence or coarse-grained descriptions.
  • Content diversity: Integrates professionally designed assets from five distinct sources and procedurally generated SVG-based animations, spanning broad stylistic and motion categories.
  • Benchmarking: Establishes MMLottie-Bench, a standardized evaluation suite comprising both real and synthetic subsets, with $900$ holdout samples and LLM-judged alignment metrics (e.g., object/motion alignment).

A plausible implication is that the size, diversity, and annotation richness of MMLottie-2M enable robust model training and facilitate rigorous benchmark comparisons for generative systems targeting vector animation, such as OmniLottie (Yang et al., 2 Mar 2026).


Table: Source Composition of MMLottie-2M

| Source | Share of web-crawled files (%) | Number of files |
|---|---|---|
| LottieFiles | 42.3 | $\approx$507,600 |
| IconScout | 23.7 | $\approx$284,400 |
| Flaticon | 18.1 | $\approx$217,200 |
| Iconfont | 9.8 | $\approx$117,600 |
| Icons8 | 6.1 | $\approx$73,200 |
| Synthetic SVGs | n/a | 800,000 |
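The file counts in the table follow from applying each platform's share to the $1.2$ million web-crawled files. The short check below reproduces that arithmetic; it uses only figures stated in this article, with shares expressed in tenths of a percent to keep the computation in integers.

```python
WEB_CRAWLED_TOTAL = 1_200_000
SYNTHETIC_TOTAL = 800_000

# Shares of the web-crawled portion, in tenths of a percent (42.3% -> 423).
SHARE_PER_MILLE = {
    "LottieFiles": 423,
    "IconScout": 237,
    "Flaticon": 181,
    "Iconfont": 98,
    "Icons8": 61,
}

counts = {
    name: WEB_CRAWLED_TOTAL * share // 1000
    for name, share in SHARE_PER_MILLE.items()
}

for name, n in counts.items():
    print("{:<12} {:>9,}".format(name, n))

print(sum(SHARE_PER_MILLE.values()))            # 1000, i.e. shares sum to 100%
print(sum(counts.values()) + SYNTHETIC_TOTAL)   # 2000000
```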

MMLottie-2M establishes a new standard for vector animation datasets, with design decisions tailored for supporting multi-modal generative modeling, evaluation, and analysis within the domain of Lottie JSON-encoded content (Yang et al., 2 Mar 2026).
