
OmniLottie: Vector Animation Framework

Updated 4 March 2026
  • OmniLottie is a versatile framework that generates vector animations from text, image, and video inputs using the Lottie format.
  • It employs a custom Lottie tokenizer to compress and encode essential commands, reducing token length by approximately 5.3×.
  • Built on a multimodal vision-language transformer and the extensive MMLottie-2M dataset, it delivers state-of-the-art synthesis performance.

OmniLottie is a versatile end-to-end framework for generating high-quality vector animations from multi-modal instructions, leveraging the Lottie format as its core animation representation. It introduces a custom Lottie tokenizer and is built atop a multimodal vision-language transformer, achieving state-of-the-art results on vector animation synthesis tasks from text, image, and video prompts (Yang et al., 2 Mar 2026).

1. The Lottie Format: Structure and Modeling Challenges

Lottie is a nested JSON-based interchange format encoding vector graphics (shapes, fills, strokes, solids, text) along with animation directives (keyframes, transforms, masks, effects, parenting, etc.). Each Lottie file encompasses:

  • Global metadata: version (v), frame rate (fr), dimensions (w, h), start/end frame indices (ip, op)
  • Ordered array of layers, each described by:
    • Base properties (name, type, timeline parameters, hierarchy, etc.)
    • Visual properties (position, scale, rotation, opacity, effects, blend modes)
    • Type-specific attributes (e.g., path data, text, external assets)

The Lottie schema imposes strict syntactic requirements; even trivial formatting violations yield invalid JSON. Approximately 80% of raw tokens represent structural metadata and formatting tokens, resulting in sequences of 2–3k tokens per file. This intrinsic verbosity and the heavy entanglement of geometry and motion specifications introduce significant obstacles for sequence modeling and autoregressive generation, including excessive sequence length, highly repetitive structure, and stringent validity constraints.
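To make the nesting concrete, the following is a minimal sketch of a Lottie document built as a Python dictionary. The field names (v, fr, ip, op, w, h, layers, ty, ks) follow the Lottie format; every concrete value is illustrative and not taken from the paper. The punctuation count at the end is only a crude proxy for structural overhead, not the paper's 80% measurement.

```python
import json

# Minimal Lottie document: global metadata plus one shape layer.
# All concrete values below are illustrative assumptions.
lottie_doc = {
    "v": "5.7.0",   # format version
    "fr": 30,       # frame rate
    "ip": 0,        # in-point (first frame)
    "op": 60,       # out-point (last frame)
    "w": 512,       # canvas width
    "h": 512,       # canvas height
    "layers": [
        {
            "ty": 4,            # layer type: 4 = shape layer
            "nm": "circle",     # layer name
            "ip": 0,
            "op": 60,
            "st": 0,            # start time
            "ks": {             # transform properties
                "o": {"a": 0, "k": 100},         # opacity (static)
                "r": {"a": 0, "k": 0},           # rotation (static)
                "p": {"a": 0, "k": [256, 256]},  # position
                "a": {"a": 0, "k": [0, 0]},      # anchor point
                "s": {"a": 0, "k": [100, 100]},  # scale in percent
            },
            "shapes": [],       # path, fill, and stroke items would go here
        }
    ],
}

serialized = json.dumps(lottie_doc)
# Crude proxy: characters that are neither letters nor digits.
non_alnum = sum(not ch.isalnum() for ch in serialized)
print(f"{len(serialized)} characters, {non_alnum} of them punctuation or whitespace")
```

Even this single static layer spends a large share of its characters on braces, quotes, and separators, which is the kind of overhead the section describes.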

2. The Lottie Tokenizer

To address the inefficiencies of direct JSON modeling, OmniLottie introduces a Lottie-specific tokenizer that strips away invariant syntax and encodes only semantically relevant commands and parameters. This approach includes:

  • Extraction of essential metadata: retaining only core fields (v, fr, ip, op, w, h)
  • Transformation of each layer into a sequence beginning with a type marker (e.g., ⟨LAYER-4⟩), followed by quantized numeric parameters and text fields
  • Numeric quantization:

$$\text{token}(x, t) = \lfloor x \cdot s_t \rfloor + o_t$$

where $s_t$ is a scale, $o_t$ is a vocabulary offset, and $t$ indicates the parameter type (spatiotemporal, opacity, etc.)

  • Text field handling: text fields are tokenized with the vision-language model (VLM) tokenizer and prefixed with their length
  • Lossless round-trip encoding: decode by subtracting the offset, dividing by the scale, and detokenizing text

Algorithmically, encoding parses the JSON into metadata and layers, recursively converts layer parameters, and appends tokens to a sequential representation; decoding exactly inverts the process (see Algorithm 1 in Yang et al., 2 Mar 2026). The resulting token sequences provide roughly 5.3× compression, reducing per-sample length from ∼2.6k to ∼0.5k tokens, and sharply increase model throughput and output validity.
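The quantization rule and the layer serialization described in this section can be sketched as follows. This is not the paper's Algorithm 1: the scale and offset values, the `PARAM_TYPES` table, the marker string, and the character-level stand-in for the VLM tokenizer are all assumptions made for illustration.

```python
import math

# Hypothetical per-type scale s_t and vocabulary offset o_t.
# The values actually used by OmniLottie are not given in this summary.
PARAM_TYPES = {
    "spatial": {"scale": 10, "offset": 1000},
    "opacity": {"scale": 1, "offset": 20000},
}

def quantize(x, ptype):
    """token(x, t) = floor(x * s_t) + o_t"""
    cfg = PARAM_TYPES[ptype]
    return math.floor(x * cfg["scale"]) + cfg["offset"]

def dequantize(token, ptype):
    """Invert quantize: subtract the offset, then divide by the scale."""
    cfg = PARAM_TYPES[ptype]
    return (token - cfg["offset"]) / cfg["scale"]

def encode_layer(layer_type, position, opacity, name):
    """Flatten one layer: type marker, quantized numbers, length-prefixed text."""
    tokens = [f"<LAYER-{layer_type}>"]
    tokens += [quantize(v, "spatial") for v in position]
    tokens.append(quantize(opacity, "opacity"))
    # Stand-in for the VLM tokenizer: one token per character (assumption).
    text_tokens = list(name)
    tokens.append(len(text_tokens))  # length prefix
    tokens += text_tokens
    return tokens

seq = encode_layer(4, [256.0, 128.5], 100, "circle")
print(seq)
# ['<LAYER-4>', 3560, 2285, 20100, 6, 'c', 'i', 'r', 'c', 'l', 'e']
```

Under this rule the round trip is exact whenever a value is a multiple of 1/s_t; otherwise the floor operation loses less than 1/s_t, so the choice of scale sets the precision of each parameter type.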

3. Architecture of the OmniLottie Framework

OmniLottie utilizes Qwen2.5-VL, a large multimodal vision-language transformer pretrained on image, video, and text modalities. It extends the base model’s vocabulary with randomly initialized embeddings for new command and parameter tokens introduced by the tokenizer.
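Vocabulary extension of this kind can be illustrated with a framework-free sketch: the pretrained embedding rows are kept unchanged and randomly initialized rows are appended for the new tokens. The token names, the embedding dimension, and the initialization standard deviation of 0.02 are assumptions; a real setup would rely on the training framework's own facility (for example, `resize_token_embeddings` in Hugging Face Transformers) rather than Python lists.

```python
import random

def extend_embeddings(table, num_new, std=0.02, seed=0):
    """Return a new table: pretrained rows copied as-is, num_new random rows appended."""
    rng = random.Random(seed)
    dim = len(table[0])
    new_rows = [[rng.gauss(0.0, std) for _ in range(dim)] for _ in range(num_new)]
    return [row[:] for row in table] + new_rows

# Toy "pretrained" table with two tokens and three dimensions (illustrative).
pretrained = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]

# Hypothetical new command and parameter tokens.
new_tokens = ["<LAYER-4>", "<NUM-0>", "<NUM-1>"]

extended = extend_embeddings(pretrained, len(new_tokens))
# New tokens receive ids directly after the existing vocabulary.
token_to_id = {tok: len(pretrained) + i for i, tok in enumerate(new_tokens)}
print(len(extended), token_to_id)
```

Because the original rows are untouched, the base model's behavior on ordinary text is preserved at initialization, while the new rows are learned during fine-tuning.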

Input prompts may consist of:

  • Text-only instructions
  • Text plus image tokens (for Text+Image→Lottie generation)
  • Video frame tokens (for Video→Lottie tasks)
  • Special separator markers demarcating modality boundaries

The model is trained autoregressively to predict each Lottie token $x_s^{[i]}$ conditioned on all prior outputs $x_s^{[<i]}$ and the prompt context $x_c$ (text, image, and/or video). The learning objective is cross-entropy minimization over the discrete token space:

$$\theta^* = \arg\min_\theta -\sum_{i=1}^{L} \log P\left(x_s^{[i]} \mid x_c; x_s^{[<i]}; \theta\right)$$
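The quantity being minimized can be sketched numerically. The example assumes that `step_logits[i]` already holds the model's output scores for position i, conditioned on the prompt and on all earlier tokens; the toy logits and targets are made up for illustration.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary distribution."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def sequence_nll(step_logits, targets):
    """Sum over positions of -log P(target token | prompt, earlier tokens)."""
    return -sum(log_softmax(logits)[t] for logits, t in zip(step_logits, targets))

# Toy case: three positions, vocabulary of four tokens, uniform scores.
step_logits = [[0.0] * 4, [0.0] * 4, [0.0] * 4]
targets = [1, 3, 0]
nll = sequence_nll(step_logits, targets)
print(nll)  # equals 3 * log(4) for a uniform model
```

A model that assigns uniform probability pays log(V) per token for a vocabulary of size V; training lowers this sum by concentrating probability on the correct next token.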

At inference, the transformer emits a Lottie token sequence, which is detokenized into a valid, renderable JSON file.

4. The MMLottie-2M Dataset and Benchmark

OmniLottie’s training corpus, MMLottie-2M, comprises 2 million Lottie animations paired with text and, where appropriate, image or video annotations. Sourcing and processing are as follows:

  • Web-crawled Lottie animations (∼1.2M) from five platforms (LottieFiles 42%, IconScout 24%, Flaticon 18%, Iconfont 10%, Icons8 6%), with filtering to exclude base64-encoded images, audio channels, and After Effects expressions
  • SVG-derived dynamic animations (∼0.8M) generated by animating vector shapes (translations, rotations, scaling, fading)
  • Motion-transfer augmentation: trajectory clustering over 1M ‘native’ Lotties to form motion templates (e.g., “fade-in + upward + scale-down”) applied to static SVGs
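How a motion template might be applied to a static layer can be sketched as below, using the "fade-in + upward + scale-down" example. The function name, the frame range, the rise distance, and the final scale are hypothetical, and this is not the paper's pipeline. The keyframe layout (a = 1, with k holding entries that carry a frame t and a value s) follows the Lottie format, but easing handles are omitted, so a real renderer may need them added.

```python
import copy

def animated(keyframes):
    """Lottie animated property: a=1, k is a list of keyframes (t = frame, s = value)."""
    return {"a": 1, "k": [{"t": t, "s": list(v)} for t, v in keyframes]}

def apply_fade_up_shrink(static_layer, start=0, end=60, rise=40, final_scale=80):
    """Return a copy of a static layer with fade-in, upward motion, and scale-down."""
    layer = copy.deepcopy(static_layer)
    x, y = layer["ks"]["p"]["k"][:2]  # assumes a static (non-animated) position
    layer["ks"]["o"] = animated([(start, [0]), (end, [100])])
    # Screen y grows downward, so starting at y + rise moves the layer up.
    layer["ks"]["p"] = animated([(start, [x, y + rise]), (end, [x, y])])
    layer["ks"]["s"] = animated([(start, [100, 100]), (end, [final_scale, final_scale])])
    return layer

static_layer = {
    "ty": 4,
    "ks": {
        "o": {"a": 0, "k": 100},
        "p": {"a": 0, "k": [256, 256]},
        "s": {"a": 0, "k": [100, 100]},
    },
}
moving_layer = apply_fade_up_shrink(static_layer)
print(moving_layer["ks"]["p"])
```

Keeping the template separate from the geometry is what lets one clustered motion pattern be reused across many static SVG shapes.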

The processing pipeline includes spatial normalization (all animations scaled to 512×512 with preserved aspect ratio), temporal normalization (0–60 frames, 30 fps, pastel backgrounds), and extraction of representative frames for multimodal scenarios.
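The two normalization steps reduce to simple arithmetic, sketched below. Centering the scaled content on the canvas and remapping frames linearly are assumptions; the summary states only the target size, frame range, and frame rate.

```python
def fit_to_canvas(width, height, canvas=512):
    """Uniform scale that fits the longer side to the canvas, plus centering offsets."""
    scale = canvas / max(width, height)
    new_w, new_h = width * scale, height * scale
    offset = ((canvas - new_w) / 2, (canvas - new_h) / 2)
    return scale, offset

def remap_frame(t, ip, op, target_end=60):
    """Linearly map a frame index from [ip, op] onto [0, target_end]."""
    return (t - ip) / (op - ip) * target_end

print(fit_to_canvas(1024, 512))  # (0.5, (0.0, 128.0))
print(remap_frame(45, 0, 90))    # 30.0
```

At 30 fps, the 60-frame cap corresponds to two seconds of animation.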

Annotations employ a two-stage VLM-based captioning: initial coarse overall descriptions (color, object, motion type, style), followed by temporally structured captions emphasizing geometry and motion.

MMLottie-Bench, tailored for systematic evaluation, encompasses:

  • Real subset: 450 held-out professional samples (150 per prompt type)
  • Synthetic subset: 450 prompts/videos/images generated via LMs (GPT-4o, Gemini3.1-Pro, Seedance, etc.)
  • Evaluation metrics: rendering success rate, FVD, PSNR, SSIM, DINO, CLIP similarity, and 0–10 subjective object/motion alignment via Claude-3.5, alongside token and runtime efficiency

Mean animation durations: 3.2s (web), 2.8s (SVG), with post-normalization capped at 60 frames. Average layer complexity is 8.6 for web Lotties and 3.2 for SVG, with 87% shape layers, 8% precomps, 3% nulls.

5. Model Training and Implementation

The core OmniLottie model leverages the Qwen2.5-VL backbone (∼7B parameters) extended with custom Lottie token embeddings. Key training parameters include:

  • Mixed batch composition (70% web Lottie, 30% SVG-derived Lottie as empirically optimal)
  • Cross-entropy loss over the tokenizer output
  • AdamW optimization with warmup and cosine decay schedules, weight decay ∼0.02
  • Regularization via mixed-precision (FP16), dropout on transformer heads (p = 0.1), and heavy data augmentation
  • Multi-GPU distributed training on NVIDIA A100 hardware
  • Inference uses greedy and nucleus sampling (p = 0.9) to balance diversity and JSON validity
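Two of the listed ingredients, the warmup-plus-cosine learning-rate schedule and nucleus (top-p) sampling, can be sketched in a few lines. The base learning rate, warmup length, and total step count are hypothetical, since the summary does not state them; only p = 0.9 comes from the text.

```python
import math

def lr_at(step, base_lr=1e-4, warmup_steps=1000, total_steps=10000):
    """Linear warmup to base_lr, then cosine decay to zero (values are assumptions)."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

def top_p_filter(probs, p=0.9):
    """Keep the smallest set of most-likely tokens whose mass reaches p, renormalized."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

print(lr_at(0), lr_at(1000), lr_at(10000))
filtered = top_p_filter([0.5, 0.3, 0.15, 0.05], p=0.9)
print(sorted(filtered))  # [0, 1, 2]
```

Truncating the low-probability tail is what ties nucleus sampling to validity here: unlikely tokens are the ones most apt to break the token grammar, while the retained set still allows variation between samples.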

6. Experimental Evaluation and Comparative Results

OmniLottie is evaluated on text-to-Lottie, text+image-to-Lottie, and video-to-Lottie tasks, with baselines including DeepSeekV3, Qwen2.5-VL (zero-shot), GPT-5, commercial Recraft, AniClipart, Livesketch, and Gemini3.1-Pro.

Core metrics span:

  • Success Rate (valid renderable JSON files)
  • FVD (lower is better) over rendered video
  • CLIP similarity (alignment of prompt and animation)
  • Object Align and Motion Align (0–10, subjective alignment scores assigned by Claude-3.5)
  • PSNR, SSIM, and DINO for video-based tasks
  • Token count and runtime
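A simplified version of the success-rate metric can be sketched as a validity check over generated files. The benchmark's criterion is that the output renders; the check below only tests that the text parses as JSON and contains the core top-level fields, which is a weaker proxy chosen for illustration. `REQUIRED_KEYS` and both function names are assumptions.

```python
import json

# Core top-level Lottie fields used as a minimal validity proxy (assumption).
REQUIRED_KEYS = {"v", "fr", "ip", "op", "w", "h", "layers"}

def is_valid_lottie(text):
    """True if text parses as a JSON object containing the core Lottie fields."""
    try:
        doc = json.loads(text)
    except json.JSONDecodeError:
        return False
    return isinstance(doc, dict) and REQUIRED_KEYS <= doc.keys()

def success_rate(outputs):
    """Percentage of outputs that pass the validity check."""
    return 100.0 * sum(is_valid_lottie(o) for o in outputs) / len(outputs)

good = json.dumps({"v": "5.7.0", "fr": 30, "ip": 0, "op": 60,
                   "w": 512, "h": 512, "layers": []})
outputs = [good, '{"v": "5.7.0"', '{"v": "5.7.0"}', good]
print(success_rate(outputs))  # 50.0
```

The second sample fails because of an unclosed brace and the third because of missing fields, the two failure modes that unconstrained JSON generation tends to produce.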

Results on MMLottie-Bench (real subset):

| Task | Success (%) | FVD (↓) | CLIP (↑) | Obj Align | Motion Align |
|---|---|---|---|---|---|
| Text→Lottie | 88.3 | 202.1 | 0.2748 | 4.44 | 5.94 |
| Text+Image→Lottie | 93.3 | 180.3 | 0.2666 | 5.10 | 4.44 |
| Video→Lottie | 88.1 | 227.1 | | | |

Video→Lottie also achieves PSNR=16.08 dB, SSIM=0.82, and DINO=0.92.
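For reference, PSNR is computed from the mean squared error between a rendered frame and its target. The sketch below works on flat lists of pixel values and assumes an 8-bit range (maximum value 255); the paper's exact rendering and averaging protocol is not described in this summary.

```python
import math

def psnr(frame_a, frame_b, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    mse = sum((a - b) ** 2 for a, b in zip(frame_a, frame_b)) / len(frame_a)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_value ** 2 / mse)

# Every pixel off by one gray level: MSE = 1, so PSNR = 20 * log10(255).
print(psnr([10, 20, 30], [11, 21, 31]))
```

Under the same 8-bit assumption, a PSNR near 16 dB would correspond to a root-mean-square pixel error of roughly 40 gray levels, which suggests that the high SSIM and DINO scores reflect structural and semantic agreement rather than pixel-exact reconstruction.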

Ablation studies show:

  • Mixing 70% real Lottie and 30% SVG yields best overall results.
  • Tokenizer provides a critical advantage: raw-JSON fine-tuning yields ∼13% Text→Lottie success, whereas the tokenizer enables 97% (FVD: 269.5 vs. 459.4).
  • Qualitatively, OmniLottie outputs capture semantic alignment and nuanced motion details; baselines often fail to produce valid JSON or exhibit geometric/motion misalignment.

7. Significance and Impact

OmniLottie demonstrates that with a domain-aware tokenizer, pretrained multimodal transformers, and a large-scale annotated dataset, it is possible to achieve robust, semantically precise, and visually plausible vector animation generation conditioned on diverse multi-modal prompts. The tokenizer addresses crucial modeling obstacles inherent in Lottie’s verbose schema and enables highly compact, valid, and compositional sequence modeling.

By curating MMLottie-2M, the largest professionally annotated vector animation dataset, and defining comprehensive evaluation metrics, OmniLottie establishes both a new methodological foundation and an empirical benchmark for vector animation synthesis research (Yang et al., 2 Mar 2026).

References (1)
