Lottie Tokenization for Neural Animation
- Lottie Tokenization is the process of converting raw JSON vector animations into discrete tokens that efficiently capture key visual, temporal, and stylistic elements.
- The method decomposes Lottie files into command, numeric, and text tokens, reducing average sequence length by about 81% and simplifying model training.
- Experimental results demonstrate that tokenized data boosts animation synthesis performance, evidenced by higher success rates and improved generative metrics.
Lottie tokenization is the process of transforming raw Lottie JSON vector animation files into structured, compact sequences of discrete tokens suitable for neural generative modeling. The principal objective is to encode all relevant semantic content (animation shapes, colors, transforms, keyframe timing, and motion curves) while excluding invariant, redundant, or inert metadata and low-level JSON syntax. This representation is foundational to frameworks like OmniLottie, which train generative vision-language models on tokenized Lottie data for flexible, high-quality animation synthesis from multi-modal instructions (Yang et al., 2 Mar 2026).
1. Formal Problem Formulation
The Lottie tokenization problem can be framed as converting a raw Lottie JSON file into a linear discrete token sequence that is maximally informative for vector animation generation while minimizing irrelevant syntactical and structural overhead. The hierarchical structure of Lottie files is decomposed into:
- Base metadata: version, framerate, in/out points, dimensions, and project-level settings.
- Layer set: each layer is a tuple of four components:
  - Layer type τ (Precomp, Solid, Null, Shape, Text)
  - Static attributes (e.g., shapes, text content)
  - Per-layer transform properties (position, scale, rotation, opacity, skew)
  - Per-layer effect stacks (masks, blend modes, effects)
Tokenization removes all inert metadata (e.g., base64 blobs, camera/audio layers, redundant flags) and JSON formatting tokens. Each continuous numeric parameter (e.g., position, scale) is quantized and mapped to a unique integer token by a function of its value and parameter type, governed by a per-type scale and a per-type vocabulary offset. Textual fields are tokenized via the backbone's pretrained text tokenizer (Yang et al., 2 Mar 2026).
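The offset-and-scale mapping can be illustrated with a minimal Python sketch. The exact formula is not recoverable from the text here, so the form `offset + round(value * scale)` is an assumption, and the `ParamType` class and the concrete offsets and scales are hypothetical values chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParamType:
    """Quantization settings for one numeric parameter family (hypothetical)."""
    name: str
    offset: int   # first token id of this family's sub-range (assumed)
    scale: float  # quantization steps per unit of the raw value (assumed)


def quantize(value: float, ptype: ParamType) -> int:
    # Assumed form: scale the value, round to the nearest step, shift by the offset.
    return ptype.offset + round(value * ptype.scale)


def dequantize(token: int, ptype: ParamType) -> float:
    # Inverse of the assumed mapping, exact up to the quantization step.
    return (token - ptype.offset) / ptype.scale


# Illustrative settings only; not the values used by the actual tokenizer.
SPATIAL = ParamType("spatial", offset=20000, scale=1.0)
ROTATION = ParamType("rotation", offset=1000, scale=10.0)

print(quantize(256, SPATIAL))     # 20256
print(quantize(12.3, ROTATION))   # 1123
print(dequantize(1123, ROTATION)) # 12.3
```

A larger scale gives finer resolution at the cost of a wider token sub-range. Signed parameters (for example negative rotations) would need a shifted or clamped range, which this sketch does not handle.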
2. Taxonomy of Token Types
The Lottie tokenizer operationalizes three principal token families:
- Command Tokens
  - <META>: Initiates encoding of the base metadata.
  - <LAYER-τ>: Starts a new layer block of type τ.
  - <END>: Terminates the current layer block.
- Numeric-Parameter Tokens
  - Each numeric property is mapped to a distinct integer token sub-range via an offset and a quantization scale specific to its semantic type:
    - Temporal (in/out points, keyframes)
    - Spatial (position, anchor)
    - Scale (x/y scaling)
    - Rotation (degrees)
    - Opacity (0–100)
    - Style (stroke width, corner radius)
    - Color (R/G/B, 0–255)
    - Easing-curve control points
  - Example: an offset of 1000 with a scale of 10 for one parameter type, and an offset of 20000 with a scale of 1 for another.
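- Text-Field Tokens
  - String-valued fields (layer name, content, font family) are tokenized by the backbone's text tokenizer and serialized as a length-prefixed sequence: the token count followed by the token ids, e.g. [2, idx("My"), idx("Rect")] for the layer name "MyRect".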
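The idea of giving each numeric family its own token sub-range can be sketched as follows. The family sizes, the starting id, and the helper names `build_layout` and `family_of` are hypothetical; the sketch only shows how contiguous ranges keep families from overlapping and let a decoder recover a token's semantic type from its id.

```python
from typing import Dict, Optional

# Hypothetical number of quantization bins per numeric family.
FAMILY_SIZES: Dict[str, int] = {
    "temporal": 2048,
    "spatial": 4096,
    "scale": 1024,
    "rotation": 3600,
    "opacity": 101,   # values 0..100
    "style": 512,
    "color": 256,     # values 0..255 per channel
    "easing": 1001,
}


def build_layout(first_id: int, sizes: Dict[str, int]) -> Dict[str, range]:
    """Assign each family a contiguous, non-overlapping range of token ids."""
    layout: Dict[str, range] = {}
    next_id = first_id
    for name, size in sizes.items():
        layout[name] = range(next_id, next_id + size)
        next_id += size
    return layout


def family_of(token_id: int, layout: Dict[str, range]) -> Optional[str]:
    """Recover the semantic family of a numeric token from its id."""
    for name, ids in layout.items():
        if token_id in ids:
            return name
    return None


layout = build_layout(first_id=1000, sizes=FAMILY_SIZES)
opacity_token = layout["opacity"].start + 50   # opacity value 50
print(family_of(opacity_token, layout))        # opacity
```

In a real vocabulary the first numeric id would sit after the backbone's text vocabulary and the command tokens, so that text ids, command ids, and numeric ids never collide.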
3. Algorithmic Workflow and Example
The conversion from Lottie JSON to tokens proceeds via a deterministic sequence:
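- Remove non-parameterizable layers (images, audio, camera), base64 blobs, and After Effects (AE) expressions.
- Extract the base metadata.
- Initialize the token sequence with the <META> command token.
- Quantize each metadata field with the numeric mapping and append the result to the sequence.
- For each layer of type τ:
  - Append <LAYER-τ>.
  - For each numeric parameter in the layer's attributes, transforms, and effects: append its quantized token.
  - For each string field in the layer: tokenize it with the backbone's text tokenizer and serialize it as a length-prefixed sequence.
  - Append <END>.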
Example (simplified):
Given a basic rectangle animation,
{
  "v": "5.9.0",
  "fr": 30,
  "ip": 0, "op": 60,
  "w": 512, "h": 512,
  "layers": [
    {
      "ty": 4, "nm": "MyRect",
      "ks": {
        "p": {"k": [256, 256]},
        "s": {"k": [100, 100]},
        "r": {"k": 0}
      },
      "shapes": [
        {
          "ty": "rc",
          "p": {"k": [256, 256]},
          "s": {"k": [200, 100]},
          "r": {"k": 10}
        }
      ]
    }
  ]
}
the tokenizer produces a sequence of the following form:
[<META>, token("5.9.0", ver), token(30, fr), token(0, ip), token(60, op),
token(512, w), token(512, h), token(0, ddd),
<LAYER-4>,
token(256, spatial), token(256, spatial),
token(100, scale), token(100, scale),
token(0, rotation),
[2, idx("My"), idx("Rect")],
<END>,
<LAYER-4-shape>,
token(256, spatial), token(256, spatial),
token(200, scale), token(100, scale),
token(10, cornerRadius),
<END>]
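The workflow and example above can be approximated by a small end-to-end Python sketch. It is a simplification under stated assumptions, not the actual tokenizer: it handles only static (non-keyframed) values and rectangle shapes, skips the version string and the `ddd` flag, assumes the mapping `offset + round(value * scale)`, and uses a character-level stand-in for the backbone's text tokenizer. The `QUANT` table and all function names are hypothetical.

```python
from typing import Any, Callable, Dict, List, Tuple, Union

Token = Union[str, int]

# Hypothetical (offset, scale) pairs per parameter family; illustrative only.
QUANT: Dict[str, Tuple[int, int]] = {
    "temporal": (10000, 1),
    "spatial": (20000, 1),
    "scale": (30000, 1),
    "rotation": (1000, 10),
    "style": (40000, 1),
}


def q(value: float, family: str) -> int:
    """Quantize a value, assuming token = offset + round(value * scale)."""
    offset, scale = QUANT[family]
    return offset + round(value * scale)


def toy_text_tokenizer(text: str) -> List[int]:
    """Stand-in for the backbone tokenizer: one id per character."""
    return [ord(ch) for ch in text]


def text_field(text: str, tokenize: Callable[[str], List[int]]) -> List[Token]:
    """Length-prefixed serialization: token count, then the token ids."""
    ids = tokenize(text)
    return [len(ids), *ids]


def tokenize_lottie(
    doc: Dict[str, Any],
    tokenize: Callable[[str], List[int]] = toy_text_tokenizer,
) -> List[Token]:
    seq: List[Token] = ["<META>"]
    seq += [q(doc[key], "temporal") for key in ("fr", "ip", "op")]
    seq += [q(doc[key], "spatial") for key in ("w", "h")]
    for layer in doc["layers"]:
        ks = layer["ks"]
        seq.append(f"<LAYER-{layer['ty']}>")
        seq += [q(v, "spatial") for v in ks["p"]["k"]]
        seq += [q(v, "scale") for v in ks["s"]["k"]]
        seq.append(q(ks["r"]["k"], "rotation"))
        seq += text_field(layer.get("nm", ""), tokenize)
        seq.append("<END>")
        for shape in layer.get("shapes", []):
            if shape["ty"] != "rc":
                continue  # this sketch only serializes rectangles
            seq.append(f"<LAYER-{layer['ty']}-shape>")
            seq += [q(v, "spatial") for v in shape["p"]["k"]]
            seq += [q(v, "scale") for v in shape["s"]["k"]]
            seq.append(q(shape["r"]["k"], "style"))  # corner radius
            seq.append("<END>")
    return seq


doc = {
    "v": "5.9.0", "fr": 30, "ip": 0, "op": 60, "w": 512, "h": 512,
    "layers": [{
        "ty": 4, "nm": "MyRect",
        "ks": {"p": {"k": [256, 256]}, "s": {"k": [100, 100]}, "r": {"k": 0}},
        "shapes": [{"ty": "rc", "p": {"k": [256, 256]},
                    "s": {"k": [200, 100]}, "r": {"k": 10}}],
    }],
}
tokens = tokenize_lottie(doc)
print(tokens[:6])   # ['<META>', 10030, 10000, 10060, 20512, 20512]
print(len(tokens))  # 27
```

The whole rectangle animation fits in 27 tokens here because braces, commas, and key names never enter the sequence; only command markers, quantized values, and the length-prefixed name remain.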
4. Design Principles and Heuristics
Several key heuristics underpin the Lottie tokenizer:
- Core vs. Conditional Fields: Auxiliary blocks (assets, markers, fonts, chars) are serialized only when a layer references them; otherwise they are pruned.
- Offset-Based Quantization: Offset and scale values for each parameter type are calibrated empirically to maximize token-range utilization and prevent overlap.
- Text-Token Integration: The backbone's pretrained text tokenizer is reused for string fields, leveraging the language understanding inherent in the backbone.
- Structural Marker Reduction: All JSON punctuation is mapped to three control tokens, eliminating superfluous tokens from braces, commas, or nested arrays.
- Pad Values for Optionals: Reserved pad tokens ensure fixed slot allocation without vocabulary bloat.
- Invariant Field Pruning: Fields that hold default values (e.g., unused blend modes) are omitted.
- Synthetic Data Mixture: Mixing 70% native Lottie data with 30% SVG-derived synthetic samples improves model generality and balances motion/statistics diversity (Yang et al., 2 Mar 2026).
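The pruning heuristics (dropping non-parameterizable layers, base64 blobs, and unreferenced auxiliary blocks) can be sketched in Python as below. This is an illustrative approximation, not the tokenizer's actual preprocessing: it inspects only top-level layers, ignores layers nested inside precomposition assets, and does not strip expressions. The layer type codes (2 image, 5 text, 6 audio, 13 camera) and the `refId`, `id`, and `p` keys follow the Lottie format's conventions; the function name `prune_lottie` is hypothetical.

```python
import copy
from typing import Any, Dict

# Lottie layer type codes for layers that are dropped: image, audio, camera.
DROPPED_LAYER_TYPES = {2, 6, 13}
TEXT_LAYER_TYPE = 5


def prune_lottie(doc: Dict[str, Any]) -> Dict[str, Any]:
    """Return a pruned copy of a Lottie document (top-level layers only)."""
    out = copy.deepcopy(doc)
    out["layers"] = [
        layer for layer in out.get("layers", [])
        if layer.get("ty") not in DROPPED_LAYER_TYPES
    ]
    # Keep only assets that a surviving layer references and that are not
    # embedded data URIs (base64 blobs).
    referenced = {layer["refId"] for layer in out["layers"] if "refId" in layer}
    out["assets"] = [
        asset for asset in out.get("assets", [])
        if asset.get("id") in referenced
        and not str(asset.get("p", "")).startswith("data:")
    ]
    # Font and glyph tables only matter when a text layer survives.
    if not any(layer.get("ty") == TEXT_LAYER_TYPE for layer in out["layers"]):
        out.pop("fonts", None)
        out.pop("chars", None)
    return out


example = {
    "layers": [
        {"ty": 4, "nm": "shape"},
        {"ty": 2, "nm": "photo", "refId": "img_0"},
        {"ty": 0, "nm": "nested", "refId": "comp_0"},
    ],
    "assets": [
        {"id": "img_0", "p": "data:image/png;base64,AAAA"},
        {"id": "comp_0", "layers": []},
        {"id": "unused_1", "layers": []},
    ],
    "fonts": {"list": []},
}
pruned = prune_lottie(example)
print([layer["ty"] for layer in pruned["layers"]])  # [4, 0]
print([asset["id"] for asset in pruned["assets"]])  # ['comp_0']
```

Working on a deep copy keeps the original document intact, which is convenient when the raw and pruned versions are compared for token counts.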
5. Quantitative Effectiveness and Downstream Impact
Ablation studies show that the Lottie tokenizer is critical to both model efficiency and generative performance:
- Token Compression: Raw Lottie JSON serialized with the Qwen2.5-VL tokenizer averages 2,562 tokens per file; Lottie tokenization reduces this by 81% (to 486 tokens).
- Task Performance:
  - Text-to-Lottie Generation: Success rate (SR) improves from 13.4% (raw JSON) to 97.3% (tokenized); Fréchet Video Distance (FVD) decreases substantially (459.39 → 269.50); CLIP similarity increases (0.2600 → 0.2748).
  - Video-to-Lottie Generation: SR rises from 10.2% to 90.7%; FVD drops from 431.23 to 281.95; object and motion scores improve markedly (e.g., Obj: 1.33 → 4.31, Mot: 1.82 → 5.63).
  - Text-Image-to-Lottie: Overall SR increases from 15.9% to 92.0%; FVD improves by 25–30%; CLIP alignment by ~12%; object/motion metrics from ~1–2 to ~4–5.
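The reported figures can be cross-checked with a few lines of arithmetic. The sketch below only recomputes ratios from the numbers quoted above; the helper name `relative_drop` is hypothetical.

```python
def relative_drop(before: float, after: float) -> float:
    """Fractional decrease from `before` to `after`."""
    return (before - after) / before


raw_tokens, compact_tokens = 2562, 486

reduction = relative_drop(raw_tokens, compact_tokens)
compression_factor = raw_tokens / compact_tokens
print(f"token reduction: {reduction:.1%}")          # token reduction: 81.0%
print(f"compression factor: {compression_factor:.2f}x")

# Relative FVD improvements implied by the quoted before/after values.
fvd_text = relative_drop(459.39, 269.50)
fvd_video = relative_drop(431.23, 281.95)
print(f"FVD drop, text-to-Lottie: {fvd_text:.1%}")
print(f"FVD drop, video-to-Lottie: {fvd_video:.1%}")
```

The 81% figure is therefore an average reduction (486 of 2,562 tokens remain), corresponding to sequences a little over five times shorter.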
A plausible implication is that efficient tokenization not only enables major reductions in sequence length but also re-focuses model capacity toward animation semantics, yielding state-of-the-art results across all evaluated multi-modal vector animation tasks (Yang et al., 2 Mar 2026).
6. Significance for Generative Animation Modeling
The Lottie tokenizer constitutes the enabling mechanism for frameworks such as OmniLottie to utilize pretrained vision–language backbones effectively. By providing a concise and semantically aligned token stream, it ensures that learning resources are concentrated on animation-relevant factors—shapes, timing, visual style, motion—rather than wasted on syntactic or redundant artifacts of the original JSON representation. This suggests a general pathway for improved vector graphics generation from multi-modal inputs, facilitating both sequence modeling scalability and rigorous alignment with user instructions (Yang et al., 2 Mar 2026).