Papers
Topics
Authors
Recent
Search
2000 character limit reached

Lottie Tokenization for Neural Animation

Updated 4 March 2026
  • Lottie Tokenization is the process of converting raw JSON vector animations into discrete tokens that efficiently capture key visual, temporal, and stylistic elements.
  • The method decomposes Lottie files into command, numeric, and text tokens, reducing sequence lengths by up to 81% and simplifying model training.
  • Experimental results demonstrate that tokenized data boosts animation synthesis performance, evidenced by higher success rates and improved generative metrics.

Lottie tokenization refers to the process of transforming raw Lottie JSON vector animation files into structured, compact sequences of discrete tokens suitable for neural generative modeling. The principal objective is to encode all relevant semantic content—such as animation shapes, colors, transforms, keyframe timing, and motion curves—while excluding invariant, redundant, or inert metadata and low-level JSON syntax. This representation is foundational to frameworks like OmniLottie, which leverage tokenized Lottie data for training generative vision–LLMs capable of flexible, high-quality animation synthesis from multi-modal instructions (Yang et al., 2 Mar 2026).

1. Formal Problem Formulation

The Lottie tokenization problem can be framed as converting a raw Lottie JSON file JJ into a linear discrete token sequence TT that is maximally informative for vector animation generation while minimizing irrelevant syntactical and structural overhead. The hierarchical structure of Lottie files is decomposed into:

  • Base metadata MM: {v,fr,ip,op,w,h,nm,ddd}\{v, fr, ip, op, w, h, nm, ddd\}, including version, framerate, input/output times, dimensions, and project-level settings.
  • Layer set L={L1,...,LN}\mathcal{L} = \{L_1, ..., L_N\}: Each layer LiL_i is a tuple (τi,Aτi,Ti,Ei)(\tau_i, A_{\tau_i}, T_i, E_i) with:
    • τi\tau_i: layer type {0,1,3,4,5}\in \{0,1,3,4,5\} (Precomp, Solid, Null, Shape, Text)
    • AτiA_{\tau_i}: static attributes (e.g., shapes, text content)
    • TiT_i: per-layer transform properties (position, scale, rotation, opacity, skew)
    • EiE_i: per-layer effect stacks (masks, blend modes, effects)

Tokenization involves removing all inert metadata (e.g., base64 blobs, camera/audio layers, redundant flags) and JSON formatting tokens. Continuous numeric parameters xx (e.g., position, scale) are quantized and assigned a unique integer token via mapping token(p,t)=pst+ottoken(p, t) = \lfloor p \cdot s_t \rfloor + o_t, where sts_t is a scale and oto_t is a vocabulary offset. Textual fields are tokenized via the backbone's pretrained text tokenizer VtextV_{text} (Yang et al., 2 Mar 2026).

2. Taxonomy of Token Types

The Lottie tokenizer operationalizes three principal token families:

  • Command Tokens
    • <META>: Initiates encoding of base metadata MM.
    • <LAYER-\tau>: Starts a new layer block of type τ\tau.
    • <END>: Terminates the current layer block.
  • Numeric-Parameter Tokens
    • Each numeric property is mapped to a distinct integer token sub-range via offset and quantization specific to its semantic type:
    • Temporal (in/out points, keyframes)
    • Spatial (position, anchor)
    • Scale (x/y scaling)
    • Rotation (degrees)
    • Opacity (0–100)
    • Style (stroke width, corner radius)
    • Color (R/G/B, 0–255)
    • Easing-curve control points
    • Example: otemporalo_{temporal} = 1000, stemporals_{temporal} = 10; ospatialo_{spatial} = 20000, sspatials_{spatial} = 1.
  • Text-Field Tokens
    • String-valued fields (layer name, content, font family) are tokenized by the backbone VtextV_{text} and serialized as a length-prefixed sequence: [n,tok1,...,tokn][n, tok_1, ..., tok_n].

3. Algorithmic Workflow and Example

The conversion from Lottie JSON to tokens proceeds via a deterministic sequence:

  1. Remove non-parameterizable layers (images, audio, camera), base64 blobs, and After Effects (AE) expressions.
  2. Extract base metadata MM.
  3. Initialize the token sequence as T[<META>]T \gets [<META>].
  4. Quantize each xMx \in M using token(x,type(x))token(x, \text{type}(x)) and append to TT.
  5. For each layer LiL_i:
    • Append <LAYERτi><LAYER-\tau_i>.
    • For each numeric parameter pp in (parent,index,ip,op,sr,st,Ti,Ei)(parent, index, ip, op, sr, st, T_i, E_i): append token(p,type(p))token(p, \text{type}(p)).
    • For each string field GG in AτiA_{\tau_i}: tokenize with Vtext(G)V_{text}(G), serialize [n,tok1,...,tokn][n, tok_1, ..., tok_n].
    • Append <END><END>.

Example (simplified):

Given a basic rectangle animation,

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
{
  "v": "5.9.0",
  "fr": 30,
  "ip": 0, "op": 60,
  "w": 512, "h": 512,
  "layers": [
    {
      "ty": 4, "nm": "MyRect",
      "ks": {
        "p": {"k": [256, 256]},
        "s": {"k": [100, 100]},
        "r": {"k": 0}
      },
      "shapes": [
        {
          "ty": "rc",
          "p": {"k": [256, 256]},
          "s": {"k": [200, 100]},
          "r": {"k": 10}
        }
      ]
    }
  ]
}
becomes:
1
2
3
4
5
6
7
8
9
10
11
12
13
[<META>, token("5.9.0", ver), token(30, fr), token(0, ip), token(60, op),
 token(512, w), token(512, h), token(0, ddd),
 <LAYER-4>,
  token(256, spatial), token(256, spatial),
  token(100, scale), token(100, scale),
  token(0, rotation),
  [2, idx("My"), idx("Rect")],
 <END>,
 <LAYER-4-shape>,
  token(256, spatial), token(256, spatial),
  token(200, scale), token(100, scale),
  token(10, cornerRadius),
 <END>]
(Yang et al., 2 Mar 2026)

4. Design Principles and Heuristics

Several key heuristics underpin the Lottie tokenizer:

  • Core vs. Conditional Fields: Only serialize auxiliary blocks (assets, markers, fonts, chars) when referenced by a layer, pruning otherwise.
  • Offset-Based Quantization: Offset and scale values (ot,st)(o_t, s_t) for each parameter type are calibrated empirically to maximize token-range utilization and prevent overlap.
  • Text-Token Integration: Pretrained VtextV_{text} is reused for string fields, leveraging language understanding inherent in the backbone.
  • Structural Marker Reduction: All JSON punctuation is mapped to three control tokens, eliminating superfluous tokens from braces, commas, or nested arrays.
  • Pad Values for Optionals: Reserved pad tokens PADt=ot1PAD_t = o_t - 1 ensure fixed slot allocation without vocabulary bloat.
  • Invariant Field Pruning: Defaults (e.g., ddd:0ddd:0, unused blend modes) are omitted when present.
  • Synthetic Data Mixture: Mixing 70% native Lottie data with 30% SVG-derived synthetic samples improves model generality and balances motion/statistics diversity (Yang et al., 2 Mar 2026).

5. Quantitative Effectiveness and Downstream Impact

Comprehensive ablation studies demonstrate the criticality of the Lottie tokenizer for model efficiency and generative performance:

  • Token Compression: Raw Lottie JSON serialized with the Qwen2.5-VL tokenizer averages 2,562 tokens per file; Lottie tokenization reduces this by 81% (to 486 tokens).
  • Task Performance:
    • Text-to-Lottie Generation: Success rate (SR) improves from 13.4% (raw JSON) to 97.3% (tokenized); Fréchet Video Distance (FVD) decreases substantially (459.39 → 269.50); CLIP similarity increases (0.2600 → 0.2748).
    • Video-to-Lottie Generation: SR grows from 10.2% to 90.7%; FVD from 431.23 to 281.95; objective and motion scores improve markedly (e.g., Obj: 1.33 → 4.31, Mot: 1.82 → 5.63).
    • Text-Image-to-Lottie: Overall SR increases from 15.9% to 92.0%; FVD improves by 25–30%; CLIP alignment by ~12%; object/motion metrics from ~1–2 to ~4–5.

A plausible implication is that efficient tokenization not only enables major reductions in sequence length but also re-focuses model capacity toward animation semantics, yielding state-of-the-art results across all evaluated multi-modal vector animation tasks (Yang et al., 2 Mar 2026).

6. Significance for Generative Animation Modeling

The Lottie tokenizer constitutes the enabling mechanism for frameworks such as OmniLottie to utilize pretrained vision–language backbones effectively. By providing a concise and semantically aligned token stream, it ensures that learning resources are concentrated on animation-relevant factors—shapes, timing, visual style, motion—rather than wasted on syntactic or redundant artifacts of the original JSON representation. This suggests a general pathway for improved vector graphics generation from multi-modal inputs, facilitating both sequence modeling scalability and rigorous alignment with user instructions (Yang et al., 2 Mar 2026).

Definition Search Book Streamline Icon: https://streamlinehq.com
References (1)

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to Lottie Tokenization.