
Large Vectorized Glyph Model

Updated 22 November 2025
  • LVGM is a transformer-based generative framework that models Chinese glyphs as ordered sequences of vector strokes, ensuring scalable and high-fidelity synthesis.
  • It employs a modified GPT-style transformer and vector quantization to accurately predict stroke sequences, enhancing stroke coherence and rendering quality.
  • Empirical evaluations on large-scale SVG datasets show LVGM achieves superior performance metrics and supports dynamic typography and artistic applications.

The Large Vectorized Glyph Model (LVGM) is a transformer-based generative framework designed for high-fidelity, scalable generation of vectorized Chinese glyphs. Leveraging a sequence modeling paradigm inspired by natural language processing, LVGM encodes and predicts ordered stroke sequences, treating each glyph as a complex, structured composition of vector strokes. Its architecture, pretraining methodology, and empirical performance distinguish LVGM as a state-of-the-art solution for dynamic vector glyph generation, particularly in large-scale typographical and artistic contexts (Zhang et al., 14 Nov 2025).

1. Problem Definition and Theoretical Motivation

LVGM formulates Chinese character synthesis as sequential prediction over sets of vector strokes. Each glyph $G$ is defined as an ordered list $G = \{S_1, S_2, \ldots, S_{N_g}\}$, where each stroke $S_i$ is a vector path parameterized as an ordered sequence of cubic Bézier segments. The model’s objective is, given a partial prefix $\{S_1, \ldots, S_k\}$, to predict the remaining strokes $\{S_{k+1}, \ldots, S_{N_g}\}$, ultimately reconstructing entire glyphs from possibly incomplete data.
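The glyph-as-stroke-sequence formulation above can be sketched as a small data model. This is an illustrative structure only; the class and field names are assumptions, not the paper's implementation:

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]

@dataclass
class BezierSegment:
    """One cubic Bezier segment of a stroke (control points P0..P3)."""
    p0: Point
    p1: Point
    p2: Point
    p3: Point

@dataclass
class Stroke:
    """A vector path S_i: an ordered sequence of cubic Bezier segments."""
    segments: List[BezierSegment]

@dataclass
class Glyph:
    """A glyph G = {S_1, ..., S_{N_g}}: an ordered list of strokes."""
    strokes: List[Stroke]

    def prefix(self, k: int) -> List[Stroke]:
        """The observed prefix {S_1, ..., S_k} given to the model."""
        return self.strokes[:k]

    def completion_target(self, k: int) -> List[Stroke]:
        """The strokes {S_{k+1}, ..., S_{N_g}} the model must predict."""
        return self.strokes[k:]
```

Splitting a glyph into `prefix` and `completion_target` mirrors the training objective: condition on the first $k$ strokes, predict the rest.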

The use of SVG-based vector strokes offers two crucial advantages:

  • Scalability: Vector representations allow glyphs to be rendered at any resolution without aliasing or pixel artifacts.
  • Sequence Coherence: Predicting strokes in order enforces the correct writing sequence, facilitating structural harmony and semantic plausibility across single characters, multi-character words, and full poetic verses.

This stroke-sequential approach generalizes beyond Chinese script, suggesting applicability to other ideographic systems and style-adaptive contexts.

2. Stroke Encoding, Quantization, and Embedding

Each stroke is encoded as a sequence of cubic Bézier drawing instructions:

$$B_C(t) = (1-t)^3 P_0 + 3t(1-t)^2 P_1 + 3t^2(1-t) P_2 + t^3 P_3, \quad t \in [0,1]$$

For each instruction, only the six control point coordinates $(x_{P_0}, y_{P_0}, x_{P_1}, y_{P_1}, x_{P_2}, y_{P_2})$ are retained, normalized to $[-1,1]$.
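The Bézier evaluation and coordinate normalization above are standard; a minimal sketch (the function names and the per-axis normalization scheme are assumptions):

```python
import numpy as np

def cubic_bezier(p0, p1, p2, p3, t):
    """Evaluate B_C(t) = (1-t)^3 P0 + 3t(1-t)^2 P1 + 3t^2(1-t) P2 + t^3 P3."""
    p0, p1, p2, p3 = (np.asarray(p, float) for p in (p0, p1, p2, p3))
    t = np.asarray(t, float)[..., None]  # broadcast t over (x, y) coordinates
    return ((1 - t) ** 3 * p0 + 3 * t * (1 - t) ** 2 * p1
            + 3 * t ** 2 * (1 - t) * p2 + t ** 3 * p3)

def normalize_coords(points, width, height):
    """Map absolute SVG coordinates into [-1, 1] per axis."""
    pts = np.asarray(points, dtype=float)
    scale = np.array([width, height], dtype=float)
    return 2.0 * pts / scale - 1.0
```

At $t = 0$ the curve sits at $P_0$ and at $t = 1$ at $P_3$, so consecutive instructions can share endpoints to form a continuous stroke.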

A sequence of these (“stroke representation”) is processed by a small convolutional encoder that maps each stroke to an $8 \times 16$-dimensional continuous feature. This is quantized via vector quantization (VQ) into a discrete codebook $\mathcal{C}$ of size $K = 30{,}000$, resulting in each stroke being encoded as a tuple of 8 code indices:

$$e_s = E(s) = (z_1, \ldots, z_8), \quad z_j \in \{1, \ldots, K\}$$

No additional positional encoding is required at the stroke level, as sequence ordering is handled by the transformer.
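The VQ step assigns each of the 8 latent vectors to its nearest codebook entry. A minimal nearest-neighbor sketch, assuming Euclidean distance (the paper's exact VQ variant is not specified here):

```python
import numpy as np

def vq_encode(features, codebook):
    """Quantize continuous stroke features to nearest codebook indices.

    features: (8, D) array -- 8 latent vectors per stroke (D = 16 per the text)
    codebook: (K, D) array -- learned codebook C (K = 30,000 in the paper)
    returns:  (8,) integer indices (z_1, ..., z_8)
    """
    # squared Euclidean distance from every feature row to every code vector
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)

def vq_decode(indices, codebook):
    """Recover the quantized embeddings from code indices (codebook lookup)."""
    return codebook[indices]
```

Encode followed by decode snaps each latent vector to the closest code vector, which is what makes the stroke representation a discrete token the transformer can predict.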

3. Transformer Architecture and Prediction Objective

LVGM employs the DeepSeek-Coder-1.3B transformer, a GPT-style LLM with 1.3 billion parameters. The architecture is adapted as follows:

  • Input Layer: replaces standard word-piece tokenization with stroke-embedding tokens (each 16-dimensional, with 8 tokens per stroke).
  • Output Head: predicts the next codebook index via a linear-softmax layer over the $K = 30{,}000$-entry vocabulary.
  • Separator Token: a learned `<sep>` token delimits individual glyphs and multi-character sequences.

Given a token sequence of flattened stroke embeddings, the model is trained via a next-stroke prediction objective:

$$\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta(e_{s_t} \mid e_{s_{<t}})$$

Training employs FlashAttention v2 and the TRL library for efficient optimization, but introduces no additional layers over the base transformer.
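The objective is an ordinary next-token negative log-likelihood over codebook indices. A numerically stable sketch (the function name and array shapes are illustrative assumptions):

```python
import numpy as np

def next_token_nll(logits, targets):
    """L(theta) = -sum_t log p_theta(e_{s_t} | e_{s_<t}).

    logits:  (T, K) unnormalized scores over the K codebook indices per step
    targets: (T,) ground-truth code index e_{s_t} at each step
    """
    # log-softmax with a max shift for numerical stability
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # pick out the log-probability of each target index and negate the sum
    return -log_probs[np.arange(len(targets)), targets].sum()
```

With uniform logits over $K$ classes the per-step loss is $\log K$, the usual sanity check for an untrained autoregressive model.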

4. Training Dataset, Preprocessing, and Hyperparameters

LVGM is trained on the large-scale SVG-Strokes Dataset comprising 907,267 samples drawn from sources such as makemeahanzi, FZSJ-XIAOSXS, and annotated Tang Poems. Samples represent both regular script (744,810) and semi-cursive (162,457) styles, with glyphs containing up to 34 strokes and each stroke up to 49 Bézier curve segments. Manual annotation ensures preservation of semantic “stroke” boundaries and native writing order.

Preprocessing includes normalizing all SVG paths, reshaping to $6 \times 64$ and then to $6 \times 8 \times 8$, and executing encode–quantize–decode in a bootstrapped two-stage process for high-fidelity embeddings.
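The reshape can be pictured as regrouping the 64 padded instruction slots into 8 chunks of 8 per coordinate channel; the exact grouping semantics are an assumption of this sketch:

```python
import numpy as np

def chunk_stroke(stroke):
    """Regroup a padded stroke tensor from 6 x 64 to 6 x 8 x 8.

    Assumes the 6 rows are the control-point coordinates per Bezier
    instruction and the 64 columns are padded instruction slots split
    into 8 groups of 8 (hypothetical interpretation of the paper's shapes).
    """
    assert stroke.shape == (6, 64)
    return stroke.reshape(6, 8, 8)
```

This grouping is what yields 8 latent vectors per stroke for the encoder and, after quantization, the tuple of 8 code indices described earlier.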

Training proceeds in two phases:

  • Stage 1 (VQ encoder/decoder): Adam optimizer, learning rate $1 \times 10^{-4}$, batch size 128, 3,000 iterations.
  • Stage 2 (LLM fine-tuning): batch size 4, label smoothing 0.001, up to 5 epochs over 100,000 mixed samples on dual NVIDIA RTX A6000/A800 GPUs.

A scaling study subsamples the training data to 10K, 25K, 50K, and 100K samples to empirically examine data-scaling laws.

5. Inference, Generation Capabilities, and Evaluation Metrics

The LVGM inference pipeline comprises three stages:

  1. Encode and quantize a partial set of seed strokes to token sequences.
  2. Autoregressively sample codebook indices for the remaining strokes, terminating upon reaching a `<sep>` token or a set maximum length.
  3. Decode each embedding to recover SVG instructions for the final output glyph.
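The sampling loop in step 2 can be sketched as follows; `step_fn` is a hypothetical stand-in for the transformer's softmax head, and the sampling strategy (plain multinomial) is an assumption:

```python
import numpy as np

def generate_strokes(seed_tokens, step_fn, sep_token, max_len=512, rng=None):
    """Autoregressively extend a seed token sequence until <sep> or max_len.

    seed_tokens: list of codebook indices from the encoded seed strokes
    step_fn:     callable tokens -> (K,) probability vector for the next token
    sep_token:   index of the learned <sep> delimiter
    """
    rng = rng or np.random.default_rng()
    tokens = list(seed_tokens)
    while len(tokens) < max_len:
        probs = step_fn(tokens)
        nxt = int(rng.choice(len(probs), p=probs))
        if nxt == sep_token:
            break  # glyph (or multi-glyph sequence) is complete
        tokens.append(nxt)
    return tokens
```

The returned index sequence is then decoded back through the VQ codebook to SVG instructions (step 3).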

LVGM supports recovery of entire glyphs from as little as one or two initial strokes, and can compose semantically coherent words or previously unseen poetic verses in vector form. Expert and graduate-student evaluators scored outputs on identifiability, aesthetics, and literary quality. The final aggregate score is:

$$\mathrm{Score}_{\mathrm{Final}} = 0.4 \cdot \mathrm{Ide} + 0.3 \cdot \mathrm{Aes} + 0.3 \cdot \mathrm{Lit}$$

with expert and student judgments weighted 70%/30%.
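The aggregation arithmetic is straightforward; whether the 70%/30% pooling is applied per dimension or to the final score is not specified, so `pooled_rating` below is one plausible reading:

```python
def final_score(ide, aes, lit):
    """Score_Final = 0.4 * Ide + 0.3 * Aes + 0.3 * Lit (each on a 1-5 scale)."""
    return 0.4 * ide + 0.3 * aes + 0.3 * lit

def pooled_rating(expert, student):
    """Combine expert and graduate-student judgments at 70%/30%
    (assumed to apply to scalar ratings; the exact pooling stage is
    not stated in the source)."""
    return 0.7 * expert + 0.3 * student
```

For example, perfect marks on all three axes yield a final score of 5.0, and an expert rating of 4 pooled with a student rating of 2 gives 3.4.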

Empirical quantitative metrics compared to prior vector font generators:

  • MSE = 0.0197 (lower is better)
  • PSNR = 17.13 dB (higher is better)
  • SSIM = 0.9432 (higher is better)
  • LPIPS = 0.0394 (lower is better)
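MSE and PSNR are directly related: at unit peak, PSNR $= 10 \log_{10}(1/\mathrm{MSE})$, so the reported MSE of 0.0197 corresponds to roughly 17 dB, consistent with the table. A minimal sketch of these two metrics on rasterized glyphs (SSIM and LPIPS require dedicated libraries and are omitted):

```python
import numpy as np

def mse(a, b):
    """Mean squared error between two rasterized glyph images in [0, 1]."""
    return float(((np.asarray(a, float) - np.asarray(b, float)) ** 2).mean())

def psnr(a, b, peak=1.0):
    """PSNR in dB: 10 * log10(peak^2 / MSE); higher means closer images."""
    return float(10.0 * np.log10(peak ** 2 / mse(a, b)))
```

Note that these pixel-space metrics require rasterizing the vector outputs at a fixed resolution before comparison.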

6. Data-Scaling Laws, Human Evaluation, and Limitations

As the dataset size increases from 10K to 100K, observed behaviors include a monotonic decrease in training loss and an increase in the human-annotated final score (from approximately 3.3 to 4.68 out of 5). These findings mirror data-scaling laws observed in NLP sequence models.

Experts cited LVGM’s high stroke-level accuracy (“no missing or spurious strokes”), harmonious stroke thickness, and spatial layout. Limitations include minor jaggedness or insufficient smoothness due to Stage 1 compression and reduced accuracy for extremely rare or highly stylized glyphs, attributable to architectural and dataset capacity constraints.

7. Research Context, Extensions, and Open Questions

LVGM extends the paradigm of sequence modeling in NLP to structured vector graphics. It complements glyph embedding approaches such as those in ChineseBERT (Sun et al., 2021), which fuse bitmap glyph embeddings and pinyin information for text understanding. LVGM’s stroke-token quantization and autoregressive sequence modeling are orthogonal to latent SDF/glyph composition methods such as VecFontSDF (Xia et al., 2023), which leverage implicit shape primitives for reconstructive and generative tasks.

Identified future directions include:

  • Multilingual glyph generation (e.g., Japanese Kanji, Devanagari)
  • Style transfer and interpolation conditioned on designer input or external style codes
  • Deep integration into vector design tools
  • Fine-tuning with reinforcement learning to optimize for aesthetic or perceptual rewards
  • Multimodal inference where textual prompts synthesize bespoke glyphs
  • Open questions regarding joint global-local modeling, continuous style spaces, data licensing, and 3D/dynamic glyph modeling

LVGM thus constitutes an architectural and algorithmic advance in the generative modeling of complex vectorized scripts, supporting high-quality dynamic typeface synthesis, calligraphic art, and novel AI-driven typographical workflows (Zhang et al., 14 Nov 2025).
