VecGlypher: Unified Vector Glyph Generator

Updated 28 February 2026

VecGlypher is a unified, multimodal system that generates high-fidelity, editable vector glyphs directly from text descriptions or glyph image exemplars.
It leverages a large-scale autoregressive transformer to output SVG path tokens using quantized absolute coordinates for high geometric precision.
The two-stage training pipeline first learns to draw on a large corpus of commercial fonts and then aligns style and modality on manually annotated fonts, targeting robust out-of-distribution performance and downstream editability.

VecGlypher is a unified, multimodal vector glyph generation system built on a language model. It produces high-fidelity, editable vector outlines directly from text descriptions or glyph image exemplars. Unlike prior pipelines that depend on intermediate raster representations or disjointed postprocessing, VecGlypher uses a large-scale autoregressive transformer trained to output SVG path tokens, which provides geometric fidelity, style control, and downstream editability in a scalable, prompt-driven workflow (Huang et al., 25 Feb 2026).

1. Core Model Architecture

VecGlypher operates atop a multimodal transformer (e.g., Gemma3-27B or similar), which is fine-tuned to autoregressively generate SVG-path token sequences given a combination of text prompts, optional image exemplars, and target content codepoints.

Input sequence: Comprises (a) a style prompt containing free-form descriptive tags (e.g., “thick slab serif, high contrast”), (b) up to three reference glyph images encoded by the model’s vision backbone, (c) the Unicode codepoint of the target glyph, and (d) a special “<SVG>” token signaling the start of vector output.
SVG tokenization: From the “<SVG>” token onward, the model’s output vocabulary is restricted to discrete drawing commands (MoveTo, LineTo, quadratic CurveTo, ClosePath) interleaved with quantized absolute coordinate tokens. The token stream is an explicit serialization of SVG path syntax.
Training objective: The model is supervised under a causal next-token–prediction loss:

$L = -\sum_{t=1}^T \log P(\text{token}_t \mid \text{token}_{<t}, \text{text\_prompt}, [\text{image\_refs}])$

This architecture enables both pure text-driven and image-conditioned generation of vector glyphs, with cross-attention to the image encoder activated during the style-alignment stage (Huang et al., 25 Feb 2026).
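The training objective above can be illustrated with a minimal sketch. It assumes the model has already produced one logit vector per output position, conditioned on the prompt and any image references; the function name `causal_nll` and the toy four-token vocabulary are hypothetical and not taken from the paper.

```python
import math

def causal_nll(step_logits, target_ids):
    """Sum over t of -log P(token_t | earlier tokens and the prompt).

    step_logits[t] is the (hypothetical) logit vector the model emits at
    position t; target_ids[t] is the index of the ground-truth token.
    """
    total = 0.0
    for logits, target in zip(step_logits, target_ids):
        # Numerically stable log-sum-exp for the softmax normalizer.
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(v - peak) for v in logits))
        total += log_z - logits[target]
    return total

# Toy example: two output positions over a 4-token vocabulary.
logits = [[2.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
targets = [0, 3]
loss = causal_nll(logits, targets)
```

In practice a framework-provided cross-entropy loss would replace this loop; the sketch only makes the summation in $L$ explicit.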

2. Data Preparation, Normalization, and SVG Tokenization

VecGlypher’s two-stage data recipe is essential for stabilizing long-sequence decoding and aligning geometric trajectories with multimodal prompts:

Stage 1 ("Learning to Draw"): The model is trained on 39K commercial fonts from Envato covering the ASCII/Latin set, from which TrueType outlines and Unicode labels are extracted. Font families are de-duplicated to reduce design bias.
Stage 2 ("Style and Modality Alignment"): The model is refined on 2.5K Google Fonts with manually annotated style descriptors and up to three raster exemplars per glyph.

Preprocessing steps:

Outlines normalized to a canonical frame: for each glyph, coordinates $(x, y)$ are projected to $[0, 1]$ and re-scaled to a fixed units-per-em (UPM=1000) grid.
Quantization to 0.1-unit increments on the 1000-unit grid (rounding error at most 0.05 units, i.e. 0.05/1000 em): each normalized coordinate $x' \in [0, 1]$ is mapped to

$x_q = \text{round}(x' \times 10000)/10$

All path commands are canonicalized to absolute M/L/Q/Z with no relative or arc commands present in the training data.
The token vocabulary consists of [M], [L], [Q], [Z], and $\mathrm{COORD}_{0:10000}$ integer tokens, interleaved to represent the sequence:

$[M] [x_1] [y_1] [L] [x_2] [y_2] [Q] [x_3] [y_3] [x_4] [y_4] ... [Z]$

This serializes complex glyph contours as unambiguous, watertight SVG paths, and is critical for robust LLM decoding (Huang et al., 25 Feb 2026).
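The quantization and serialization steps above can be sketched as follows. This is a minimal illustration and not the authors' tokenizer: it assumes coordinates have already been normalized to $[0, 1]$, represents command tokens as strings and coordinate tokens as integer bins, and uses hypothetical names (`quantize`, `bin_to_units`, `serialize`).

```python
# Coordinate pairs expected after each command (assumed from the M/L/Q/Z grammar).
POINTS_PER_COMMAND = {"M": 1, "L": 1, "Q": 2, "Z": 0}

def quantize(x_norm):
    """Map a normalized coordinate in [0, 1] to an integer bin in 0..10000."""
    if not 0.0 <= x_norm <= 1.0:
        raise ValueError(f"coordinate {x_norm} is outside [0, 1]")
    return round(x_norm * 10000)

def bin_to_units(bin_index):
    """Recover the coordinate on the 1000-unit grid (0.1-unit steps)."""
    return bin_index / 10

def serialize(path):
    """Flatten [(command, [(x, y), ...]), ...] into a command/coordinate token list."""
    tokens = []
    for command, points in path:
        if len(points) != POINTS_PER_COMMAND[command]:
            raise ValueError(f"{command} expects {POINTS_PER_COMMAND[command]} point(s)")
        tokens.append(command)
        for x, y in points:
            tokens.extend((quantize(x), quantize(y)))
    return tokens

# One closed contour: a line followed by a quadratic curve back to the start.
contour = [
    ("M", [(0.1, 0.2)]),
    ("L", [(0.5, 0.5)]),
    ("Q", [(0.3, 0.9), (0.1, 0.2)]),
    ("Z", []),
]
tokens = serialize(contour)
```

Keeping the bin index as the token and dividing by 10 only when rendering means the vocabulary stays integer-valued while the decoded coordinates retain 0.1-unit resolution.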

3. Training Methodology and Multimodal Conditioning

The training procedure is divided into two stages:

Stage 1: The model is trained via autoregressive continuation on broad, noisy font data using only text prompts, with no vision encoder enabled. This teaches robust long-horizon SVG grammar and geometric control.
Stage 2: The model is post-trained on a curated, high-quality subset with both style text tags and image exemplars. Here, both cross-attention masking and descriptive tag alignment are enabled.

At inference, the model is primed by concatenating [STYLE_TAGS], [CONTENT_CODEPOINT], optional [<IMG_EMBS>], and [<SVG>]. It then samples path and coordinate tokens; each contour is closed by a “Z” token, and <EOS> marks glyph completion. This staged approach achieves tight style-geometry alignment and out-of-distribution family generalization (Huang et al., 25 Feb 2026).
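Once sampling stops, the token stream still has to be turned back into an SVG path string. The sketch below shows one way this could be done, under stated assumptions: command tokens are the strings "M", "L", "Q", "Z", coordinate tokens are integer bins in 0..10000 that map to bin/10 units, and `tokens_to_path_d` is a hypothetical helper. It does not flip the y-axis, which a renderer may need because font outlines are usually y-up while SVG is y-down.

```python
# Number of coordinate tokens that follow each command token.
ARITY = {"M": 2, "L": 2, "Q": 4, "Z": 0}
EOS = "<EOS>"

def tokens_to_path_d(tokens):
    """Convert sampled tokens into the `d` attribute of an SVG <path>.

    Decoding stops at the first <EOS>; malformed streams raise ValueError
    instead of producing a broken path.
    """
    parts = []
    i = 0
    while i < len(tokens) and tokens[i] != EOS:
        command = tokens[i]
        if command not in ARITY:
            raise ValueError(f"expected a command at position {i}, got {command!r}")
        n = ARITY[command]
        coords = tokens[i + 1 : i + 1 + n]
        if len(coords) != n or not all(isinstance(c, int) for c in coords):
            raise ValueError(f"{command} at position {i} needs {n} coordinate tokens")
        # Bin index -> units on the 1000-unit grid, printed without trailing zeros.
        parts.append(command + " ".join(format(c / 10, "g") for c in coords))
        i += 1 + n
    return " ".join(parts)

sampled = ["M", 1000, 2000, "L", 5000, 5000, "Q", 6000, 7000, 1234, 0, "Z", EOS]
path_d = tokens_to_path_d(sampled)
```

The resulting string can be placed in `<path d="..."/>` inside an `<svg>` element whose `viewBox` spans the 1000-unit grid.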

4. Quantitative and Qualitative Evaluation

VecGlypher is evaluated on a cross-family OOD (out-of-distribution) test split with two primary modes: text-only (style tags + codepoint) and image-referenced (exemplar glyph image + codepoint).

Key metrics:

Model	Setting	R-ACC↑	CD↓	FID↓
VecGlypher-27B	Text	100.5	1.72	3.46
DeepVecFont-v2	Text	98.2	2.45	7.8
VecGlypher-27B	Img	99.12	1.18	2.32
DeepVecFont-v2	Img	—	—	—
DualVector	Text	95.6	3.10	9.1

Recognition accuracy (R-ACC), Chamfer distance (CD) between outlines, and FID (on rendered rasters) are the primary axes of comparison.
Generation speed is 14.7 glyphs/sec (27B model) on a single H200, matching or exceeding previous vector-specific baselines (Huang et al., 25 Feb 2026).
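For reference, a Chamfer distance between two outlines can be computed from points sampled along each contour. The sketch below uses a common symmetric, unsquared definition; the exact variant, point-sampling density, and scaling behind the reported CD values are not given here, so this should be read as an illustration and not as the paper's evaluation code.

```python
import math

def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance between two non-empty 2D point sets.

    Each direction averages the distance from every point in one set to its
    nearest neighbour in the other; the two directions are summed.
    """
    if not points_a or not points_b:
        raise ValueError("both point sets must be non-empty")

    def one_way(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(points_a, points_b) + one_way(points_b, points_a)

# Two unit squares, the second shifted right by 0.5 units.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
shifted = [(x + 0.5, y) for x, y in square]
cd = chamfer_distance(square, shifted)
```

This brute-force version is quadratic in the number of points; larger evaluations typically use a spatial index for the nearest-neighbour search.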

Ablation results:

Model scale and the staged training process are critical for achieving OOD generalization and geometric accuracy.
Absolute coordinate serialization yields better geometric fidelity (CD) than relative or mixed coordinate encodings.
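To make the distinction between the two encodings concrete, the sketch below resolves a path written with relative commands (lowercase m/l/q/z, as in SVG path syntax) into the absolute M/L/Q/Z form. It is a hypothetical helper, not the paper's preprocessing code. It shows that every relative coordinate depends on the running current point, so an error in one decoded token shifts everything after it, whereas absolute tokens can each be read on their own.

```python
def to_absolute(path):
    """Resolve relative m/l/q/z commands into absolute M/L/Q/Z commands.

    `path` is a list of (command, [(x, y), ...]) pairs. For a relative "q",
    both the control point and the end point are offsets from the current
    point at the start of the command, as in SVG path syntax.
    """
    absolute = []
    current = (0.0, 0.0)
    subpath_start = current
    for command, points in path:
        if command in ("Z", "z"):
            absolute.append(("Z", []))
            current = subpath_start  # closing returns to the subpath start
            continue
        if command.islower():
            points = [(current[0] + dx, current[1] + dy) for dx, dy in points]
        absolute.append((command.upper(), points))
        current = points[-1]
        if command in ("M", "m"):
            subpath_start = current
    return absolute

relative = [
    ("m", [(10, 10)]),
    ("l", [(5, 0)]),
    ("q", [(0, 5), (-5, 5)]),
    ("z", []),
]
absolute = to_absolute(relative)
```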

5. Comparative Landscape

VecGlypher advances prior lines of work in both direct vector glyph generation and image-conditioned text-to-vector synthesis:

Contrast with DeepVecFont: DeepVecFont synthesizes vector glyphs via dual-modality (image and sequence) encoders, random MDN/GMM sampling in unstructured command space, and two-stage differentiable rasterization. It relies on a separate decoding and refinement stage for global alignment and achieves high-quality Bézier-curve fonts (Wang et al., 2021). VecGlypher, by contrast, eschews raster intermediates and directly emits SVG/coordinate sequences, enabling prompt-driven generation with full integration of language and vision cues.
Relation to VectorFusion and Grimoire: VectorFusion and Vector Grimoire both learn to generate SVG representations, but rely either on pixel-to-vector optimization under diffusion model guidance (Jain et al., 2022) or on codebook-based autoencoding of rasterized glyph patches followed by autoregressive token modeling (Feuerpfeil et al., 2024). VecGlypher generalizes these approaches by unifying text, image, and vector geometry in a single, scalable LLM framework, and does not require differentiable rasterization or vectorization at inference.
Implications for NLP: In the spirit of glyph-based embedding models for logographic scripts, such as Glyce, which fuses CNN features of historical script glyphs with BERT (Meng et al., 2019), and glyph-aware CNNs for Chinese character segmentation and language modeling (Dai et al., 2017), VecGlypher could provide a direct path to multi-style, multi-language character generation pipelines with editable vector outputs. This suggests applications in type-driven NLP tasks and flexible multilingual font generation.

6. Extensions, Limitations, and Future Directions

VecGlypher introduces several architectural innovations and outlines promising extensions:

Unified autoregressive SVG emitter: Handles both text and image conditioning, producing full watertight vector paths in a single pass.
Staged SFT pipeline: Decouples geometry learning and style alignment for improved OOD robustness.
End-to-end editability: No postprocessing or intermediate rasterization required.
Planned directions:
- Extension to full Unicode (diacritics, CJK, etc.) using targeted fine-tuning.
- Addition of cubic Bézier and arc commands, contingent on grammar simplicity and model scalability.
- Enhanced context windows for multi-glyph or typeset-aware synthesis.
- Interactive, user-in-the-loop style refinement, leveraging the prompt-driven nature of multimodal LLMs.
- Efficient adaptation to new typographic domains via further supervision or structured disentanglement (Huang et al., 25 Feb 2026).

VecGlypher thus represents a convergence of multimodal generative modeling, scalable SVG path inference, and human-interpretable style control, marking a transition from raster-bound pipelines to fully vector-native, prompt-driven glyph and font design.