SVG-Strokes Dataset Overview
- SVG-Strokes dataset family is a collection of vector-based representations of sketches, icons, and glyphs with stroke-level annotation across diverse categories.
- Each dataset employs distinct encoding and normalization schemes that facilitate detailed analysis and deep learning model benchmarking for segmentation and generation tasks.
- The annotated datasets underpin various applications including icon synthesis, handwriting analysis, and typographic character generation with rigorous quality controls.
The SVG-Strokes dataset refers to a family of datasets providing vector-based representations of sketches, glyphs, or icons where the fundamental unit is the parametric “stroke.” These datasets have enabled the development and benchmarking of deep neural generative or discriminative models for hand-drawn symbol segmentation, SVG icon generation, and large-scale vector glyph synthesis. Distinct datasets under the SVG-Strokes umbrella have been released for: (1) hand-drawn semantic symbols (Kaiyrbekov et al., 2019), (2) large-scale iconography (Carlier et al., 2020), and (3) Chinese character glyphs (Zhang et al., 14 Nov 2025). Each dataset is characterized by rigorous annotation and systematic parameterization at the level of paths or strokes, facilitating a range of modeling and analysis tasks.
1. Dataset Families and Composition
The original SVG-Strokes dataset (Kaiyrbekov et al., 2019) comprises 2,500 human-drawn sketches across five categories (“airplane,” “cat,” “chair,” “firetruck,” “flower”), each category containing 500 samples. Each sketch consists of an ordered list of strokes, where each stroke is semantically annotated by domain expert annotators. Semantic labels are category-specific; for instance, “cat” includes nine labels (body, ear, eye, head, leg, mouth, nose, tail, whisker), and “firetruck” includes seven (body, cab, ladder, light, water hose, window, wheel).
The SVG-Icons8 dataset (Carlier et al., 2020), referenced in some codebases as an “SVG-Strokes” dataset, contains 100,000 distinct SVG icons spanning 56 iconographic categories (e.g., “arrows,” “animals,” “interface”). Each icon is an ordered set of SVG paths (≈7.4 per icon on average), each path encoded as a fixed-length command sequence.
The SVG-Strokes Chinese Glyph dataset (Zhang et al., 14 Nov 2025) comprises 907,267 samples of Chinese character images, divided into 744,810 regular-script and 162,457 semi-cursive glyphs, covering up to 9,574 distinct characters. The dataset is subdivided by semantic category (words, idioms, verses) with each SVG file representing a glyph composed of ordered strokes consistent with canonical writing order.
| Dataset | Total Samples | Primary Categories | Label Level | Typical Use Case |
|---|---|---|---|---|
| SVG-Strokes Symbols | 2,500 | 5 | Stroke, Semantic Part | Sketch parsing, segmentation |
| SVG-Icons8 | 100,000 | 56 | Path (no semantics) | SVG icon synthesis, animation |
| SVG-Strokes Glyphs | 907,267 | 3 (word/idiom/verse) | Stroke, Stroke Order | Character/glyph generation |
2. Data Structure and Encoding Schemes
The SVG-Strokes symbols dataset stores each sketch as an NDJSON record containing an identifier, a "drawing" field (array of strokes), and a "labels" field (semantic class per stroke). Each stroke is a sequence of 5D vectors , representing pen movement and state. Pen states use a one-hot encoding for “pen down,” “pen up,” and “pen lifted.” Strokes are semantically labeled according to pre-defined per-category components. Directory structure includes one JSON file per category under a central directory.
SVG-Icons8 represents each icon as a fixed-size set of SVG paths, each path as a fixed-length sequence of commands. Each command is a tuple , with (command type) in the set {<SOS>, MoveTo, LineTo, CubicBezier, ClosePath, <EOS>}, and for positional and control parameters. Data is serialized for both native SVG/XML and PyTorch/NumPy tensor formats.
The SVG-Strokes Glyph dataset encodes each glyph as a standalone SVG file; each path represents a full calligraphic stroke ordered as in canonical writing. Paths are standardized by converting all segments to cubic Bézier form, expressed as arrays of up to 64 sequential 6D parameter tuples (one Bézier segment per set of six values), with normalization to a coordinate box. Auxiliary files manifest split mappings and per-stroke statistics.
3. Annotation Protocols and Quality Control
For SVG-Strokes symbols (Kaiyrbekov et al., 2019), expert annotators selected well-drawn QuickDraw! samples that contain a complete set of expected subparts and diverse drawing styles. Every stroke was labeled with its semantic component per category-specific definitions. When strokes ambiguously intersected multiple parts, the dominant semantic part determined the label. Quality control included double-checking bounding box overlaps and second-level spot-checking of 10% of the dataset for labeling consistency, with conflicts resolved by discussion.
SVG-Icons8 does not provide per-path semantic annotations; path attributes are limited to visibility and fill type (outline, fill, erase). The only grouping is at the icon level. All files share a normalized canvas layout.
The SVG-Strokes Glyph dataset ensures each path corresponds to a calligraphic stroke, with order matching traditional writing conventions. Command conversion (quadratic-to-cubic Béziers) and normalization enforce uniformity across calligraphic styles and categories. Auxiliary metadata supports downstream selection by semantic category, style, stroke count, and instruction count.
4. Preprocessing, Normalization, and Format Standardization
SVG-Strokes symbols sketches are normalized by scaling the drawing to the [0, 255] plane, resampling strokes to have ~1 pixel between consecutive points, simplifying curves using Ramer–Douglas–Peucker ( px), and discarding strokes shorter than 15 pixels. The pen-movement representation () preserves local curvature; no further smoothing is applied.
SVG-Icons8 icons are all rescaled to a viewBox with paths re-oriented to ensure a consistent starting point. All relative commands are rewritten as absolute. Any non-basic path primitive (arc, ellipse, etc.) is converted to cubic Bézier or line sequences. Segments are recursively subdivided to maintain px and min angle , followed by simplification algorithms (Ramer–Douglas–Peucker or Schneider’s for cubic segments). Command arguments are quantized into 8-bit bins plus an “unused” bin.
The SVG-Strokes Glyph dataset processes all paths to cubic Bézier form and applies linear scaling/translating to the 0 box. Each stroke is either padded or truncated to 64 segments, resulting in a fixed-shape 1 tensor for every stroke, easing downstream modeling. VQ-VAE style vector quantization is applied to map each stroke representation into eight discrete codebook indices from a set of 30,000, facilitating discrete modeling (Zhang et al., 14 Nov 2025).
5. Benchmarking Tasks, Metrics, and Baselines
The SVG-Strokes symbols dataset supports two central tasks: stroke-level semantic segmentation and generative sketch reconstruction. For segmentation, the benchmark is component-level accuracy, i.e., the proportion of correctly labeled strokes. The loss is weighted cross-entropy, with weights 2, where 3 is the number of strokes of class 4. Baseline models include SVM+IDM feature sets and pixel-based methods (CRF, CNN), against which the NN-based approach achieves 91.9% average accuracy, outperforming SVMs (83.8%).
For reconstruction, a stroke-rnn VAE model is trained per-sketch, optimizing the variational autoencoder objective: 5 with the total loss incorporating data, pen-state, and KL-divergence terms, with annealing on the KL weight. Quantitative evaluation is by negative log-likelihood and qualitative temperature-controlled reconstructions.
The SVG-Icons8 framework supports hierarchical generative modeling for vector graphics, with PyTorch dataset loading and command-level or icon-level embedding statistics available directly from provided scripts.
The SVG-Strokes Glyph dataset supports vectorized character generation and sequence modeling. The dataset is explicitly used to train LLM-fine-tuned models (e.g. LVGM) to predict the next stroke given context, with evaluation based on identification ratings and scaling behavior analyses.
6. Distribution, Licensing, and Citation
SVG-Strokes symbols (Kaiyrbekov et al., 2019) is distributed under Creative Commons Attribution 4.0 (CC-BY-4.0) via https://github.com/kurmanbekov/svg-strokes-dataset. The repository contains the JSON-encoded drawings per category, stroke-level labels, and a detailed README.
SVG-Icons8 (Carlier et al., 2020) (sometimes called “SVG-Strokes” in code) is distributed under open-source terms and is available for direct pip installation and as PyTorch/NumPy data via https://github.com/alexandre01/deepsvg, with full access to the 100k raw SVG icon files and preprocessing metadata.
The SVG-Strokes Chinese Glyph dataset (Zhang et al., 14 Nov 2025) is slated for open-source release (CC-BY 4.0 or Apache 2.0) via GitHub and Zenodo. Each sample is an SVG file containing paths ordered by canonical writing sequence, together with a manifest, split files, and per-sample statistics. Citation of the originating paper is required for academic use.
7. Research Impact and Applications
The SVG-Strokes datasets have established standardized benchmarks for stroke-based analysis, generative modeling of sketches, icon generation, and vector glyph synthesis. The collaborative annotation, consistent low-level representation, and explicit labeling protocols set by the symbols dataset have driven advances in neural network segmentation and sketch-based symbol understanding (Kaiyrbekov et al., 2019). SVG-Icons8 has facilitated hierarchical generative approaches to vector data, decoupling high-level visual structure from low-level rendering commands (Carlier et al., 2020). The SVG-Strokes glyph corpus has enabled token-level generative sequence modeling for typographically accurate character synthesis, directly supporting the fine-tuning and scaling experiments in LLM-based glyph generation (Zhang et al., 14 Nov 2025). The datasets collectively underpin a broad array of applications, including animation, handwriting analysis, iconography, and stylized font generation.