
Chinese SVG Dataset Overview

Updated 22 November 2025
  • Chinese SVG Dataset is a comprehensive resource that decomposes 92,560 Hanzi into approximately 11,000 reusable vector components for scalable font generation.
  • It employs detailed JSON/CSV annotations and six primary layout types to enable precise glyph synthesis and efficient model training.
  • The structured metadata and SVG outlines facilitate advanced typographic workflows, zero-shot extension, and digital font engineering.

The Chinese SVG Dataset, as introduced by the work "Efficient and Scalable Chinese Vector Font Generation via Component Composition" (Song et al., 10 Apr 2024), is a large-scale, component-centric dataset designed for efficient vector font generation and manipulation of Chinese characters in scalable vector graphics (SVG) format. The dataset systematically decomposes 92,560 Hanzi (Chinese characters) into minimal, reusable components, together with rich structural, spatial, and vector outline metadata, facilitating scalable character synthesis, model training, and advanced digital typography workflows.

1. Scope, Composition, and Coverage

The dataset encompasses 92,560 Unicode-encoded Chinese characters covering all CJK Unified Ideographs, including Extensions A–F. These characters are disassembled into approximately 11,000 unique components, where each component is defined as a self-contained stroke-group capable of independent use in composition; radicals form a strict subset of these components. Notably, a core set of approximately 1,100 "base" components can generate over 60,000 characters through recursive compositionality.

Layout annotation relies on six principal categories—NL00 (Isolated), NL01 (Left–Right), NL02 (Top–Bottom), NL03 (Enclosed, eight subtypes), NL04 (Left–Middle–Right), and NL05 (Top–Middle–Bottom). Complex cases are represented as nested combinations, forming a tree structure over the primary layout types. Distribution of characters among these layouts includes 38% Left–Right, 27% Top–Bottom, 16% Enclosed, about 6% split between the three-part configurations, and 4% fully isolated. The ten most commonly used components each occur in 5,000–10,000 characters.
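To make the nested representation concrete, the following is a minimal sketch of how such a decomposition tree could be encoded in Python/JSON; the field names (char, layout, children, component) and the example character are illustrative assumptions, not the dataset's exact schema.

```python
# Illustrative sketch of a nested decomposition tree (field names are
# assumptions, not the dataset's released schema). The character 湖 splits
# Left-Right (NL01) into 氵 and 胡, and 胡 itself splits Left-Right into
# 古 and 月, yielding a two-level tree over the primary layout types.
tree = {
    "char": "湖",
    "layout": "NL01",            # Left-Right
    "children": [
        {"component": "氵"},
        {
            "layout": "NL01",    # nested Left-Right substructure
            "children": [{"component": "古"}, {"component": "月"}],
        },
    ],
}
```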

2. Data Structure, Representation, and Formats

All dataset metadata and annotations utilize open, non-proprietary JSON/CSV schemas. Each character entry in decomposition.json (or corresponding CSV/TSV) links the Unicode codepoint to its structural decomposition, explicit component list, and layout type. Each component’s Unicode point is cataloged in components.json, and a separate file specifies layout definitions.
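A minimal loading sketch is given below, assuming the JSON files follow the key structure described above; the exact field names in the released files may differ.

```python
import json

# Load the character-to-component mapping and the component catalogue.
# The keys "layout" and "components" are assumptions based on the
# description above; consult the released schema for the exact names.
with open("decomposition.json", encoding="utf-8") as f:
    decomposition = json.load(f)
with open("components.json", encoding="utf-8") as f:
    components = json.load(f)

entry = decomposition["U+6C49"]   # 汉, hypothetical codepoint key format
print(entry["layout"])            # e.g. "NL01" (Left-Right)
print(entry["components"])        # ordered list of component codepoints
```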

SVG outlines are provided for every component across seven reference fonts: SourceHanSans, BabelStone, AlibabaPuHuiTi, OPPOSansR, SmileySans, YouAiYuanTi, and WenDingKaiTi. Each glyph is described as cubic Bézier path data within the SVG, normalized on a 256×256 raster grid (bottom-left origin matching standard font units-per-em), suitable for direct vector processing or rasterization.

Each SVG includes auxiliary metadata such as the original font bounding box and explicit control-point coordinates, supporting precise geometric manipulation. Supporting files include per-glyph bounding boxes (in [0,1] normalized em-box units) and decomposition trees for structurally nested characters.
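A small sketch of extracting the outline paths and converting a normalized bounding box back to grid units follows; it assumes standard SVG path elements in the usual SVG namespace and the [0,1] em-box convention described above, and the file path is hypothetical.

```python
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"
EM = 256  # glyphs are normalized on a 256x256 grid (bottom-left origin)

def load_path_data(svg_file):
    """Return the raw cubic-Bezier path strings ("d" attributes) of a glyph SVG."""
    root = ET.parse(svg_file).getroot()
    return [p.get("d") for p in root.iter(SVG_NS + "path")]

def bbox_to_grid(bbox):
    """Scale a [0,1]-normalized (xmin, ymin, xmax, ymax) box to 256-unit grid coordinates."""
    return tuple(v * EM for v in bbox)

paths = load_path_data("fonts/SourceHanSans/6C49.svg")  # hypothetical file layout
print(bbox_to_grid((0.05, 0.10, 0.48, 0.92)))
```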

3. Annotation Principles and Manual Disassembly

Disassembly—fragmentation of Hanzi into irreducible components—is performed via human annotation, employing both crowd-sourcing and expert review to ensure high fidelity. Components are listed in reading order corresponding to the layout: left-to-right or top-to-bottom, according to their NL class. Nested structures are captured as JSON trees, tagging each substructure with corresponding layout and associated children. Bounding box anchors per component provide explicit spatial ground truth for affine placement, directly supporting downstream model training and geometric synthesis tasks.
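Because the annotated bounding boxes live in a normalized em box, each one directly determines an axis-aligned ground-truth placement (scale plus translation) for a component whose outline is normalized to the unit box. The sketch below assumes (xmin, ymin, xmax, ymax) boxes in [0,1] em units and is not taken from any released code.

```python
import numpy as np

def affine_from_bbox(bbox):
    """Build a 2x3 affine that maps the unit box [0,1]^2 onto bbox.

    bbox is (xmin, ymin, xmax, ymax) in normalized em units, as in the
    per-component bounding-box annotations described above.
    """
    xmin, ymin, xmax, ymax = bbox
    return np.array([[xmax - xmin, 0.0, xmin],
                     [0.0, ymax - ymin, ymin]])

A = affine_from_bbox((0.05, 0.10, 0.48, 0.92))  # e.g. placing a left-side component
```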

4. Mathematical Formulations and Loss Functions

The core modeling approach leverages spatial transformer networks (STNs) that predict, for each component, a 2×3 affine transformation matrix

A = \begin{bmatrix} a & b & t_x \\ c & d & t_y \end{bmatrix}

applied as $p' = \begin{bmatrix} a & b \\ c & d \end{bmatrix} p + \begin{bmatrix} t_x \\ t_y \end{bmatrix}$ for each Bézier control point $p = (x, y)^T$.
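A minimal NumPy sketch of applying such a 2×3 matrix to a set of control points is shown below; the matrix values are example placeholders, not predictions from the authors' model.

```python
import numpy as np

def apply_affine(A, points):
    """Apply a 2x3 affine matrix A to an (N, 2) array of control points.

    Computes p' = M p + t with M = A[:, :2] and t = A[:, 2], matching the
    formulation above.
    """
    points = np.asarray(points, dtype=float)
    return points @ A[:, :2].T + A[:, 2]

A = np.array([[0.45, 0.0, 0.03],   # example values; an STN would predict these
              [0.0, 0.80, 0.10]])
print(apply_affine(A, [[0.0, 0.0], [1.0, 1.0]]))
```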

Four primary loss functions guide the training and evaluation of the STN regressor:

  1. Pixel (L₁) loss:

L_{\text{pixel}} = \|S - C\|_1

  2. Overlap loss (penalizes spatial intersections):

L_{\text{overlap}} = \frac{\sum_{x, y} \min(\max(S(x, y) - 1, 0), 1)}{\sum_{x, y} S(x, y)}

for binary masks $S > 0$.

  3. Centroid loss (matches first raw moments):

L_{\text{centroid}} = \frac{1}{2}\left( \|\mu_x(S) - \mu_x(C)\|_1 + \|\mu_y(S) - \mu_y(C)\|_1 \right)

where $\mu_x(I) = \Phi(I; 1,0)/\Phi(I; 0,0)$, $\mu_y(I) = \Phi(I; 0,1)/\Phi(I; 0,0)$, and $\Phi(I; i, j) = \sum_{x, y} I(x, y)\, x^i y^j$.

  4. Inertia loss (matches second central moments):

L_{\text{inertia}} = |\Psi(S) - \Psi(C)|

with $\Psi(I) = \Phi(I; 2,0) - [\Phi(I; 1,0)]^2/\Phi(I; 0,0) + \Phi(I; 0,2) - [\Phi(I; 0,1)]^2/\Phi(I; 0,0)$.

The aggregate loss combines these terms with layout-dependent weights: $\mathcal{L} = \alpha L_{\text{pixel}} + \beta L_{\text{overlap}} + \gamma L_{\text{centroid}} + \theta L_{\text{inertia}}$.
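The following is a minimal NumPy sketch of the four terms on rasterized masks, assuming $S$ is the composited (summed) component raster and $C$ the target character raster; it follows the formulas above rather than any released training code.

```python
import numpy as np

def phi(I, i, j):
    """Raw image moment Phi(I; i, j) = sum_xy I(x, y) * x^i * y^j."""
    x = np.arange(I.shape[1])[None, :]
    y = np.arange(I.shape[0])[:, None]
    return float(np.sum(I * (x ** i) * (y ** j)))

def centroid(I):
    """First raw moments (mu_x, mu_y)."""
    m = phi(I, 0, 0)
    return phi(I, 1, 0) / m, phi(I, 0, 1) / m

def psi(I):
    """Sum of second central moments along x and y."""
    m = phi(I, 0, 0)
    return (phi(I, 2, 0) - phi(I, 1, 0) ** 2 / m
            + phi(I, 0, 2) - phi(I, 0, 1) ** 2 / m)

def composition_loss(S, C, alpha=1.0, beta=1.0, gamma=1.0, theta=1.0):
    """Weighted sum of pixel, overlap, centroid, and inertia terms."""
    l_pixel = np.abs(S - C).sum()                            # L1 pixel loss
    l_overlap = np.clip(S - 1.0, 0.0, 1.0).sum() / S.sum()   # overlap penalty
    (sx, sy), (cx, cy) = centroid(S), centroid(C)
    l_centroid = 0.5 * (abs(sx - cx) + abs(sy - cy))
    l_inertia = abs(psi(S) - psi(C))
    return (alpha * l_pixel + beta * l_overlap
            + gamma * l_centroid + theta * l_inertia)
```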

5. Accessibility, Organization, and Licensing

A preview version of the dataset is publicly accessible online. Full access can be requested via the corresponding author. The directory structure is as follows:

| Directory/File | Content Summary | Purpose |
| --- | --- | --- |
| decomposition.json/csv | Character-to-components decomposition | Structural queries, data loading |
| components.json | List of all component codepoints | Dictionary of building blocks |
| fonts/ | SVG glyphs for each reference font | Outline extraction, font adaptation |
| annotations/bounding_boxes.json | Bounding boxes for components | STN training and evaluation |
| annotations/trees/ | Nested layout decomposition trees | Hierarchical analysis |

The dataset is released under the Creative Commons BY-NC 4.0 license for academic and non-commercial use. Citation of the associated IJCAI’24 publication is required for downstream research or tool development.

6. Use Cases and Integration Pathways

The dataset enables direct lookup of any constituent component and its role in over 92,000 Hanzi, supports SVG path data extraction for vector-based synthesis, and allows immediate application or fine-tuning of STN-based regressors on custom fonts. The detailed structural and geometric metadata supports seamless integration into typical SVG/TTF/OTF font toolchains, facilitating large-scale Chinese font generation, extension, and precise vector manipulation in computer graphics, digital publishing, and typographic analysis (Song et al., 10 Apr 2024).

A plausible implication is that by capturing high-precision decompositions and providing training-ready geometric anchor data, the dataset substantively lowers the barrier for scalable, data-driven Chinese font engineering, fostering advances in zero-shot extension, stylistic interpolation, and intelligent glyph synthesis within vector graphic frameworks.

References

  1. Song et al., "Efficient and Scalable Chinese Vector Font Generation via Component Composition," IJCAI 2024 (10 Apr 2024).