InternSVG: Unified SVG Modeling
- InternSVG is a unified multimodal large language model that uses SVG-specific tokenization and joint training to address understanding, editing, and generation tasks.
- SAgoge is an extensive SVG-centric dataset of over 16 million samples spanning static icons, illustrations, chemical diagrams, and animated SVGs, enabling robust multi-level supervision.
- The SArena benchmark standardizes evaluation across icon, illustration, chemistry, and animation tasks, on which InternSVG demonstrates significant performance improvements over prior methods.
InternSVG refers to a unified multimodal large language model (MLLM) architecture, an extensive SVG-centric dataset ("SAgoge"), and an accompanying benchmark ("SArena") designed to address the spectrum of scalable vector graphics (SVG) tasks: understanding, editing, and generation. The InternSVG system is specifically developed to overcome entrenched challenges in SVG modeling, including dataset fragmentation, weak cross-task transferability, and the handling of SVG structural complexity. Its core innovation lies in combining SVG-specific encoding schemes, a curriculum-based training strategy, and a comprehensive evaluation protocol, resulting in superior performance and generalization across static, scientific, and animated SVG domains (Wang et al., 13 Oct 2025).
1. Unified SVG Task Formulation
InternSVG is constructed as a single MLLM capable of handling SVG tasks across three task families: understanding, editing, and generation. This unification is achieved through joint modeling, which enables positive transfer between related tasks. The model accepts SVG code, rendered images, or natural-language prompts as input, and produces SVG code, answers to semantic questions about SVG structure, SVGs edited according to specified instructions, or newly synthesized SVG graphics.
Key to InternSVG's approach is an extended tokenizer that introduces 55 SVG tag tokens (e.g., <svg>, <path>, <circle>, <animateTransform>) and 42 attribute tokens (e.g., viewBox, fill). These tokens directly represent SVG semantics, substantially compressing SVG sequences and facilitating more accurate code understanding and autoregressive generation. The token inventory also defines numerical tokens covering the full range required for SVG coordinate and attribute encoding, eliminating common ambiguities in parsing SVG numerical values.
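A minimal sketch of how such an SVG-specific vocabulary could be grafted onto the backbone tokenizer, assuming the Hugging Face tokenizer for the Qwen2.5-7B backbone; the token lists are illustrative excerpts, not the paper's full 55-tag/42-attribute inventory.

```python
# Minimal sketch: registering SVG tag and attribute tokens on the backbone tokenizer.
# Token lists below are illustrative excerpts, not the paper's exact inventory.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")

svg_tag_tokens = ["<svg>", "</svg>", "<path>", "<circle>", "<animateTransform>"]
svg_attr_tokens = ["viewBox", "fill", "stroke", "transform"]

# Each registered string now maps to a single token id wherever it occurs,
# instead of being split into several subword pieces.
num_added = tokenizer.add_special_tokens(
    {"additional_special_tokens": svg_tag_tokens + svg_attr_tokens}
)

svg = '<svg viewBox="0 0 24 24"><circle fill="red" r="10"/></svg>'
ids = tokenizer(svg)["input_ids"]
print(num_added, len(ids))
print(tokenizer.convert_ids_to_tokens(ids))
```

After extending the vocabulary, the backbone's embedding matrix must be resized, with the new rows initialized as described in Section 4.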
2. SAgoge Dataset Structure and Diversity
The SAgoge dataset is central to InternSVG’s training. It surpasses previous datasets in scale and variety, comprising over 16 million training samples that include:
- Static SVGs: Icons, general illustrations, and scientific diagrams (notably chemical formula SVGs)
- Long-range Illustrations: Complex multi-object drawings requiring long-token sequence modeling and hierarchical reasoning
- Dynamic Animations: SVGs annotated with temporal transformations, including <animate> and <animateTransform> elements for motion synthesis
These data types are annotated for multi-level SVG structure (XML tree depth, attributes, relationships) and offer granular coverage across basic to expert-level SVG constructs. This diversity supports cross-task supervision, allowing the model to learn transferable representations and compositional reasoning that conventional, single-domain datasets do not provide.
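For concreteness, the sketch below shows a hypothetical SAgoge-style training record; the field names and values are assumptions for illustration only, not the dataset's published schema.

```python
# Hypothetical multi-level training record (schema is illustrative, not official).
sample = {
    "task": "editing",  # one of: understanding | editing | generation
    "instruction": "Change the circle's fill to blue and add a fade-in animation.",
    "input_svg": '<svg viewBox="0 0 24 24"><circle cx="12" cy="12" r="10" fill="red"/></svg>',
    "target_svg": (
        '<svg viewBox="0 0 24 24">'
        '<circle cx="12" cy="12" r="10" fill="blue">'
        '<animate attributeName="opacity" from="0" to="1" dur="1s"/>'
        '</circle></svg>'
    ),
    "rendered_image": "renders/sample_000001.png",  # raster for image-conditioned tasks
    "structure": {"tree_depth": 2, "element_count": 2},  # multi-level structural annotation
}
```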
3. SArena Benchmark Design
SArena is a benchmark suite tightly coupled to SAgoge, operationalizing the breadth of its data and providing standard protocols for measurement and comparison. It is partitioned into four principal SVG domains:
| Domain | Example Task | Key Metrics |
|---|---|---|
| Icon | Multiple-choice QA | Accuracy, PSNR, FID |
| Illustration | Text/Image generation | FID, FID-CLIP, CLIPScore |
| Chemistry | Formula generation | FID, CLIP-I2I, correctness |
| Animation | Text/Video-to-SANI | Human preference, FID |
Understanding tasks evaluate semantic inference from SVG code, including structural and attribute reasoning. Editing tasks span both pixel-level (LPIPS, SSIM, PSNR) and semantic-level modifications. Generation tasks are measured using standard perceptual and distributional metrics including FID (Fréchet Inception Distance), CLIPScore, and FID-CLIP, with additional human preference studies for animation.
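As an illustration of the pixel-level editing metrics, the sketch below renders a reference and a predicted SVG to rasters and scores them with PSNR and SSIM; the choice of cairosvg and scikit-image is an assumption, not the benchmark's documented pipeline, and the file paths are placeholders.

```python
# Sketch: pixel-level comparison of a reference and a predicted SVG (PSNR/SSIM).
import io

import cairosvg
import numpy as np
from PIL import Image
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def render(svg_code: str, size: int = 256) -> np.ndarray:
    """Rasterize SVG code to an RGB array at a fixed resolution."""
    png_bytes = cairosvg.svg2png(
        bytestring=svg_code.encode(), output_width=size, output_height=size
    )
    return np.asarray(Image.open(io.BytesIO(png_bytes)).convert("RGB"))

ref = render(open("reference.svg").read())   # ground-truth edit, placeholder path
out = render(open("predicted.svg").read())   # model output, placeholder path

psnr = peak_signal_noise_ratio(ref, out, data_range=255)
ssim = structural_similarity(ref, out, channel_axis=-1, data_range=255)
print(f"PSNR: {psnr:.2f} dB  SSIM: {ssim:.4f}")
```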
4. Model Architecture and Training Methodology
InternSVG employs a dual-stream architecture combining a ViT-based visual encoder (InternViT-300M) with a transformer LLM backbone (Qwen2.5-7B). The embedding of each SVG-specific special token is initialized as the average of the embeddings of the subwords it would otherwise be split into, preserving semantic priors from the pretrained vocabulary. This subword-based embedding initialization (Editor's term) promotes rapid convergence and semantic fidelity.
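A minimal sketch of this initialization, assuming a Hugging Face implementation of the Qwen2.5-7B backbone; the exact procedure in the paper may differ in detail.

```python
# Sketch: initialize newly added SVG token embeddings as the mean of the
# embeddings of the subwords each token would otherwise be split into.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "Qwen/Qwen2.5-7B"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16)

new_tokens = ["<svg>", "<path>", "<animateTransform>", "viewBox", "fill"]

# Record how the *base* tokenizer splits each string before registering it.
subword_ids = {t: tokenizer.encode(t, add_special_tokens=False) for t in new_tokens}

tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))

emb = model.get_input_embeddings().weight
with torch.no_grad():
    for tok in new_tokens:
        new_id = tokenizer.convert_tokens_to_ids(tok)
        # Averaged-subword initialization preserves the pretrained semantic prior.
        emb[new_id] = emb[subword_ids[tok]].mean(dim=0)
```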
Training proceeds in two progressive stages:
- Stage One: Focused on short, simpler SVGs (icons, chemistry) to anchor the model’s capability in basic semantic and structural patterns
- Stage Two: Expands to complex, long-sequence illustrations and animated SVGs, with balanced domain sampling to avoid overfitting
This staged curriculum mitigates issues of data imbalance, enables hierarchical learning, and supports scalability to increasingly sophisticated SVG scenarios.
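The staged schedule can be summarized as a simple data-mixture configuration; the domain ratios and sequence-length caps below are hypothetical placeholders rather than the paper's reported values.

```python
# Illustrative two-stage curriculum: short/simple SVGs first, then long
# illustrations and animations with balanced domain sampling.
CURRICULUM = [
    {
        "stage": 1,
        "domains": {"icon": 0.6, "chemistry": 0.4},  # short, simple SVGs
        "max_svg_tokens": 2048,
    },
    {
        "stage": 2,
        "domains": {"icon": 0.25, "chemistry": 0.25,          # keep earlier domains
                    "illustration": 0.25, "animation": 0.25},  # add long/animated SVGs
        "max_svg_tokens": 16384,
    },
]
```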
5. Experimental Evaluation and Performance
InternSVG’s unified approach yields consistent and substantial performance improvements across domains. On SArena-Icon, InternSVG attains an understanding accuracy roughly 8 points higher than top proprietary models (e.g., Claude-Sonnet-4), together with higher editing PSNR and lower generation FID. On illustration tasks, it achieves the best FID among compared methods and competitive CLIPScore. Chemistry SVG generation attains the lowest FID and highest semantic correctness due to robust representation of compound structures, outpacing prior art. In animation tasks, performance approaches the best proprietary results in both Text-to-SANI and Video-to-SANI.
Ablation studies indicate that joint training on understanding, editing, and generation yields further gains: positive task transfer among SVG understanding (QA), nuanced editing, and structured generation is empirically confirmed.
6. Applications and Implications
InternSVG’s unified modeling paradigm supports a spectrum of high-value applications:
- Digital and Web Design: Rapid, interpretable generation and editing of scalable graphics—icons, logos, and interactive illustrations
- Scientific Visualization: Synthesis of accurate chemical diagrams and technical SVGs from text descriptors
- Dynamic Graphics and Interfaces: Generation and manipulation of animated SVGs for user interface and interaction design
A plausible implication is that the unified model and SVG-specialized tokenization may foster further advances in vector-graphics reasoning within MLLMs. The curriculum training and efficient encoding strategies suggest directions for scalable, multi-domain graphic reasoning models.
7. Future Directions
InternSVG establishes a foundation for subsequent research into unified, multimodal SVG models. The scale and diversity of SAgoge, the efficiency of subword-based initialization, and the positive cross-task transfer evidenced in experiments indicate promising avenues for fully integrated graphic code understanding and generation. Further development may extend toward more complex visual domains, higher-order hierarchical reasoning, and integration of dynamic scene understanding. This suggests that rapid iteration and refinement of industry-standard SVG workflows is increasingly feasible with unified MLLM architectures.
InternSVG and its accompanying dataset and benchmark represent a comprehensive solution for modeling, editing, and generating SVGs, addressing fragmentation in previous approaches and establishing a scalable, interpretable, and high-performing foundation for future advancements in vector graphic AI systems (Wang et al., 13 Oct 2025).