InternSVG: Unified SVG Modeling
- InternSVG is a unified multimodal large language model that uses SVG-specific tokenization and joint training to address understanding, editing, and generation tasks.
- SAgoge is an extensive SVG-centric dataset of over 16 million samples spanning static icons, illustrations, chemical diagrams, and animated SVGs, enabling robust multi-level supervision.
- The SArena benchmark standardizes evaluation across icon, illustration, chemistry, and animation tasks, on which InternSVG demonstrates significant performance improvements over prior methods.
InternSVG refers to a unified multimodal large language model (MLLM) architecture, an extensive SVG-centric dataset ("SAgoge"), and an accompanying benchmark ("SArena") designed to address the spectrum of scalable vector graphics (SVG) tasks: understanding, editing, and generation. The InternSVG system is specifically developed to overcome entrenched challenges in SVG modeling, including dataset fragmentation, weak cross-task transferability, and the handling of SVG structural complexity. Its core innovation lies in combining SVG-specific encoding schemes, a curriculum-based training strategy, and a comprehensive evaluation protocol, resulting in superior performance and generalization across static, scientific, and animated SVG domains (Wang et al., 13 Oct 2025).
1. Unified SVG Task Formulation
InternSVG is constructed as a single MLLM capable of handling SVG tasks across three task families: understanding, editing, and generation. This unification is achieved through joint modeling, which enables positive transfer between related tasks. The model accepts SVG code, rendered images, or natural-language prompts as input, and produces SVG code, answers to semantic questions about SVG structure, SVGs edited according to specified instructions, or newly synthesized SVG graphics.
Key to InternSVG's approach is an extended tokenizer that introduces 55 SVG tag tokens (e.g., <svg>, <path>, <circle>, <animateTransform>) and 42 attribute tokens (e.g., viewBox, fill). These tokens directly represent SVG semantics, substantially compressing SVG sequences and facilitating more accurate code understanding and autoregressive generation. The token inventory also defines numerical tokens covering the full range required for SVG coordinate and attribute encoding, eliminating common ambiguities in parsing SVG numerical values.
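A minimal sketch of how such an SVG-specific vocabulary could be grafted onto the backbone tokenizer, assuming the Hugging Face tokenizer for the Qwen2.5-7B backbone; the token lists are illustrative excerpts, not the paper's full 55-tag/42-attribute inventory.

```python
# Minimal sketch: registering SVG tag and attribute tokens on the backbone tokenizer.
# Token lists below are illustrative excerpts, not the paper's exact inventory.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")

svg_tag_tokens = ["<svg>", "</svg>", "<path>", "<circle>", "<animateTransform>"]
svg_attr_tokens = ["viewBox", "fill", "stroke", "transform"]

# Each registered string now maps to a single token id wherever it occurs,
# instead of being split into several subword pieces.
num_added = tokenizer.add_special_tokens(
    {"additional_special_tokens": svg_tag_tokens + svg_attr_tokens}
)

svg = '<svg viewBox="0 0 24 24"><circle fill="red" r="10"/></svg>'
ids = tokenizer(svg)["input_ids"]
print(num_added, len(ids))
print(tokenizer.convert_ids_to_tokens(ids))
```

After extending the vocabulary, the backbone's embedding matrix must be resized, with the new rows initialized as described in Section 4.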
2. SAgoge Dataset Structure and Diversity
The SAgoge dataset is central to InternSVG’s training. It surpasses previous datasets in scale and variety, comprising over 16 million training samples that include:
- Static SVGs: Icons, general illustrations, and scientific diagrams (notably chemical formula SVGs)
- Long-range Illustrations: Complex multi-object drawings requiring long-token sequence modeling and hierarchical reasoning
- Dynamic Animations: SVGs annotated with temporal transformations, including <animate> and <animateTransform> elements for motion synthesis
These data types are annotated for multi-level SVG structure (XML tree depth, attributes, relationships) and offer granular coverage across basic to expert-level SVG constructs. This diversity supports cross-task supervision, allowing the model to learn transferable representations and compositional reasoning that conventional, single-domain datasets do not provide.
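For concreteness, the sketch below shows a hypothetical SAgoge-style training record; the field names and values are assumptions for illustration only, not the dataset's published schema.

```python
# Hypothetical multi-level training record (schema is illustrative, not official).
sample = {
    "task": "editing",  # one of: understanding | editing | generation
    "instruction": "Change the circle's fill to blue and add a fade-in animation.",
    "input_svg": '<svg viewBox="0 0 24 24"><circle cx="12" cy="12" r="10" fill="red"/></svg>',
    "target_svg": (
        '<svg viewBox="0 0 24 24">'
        '<circle cx="12" cy="12" r="10" fill="blue">'
        '<animate attributeName="opacity" from="0" to="1" dur="1s"/>'
        '</circle></svg>'
    ),
    "rendered_image": "renders/sample_000001.png",  # raster for image-conditioned tasks
    "structure": {"tree_depth": 2, "element_count": 2},  # multi-level structural annotation
}
```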
3. SArena Benchmark Design
SArena is a benchmark suite tightly coupled to SAgoge, operationalizing the breadth of its data and providing standard protocols for measurement and comparison. It is partitioned into four principal SVG domains:
| Domain | Example Task | Key Metrics |
|---|---|---|
| Icon | Multiple-choice QA | Accuracy, PSNR, FID |
| Illustration | Text/Image generation | FID, FID-CLIP, CLIPScore |
| Chemistry | Formula generation | FID, CLIP-I2I, correctness |
| Animation | Text/Video-to-SANI | Human preference, FID |
Understanding tasks evaluate semantic inference from SVG code, including structural and attribute reasoning. Editing tasks span both pixel-level (LPIPS, SSIM, PSNR) and semantic-level modifications. Generation tasks are measured using standard perceptual and distributional metrics including FID (Fréchet Inception Distance), CLIPScore, and FID-CLIP, with additional human preference studies for animation.
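As an illustration of the pixel-level editing metrics, the sketch below renders a reference and a predicted SVG to rasters and scores them with PSNR and SSIM; the choice of cairosvg and scikit-image is an assumption, not the benchmark's documented pipeline, and the file paths are placeholders.

```python
# Sketch: pixel-level comparison of a reference and a predicted SVG (PSNR/SSIM).
import io

import cairosvg
import numpy as np
from PIL import Image
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def render(svg_code: str, size: int = 256) -> np.ndarray:
    """Rasterize SVG code to an RGB array at a fixed resolution."""
    png_bytes = cairosvg.svg2png(
        bytestring=svg_code.encode(), output_width=size, output_height=size
    )
    return np.asarray(Image.open(io.BytesIO(png_bytes)).convert("RGB"))

ref = render(open("reference.svg").read())   # ground-truth edit, placeholder path
out = render(open("predicted.svg").read())   # model output, placeholder path

psnr = peak_signal_noise_ratio(ref, out, data_range=255)
ssim = structural_similarity(ref, out, channel_axis=-1, data_range=255)
print(f"PSNR: {psnr:.2f} dB  SSIM: {ssim:.4f}")
```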
4. Model Architecture and Training Methodology
InternSVG employs a dual-stream architecture combining a ViT-based visual encoder (InternViT-300M) with a transformer LLM backbone (Qwen2.5-7B). The embedding of each SVG-specific special token is initialized as the average of the embeddings of the subwords it would otherwise be split into, preserving semantic priors from the pretrained vocabulary. This subword-based embedding initialization (Editor's term) promotes rapid convergence and semantic fidelity.
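A minimal sketch of this initialization, assuming a Hugging Face implementation of the Qwen2.5-7B backbone; the exact procedure in the paper may differ in detail.

```python
# Sketch: initialize newly added SVG token embeddings as the mean of the
# embeddings of the subwords each token would otherwise be split into.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "Qwen/Qwen2.5-7B"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16)

new_tokens = ["<svg>", "<path>", "<animateTransform>", "viewBox", "fill"]

# Record how the *base* tokenizer splits each string before registering it.
subword_ids = {t: tokenizer.encode(t, add_special_tokens=False) for t in new_tokens}

tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))

emb = model.get_input_embeddings().weight
with torch.no_grad():
    for tok in new_tokens:
        new_id = tokenizer.convert_tokens_to_ids(tok)
        # Averaged-subword initialization preserves the pretrained semantic prior.
        emb[new_id] = emb[subword_ids[tok]].mean(dim=0)
```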
Training proceeds in two progressive stages:
- Stage One: Focused on short, simpler SVGs (icons, chemistry) to anchor the model’s capability in basic semantic and structural patterns
- Stage Two: Expands to complex, long-sequence illustrations and animated SVGs, with balanced domain sampling to avoid overfitting
This staged curriculum mitigates issues of data imbalance, enables hierarchical learning, and supports scalability to increasingly sophisticated SVG scenarios.
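The staged schedule can be summarized as a simple data-mixture configuration; the domain ratios and sequence-length caps below are hypothetical placeholders rather than the paper's reported values.

```python
# Illustrative two-stage curriculum: short/simple SVGs first, then long
# illustrations and animations with balanced domain sampling.
CURRICULUM = [
    {
        "stage": 1,
        "domains": {"icon": 0.6, "chemistry": 0.4},  # short, simple SVGs
        "max_svg_tokens": 2048,
    },
    {
        "stage": 2,
        "domains": {"icon": 0.25, "chemistry": 0.25,          # keep earlier domains
                    "illustration": 0.25, "animation": 0.25},  # add long/animated SVGs
        "max_svg_tokens": 16384,
    },
]
```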
5. Experimental Evaluation and Performance
InternSVG’s unified approach yields consistent and substantial performance improvements across domains. On SArena-Icon, InternSVG attains an understanding accuracy roughly 8 points higher than top proprietary models (e.g., Claude-Sonnet-4), together with higher editing PSNR and lower generation FID. On illustration tasks, it achieves the best FID among compared methods and competitive CLIPScore. Chemistry SVG generation attains the lowest FID and highest semantic correctness due to robust representation of compound structures, outpacing prior art. In animation tasks, performance approaches the best proprietary results in both Text-to-SANI and Video-to-SANI.
Ablation studies indicate that joint training on understanding, editing, and generation yields further gains: positive task transfer among SVG understanding (QA), nuanced editing, and structured generation is empirically confirmed.
6. Applications and Implications
InternSVG’s unified modeling paradigm supports a spectrum of high-value applications:
- Digital and Web Design: Rapid, interpretable generation and editing of scalable graphics—icons, logos, and interactive illustrations
- Scientific Visualization: Synthesis of accurate chemical diagrams and technical SVGs from text descriptors
- Dynamic Graphics and Interfaces: Generation and manipulation of animated SVGs for user interface and interaction design
A plausible implication is that the unified model and SVG-specialized tokenization may foster further advances in vector-graphics reasoning within MLLMs. The curriculum training and efficient encoding strategies suggest directions for scalable, multi-domain graphic reasoning models.
7. Future Directions
InternSVG establishes a foundation for subsequent research into unified, multimodal SVG models. The scale and diversity of SAgoge, the efficiency of subword-based initialization, and the positive cross-task transfer evidenced in experiments indicate promising avenues for fully integrated graphic code understanding and generation. Further development may extend toward more complex visual domains, higher-order hierarchical reasoning, and integration of dynamic scene understanding. This suggests that rapid iteration and refinement of industry-standard SVG workflows is increasingly feasible with unified MLLM architectures.
InternSVG and its accompanying dataset and benchmark represent a comprehensive solution for modeling, editing, and generating SVGs, addressing fragmentation in previous approaches and establishing a scalable, interpretable, and high-performing foundation for future advancements in vector graphic AI systems (Wang et al., 13 Oct 2025).