StarVector: Generating Scalable Vector Graphics Code from Images and Text (2312.11556v3)

Published 17 Dec 2023 in cs.CV, cs.AI, and cs.CL

Abstract: Scalable Vector Graphics (SVGs) are vital for modern image rendering due to their scalability and versatility. Previous SVG generation methods have focused on curve-based vectorization, lacking semantic understanding, often producing artifacts, and struggling with SVG primitives beyond path curves. To address these issues, we introduce StarVector, a multimodal LLM for SVG generation. It performs image vectorization by understanding image semantics and using SVG primitives for compact, precise outputs. Unlike traditional methods, StarVector works directly in the SVG code space, leveraging visual understanding to apply accurate SVG primitives. To train StarVector, we create SVG-Stack, a diverse dataset of 2M samples that enables generalization across vectorization tasks and precise use of primitives like ellipses, polygons, and text. We address challenges in SVG evaluation, showing that pixel-based metrics like MSE fail to capture the unique qualities of vector graphics. We introduce SVG-Bench, a benchmark across 10 datasets, and 3 tasks: Image-to-SVG, Text-to-SVG generation, and diagram generation. Using this setup, StarVector achieves state-of-the-art performance, producing more compact and semantically rich SVGs.

Understanding StarVector: A New Approach to Scalable Vector Graphics Generation

The Challenge of SVG Generation

Scalable Vector Graphics (SVGs) have become ubiquitous in digital applications, valued for their ability to scale without loss of resolution, their editability, and their compact file sizes. They are especially common in web development, where they enable efficient rendering and file compression, and in graphic design, where they support intricate designs that retain fidelity at any size. Generating complex SVGs, however, has been a longstanding challenge in AI. Traditional methods have struggled with complexity and have often been restricted to oversimplified SVGs that need significant post-processing to achieve the desired results.

Introducing StarVector

StarVector generates unrestricted SVG code directly from pixel-based images. A CLIP image encoder extracts visual representations, and an adapter transforms them into visual tokens. These visual tokens are combined with SVG token embeddings and processed by StarCoder, a code generation LLM (CodeLLM), which predicts the next token in the SVG sequence and thereby aligns visual detail with SVG code elements.
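The data flow described above can be sketched in a few lines of plain Python. This is only a toy illustration under stated assumptions, not the authors' implementation: the encoder and the language model are random stand-ins, the adapter is assumed to be a single linear map (the text only says "an adapter"), and every name, dimension, and the six-entry vocabulary is hypothetical.

```python
import random

random.seed(0)

D_VISION = 8   # hypothetical width of the image-encoder features
D_MODEL = 5    # hypothetical width of the code-LLM embeddings
N_PATCHES = 4  # hypothetical number of visual features per image
VOCAB = ["<svg>", "<circle", "<rect", "<path", "/>", "</svg>"]  # toy SVG vocabulary


def rand_matrix(rows, cols):
    """Random rows x cols matrix standing in for learned weights or features."""
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def project(rows, weight):
    """Multiply each row vector (length in_dim) by weight (in_dim x out_dim)."""
    out_dim = len(weight[0])
    return [
        [sum(x * weight[i][j] for i, x in enumerate(row)) for j in range(out_dim)]
        for row in rows
    ]


def encode_image(image):
    """Stand-in for the CLIP image encoder (assumption: returns random features)."""
    return rand_matrix(N_PATCHES, D_VISION)


ADAPTER_W = rand_matrix(D_VISION, D_MODEL)    # assumed linear adapter
TOKEN_EMB = rand_matrix(len(VOCAB), D_MODEL)  # SVG token embedding table


def build_input_sequence(image, svg_prefix):
    """Visual tokens first, then the embeddings of the SVG tokens emitted so far."""
    visual_tokens = project(encode_image(image), ADAPTER_W)
    svg_embeddings = [TOKEN_EMB[VOCAB.index(tok)] for tok in svg_prefix]
    return visual_tokens + svg_embeddings


def predict_next_token(sequence):
    """Stand-in for the code LLM: scores the vocabulary against the last position.

    A real model would attend over the whole sequence; this toy does not.
    """
    last = sequence[-1]
    scores = [sum(a * b for a, b in zip(last, emb)) for emb in TOKEN_EMB]
    return VOCAB[scores.index(max(scores))]


seq = build_input_sequence(image=None, svg_prefix=["<svg>", "<circle"])
print(len(seq), len(seq[0]))    # sequence length and embedding width
print(predict_next_token(seq))  # some entry of VOCAB; arbitrary with random weights
```

The point of the sketch is the ordering: the adapter maps image features into the same embedding width as the SVG tokens, so both can sit in one sequence that the language model extends token by token.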

Pairing a CLIP image encoder with a code LLM lets StarVector connect vision and language, translating visual elements into SVG code. The model is evaluated on SVG-Bench, an evaluation framework that spans multiple datasets and metrics for measuring the effectiveness of SVG synthesis methods.

Contributions and Results

The contributions of the paper are:

The development of StarVector, a model for SVG generation that integrates a vision encoder with a code LLM.
The establishment of SVG-Bench, a unified evaluation suite that also includes two novel datasets: SVG-Emoji and SVG-Stack.
Comprehensive testing across SVG-Bench, showing that StarVector generalizes to complex SVGs and that pre-training on SVG-Stack improves model performance.

The experiments show that StarVector outperforms current approaches, with significant improvements in visual quality and in handling complex SVGs.

A New Era for SVG Modeling

StarVector represents a significant step in SVG generation technology. By overcoming previous limitations, it opens future research directions such as natural image-to-SVG conversion, text-to-SVG generation, and enhanced editing capabilities. The work establishes a new benchmark for SVG generation and could support the creation of more complex, higher-quality vector images.