StrokeNUWA: Tokenizing Strokes for Vector Graphic Synthesis (2401.17093v1)

Published 30 Jan 2024 in cs.CV and cs.CL

Abstract: To leverage LLMs for visual synthesis, traditional methods convert raster image information into discrete grid tokens through specialized visual modules, while disrupting the model's ability to capture the true semantic representation of visual scenes. This paper posits that an alternative representation of images, vector graphics, can effectively surmount this limitation by enabling a more natural and semantically coherent segmentation of the image information. Thus, we introduce StrokeNUWA, a pioneering work exploring a better visual representation "stroke tokens" on vector graphics, which is inherently visual semantics rich, naturally compatible with LLMs, and highly compressed. Equipped with stroke tokens, StrokeNUWA can significantly surpass traditional LLM-based and optimization-based methods across various metrics in the vector graphic generation task. Besides, StrokeNUWA achieves up to a 94x speedup in inference over the speed of prior methods with an exceptional SVG code compression ratio of 6.9%.

Summary

The paper introduces stroke tokens that retain inherent visual semantics, enabling efficient SVG synthesis and a 94× inference speedup.
The method pairs a VQ-Stroke module, which compresses vector graphics into stroke tokens, with an Encoder-Decoder LLM that generates SVGs, improving CLIP scores and SVG fidelity over baseline methods.
The findings pave the way for advanced applications, suggesting potential extensions into 3D visuals and real-world vectorized image synthesis.

Introduction to StrokeNUWA

Traditional approaches to visual synthesis with LLMs convert raster images into discrete grid tokens using specialized visual modules. This approach faces two challenges: it depends on specific visual modules, each with its own limitations, and grid tokens are not inherently semantic-aware, so they tend to disrupt visual semantics. To address these challenges, the paper turns to vector graphics and represents them as "stroke tokens," which retain richer visual semantics and are argued to complement LLM processing naturally.

Vector Graphics and Stroke Tokens

StrokeNUWA applies this idea to vector graphic synthesis. By operating on "stroke tokens," it aims for a more semantically coherent segmentation of image content. The representation offers three advantages:

Inherently Semantic: each stroke token contains intrinsic visual semantics for more intuitive image segmentation.
Natural Compatibility with LLMs: a vector graphic is produced as a sequence of interconnected strokes, which mirrors the sequential way LLMs generate text.
High Compression: stroke encoding facilitates extreme data size reductions without compromising quality or semantic integrity.
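
To make the idea of a stroke as a unit of meaning concrete, the following is a minimal sketch that splits SVG path data into drawing commands with their coordinates. It is an illustration only and not the paper's preprocessing: the helper name `split_path` is hypothetical, and the sketch assumes absolute `M`, `L`, `C`, and `Z` commands with plain decimal numbers.

```python
import re

# Hypothetical helper for illustration; not taken from the StrokeNUWA code.
# Matches either a command letter (M, L, C, Z) or a decimal number.
_TOKEN = re.compile(r"([MLCZ])|(-?\d+(?:\.\d+)?)")


def split_path(d):
    """Split SVG path data into a list of (command, coordinates) pairs."""
    commands = []
    for letter, number in _TOKEN.findall(d):
        if letter:
            # A command letter starts a new drawing command.
            commands.append((letter, []))
        else:
            if not commands:
                raise ValueError("path data must start with a command letter")
            # A number belongs to the most recent command.
            commands[-1][1].append(float(number))
    return commands


path = "M 10 10 L 20 10 C 25 10 30 15 30 20 Z"
for command, coords in split_path(path):
    print(command, coords)
```

Each resulting pair (a move, a line, a curve, a close) describes one drawing action, in contrast to a grid patch that cuts across shapes at arbitrary pixel boundaries.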

StrokeNUWA's architecture comprises a VQ-Stroke module, which compresses vector graphics into stroke tokens, and an Encoder-Decoder LLM for SVG generation. Beyond the gain in representational quality, the paper reports up to a 94× inference speedup over prior methods and an SVG code compression ratio of 6.9%.
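
The core operation behind vector quantization can be sketched as a nearest-neighbor lookup in a codebook. The example below is a toy illustration under stated assumptions, not the VQ-Stroke module itself: the codebook is hand-written rather than learned, the vectors are two-dimensional placeholders, and the names `quantize` and `compression_ratio` are hypothetical. It also assumes the compression ratio is defined as compressed length divided by original length, which is one plausible reading of the 6.9% figure.

```python
def quantize(vectors, codebook):
    """Map each vector to the index of its nearest codebook entry (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda i: sq_dist(v, codebook[i]))
        for v in vectors
    ]


def compression_ratio(n_compressed, n_original):
    """Compressed length as a fraction of the original length (assumed definition)."""
    return n_compressed / n_original


# Toy codebook and placeholder stroke features; real ones would be learned.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
segments = [(0.1, 0.2), (0.9, 0.1), (0.8, 0.9)]

tokens = quantize(segments, codebook)
print(tokens)  # [0, 1, 3]

# Under the assumed definition, 69 tokens standing in for 1000 units of
# SVG code would correspond to a ratio of 6.9%.
print(f"{compression_ratio(69, 1000):.1%}")  # 6.9%
```

Each stroke feature is replaced by a single integer index, which is what lets a long SVG program shrink to a short token sequence that an LLM can model.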

Empirical Evaluation

Experimentally, StrokeNUWA outperforms traditional LLM-based and optimization-based methods across several metrics in vector graphic synthesis. With stroke tokens, it produces images with higher CLIP scores, indicating closer semantic alignment with text prompts. It also outperforms the baselines in SVG fidelity and efficiency.
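
CLIP-based scores are generally built on the cosine similarity between a text embedding and an image embedding. The sketch below shows only that similarity computation, with made-up placeholder vectors standing in for real CLIP embeddings. The function name is hypothetical, and the exact scaling or variant used in the paper is not specified here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Placeholder embeddings; real ones would come from a CLIP text and image encoder.
text_embedding = [0.2, 0.7, 0.1]
image_embedding = [0.25, 0.6, 0.05]

print(round(cosine_similarity(text_embedding, image_embedding), 3))
```

A higher value means the rendered SVG and the prompt point in a more similar direction in the shared embedding space, which is the sense in which a higher score indicates closer alignment with the text.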

Insights and Future Directions

The introduction of stroke tokens marks a shift in how LLMs can understand and generate visual content. The approach aligns vector graphic synthesis with the sequential strengths of LLMs, preserving semantic integrity while achieving higher efficiency than traditional methods.

In future work, further refinement of stroke token quality is anticipated, which could broaden the application scope to tasks such as SVG understanding, 3D visuals, and real-world image synthesis in a vectorized format. The paper opens a line of research into visual tokenization methods better suited to LLMs and into new application domains.

Authors (11)

Tweets

https://twitter.com/_akhaliq/status/1752553683710054521

https://twitter.com/arankomatsuzaki/status/1752509675638038738

https://twitter.com/fly51fly/status/1752826024797360381

https://twitter.com/andrew_n_carr/status/1759952403561144692

https://twitter.com/kashifcreations/status/1752591288262914166

https://twitter.com/gm8xx8/status/1752517088277622894