Rendering-Aware Reinforcement Learning for Vector Graphics Generation
(2505.20793v1)
Published 27 May 2025 in cs.CV and cs.AI
Abstract: Scalable Vector Graphics (SVG) offer a powerful format for representing visual designs as interpretable code. Recent advances in vision-language models (VLMs) have enabled high-quality SVG generation by framing the problem as a code generation task and leveraging large-scale pretraining. VLMs are particularly suitable for this task as they capture both global semantics and fine-grained visual patterns, while transferring knowledge across vision, natural language, and code domains. However, existing VLM approaches often struggle to produce faithful and efficient SVGs because they never observe the rendered images during training. Although differentiable rendering for autoregressive SVG code generation remains unavailable, rendered outputs can still be compared to original inputs, enabling evaluative feedback suitable for reinforcement learning (RL). We introduce RLRF (Reinforcement Learning from Rendering Feedback), an RL method that enhances SVG generation in autoregressive VLMs by leveraging feedback from rendered SVG outputs. Given an input image, the model generates SVG roll-outs that are rendered and compared to the original image to compute a reward. This visual fidelity feedback guides the model toward producing more accurate, efficient, and semantically coherent SVGs. RLRF significantly outperforms supervised fine-tuning, addressing common failure modes and enabling precise, high-quality SVG generation with strong structural understanding and generalization.
Summary
The paper introduces RLRF, a two-stage approach that integrates rendering feedback into SVG generation via supervised fine-tuning and reinforcement learning.
It leverages a composite reward function—combining pixel reconstruction, semantic similarity, and code efficiency—to enhance visual fidelity and reduce token length.
Experiments on Im2SVG and Text2SVG tasks demonstrate significant improvements in metrics and robustness to out-of-distribution inputs.
The paper "Rendering-Aware Reinforcement Learning for Vector Graphics Generation" (2505.20793) introduces RLRF (Reinforcement Learning from Rendering Feedback), a novel method to improve the performance of autoregressive vision-language models (VLMs) in generating Scalable Vector Graphics (SVG) from images or text. Existing VLM approaches typically frame SVG generation as a code generation task and are trained using supervised learning on tokenized SVG sequences. While this approach is effective for learning syntactic correctness and basic visual patterns, it suffers from a critical limitation: the model never observes or evaluates the rendered visual output of the generated SVG code during training. This lack of rendering awareness leads to common failure modes like hallucination, looping, and poor generalization to complex or out-of-distribution inputs, as token-level losses do not capture visual fidelity or structural coherence in the rendered image.
The core challenge is that the SVG rendering process, especially in the context of autoregressive token generation, is non-differentiable. Unlike differentiable rasterizers that work with continuous primitive parameters, VLMs generate discrete tokens one by one, making gradient propagation through the rendering step impossible. RLRF addresses this by leveraging the rendered output as evaluative feedback suitable for reinforcement learning (RL).
The proposed approach utilizes a two-stage training process. The first stage involves Supervised Fine-Tuning (SFT) on paired image-SVG data (Im2SVG) or text-SVG data, adapting a base VLM (like Qwen2.5-VL) to the SVG generation domain. This stage minimizes the negative log-likelihood of the ground truth SVG tokens given the input condition (image or text), effectively teaching the model basic SVG syntax and structure. The objective for SFT is:
$$\mathcal{L}_{\text{SFT}}(\theta) = \mathbb{E}_{x_c \sim \mathcal{D}}\left[-\log p_\theta(x_s \mid x_c)\right]$$
where $x_c$ is the input condition (image or text), $x_s$ is the ground truth SVG token sequence, and $\theta$ are the model parameters.
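As a point of reference, here is a minimal PyTorch sketch of this SFT objective, assuming a Hugging Face-style VLM interface in which condition and padding positions in `labels` are masked with -100 (the batch field names are illustrative, not from the paper):

```python
import torch.nn.functional as F

def sft_loss(model, batch):
    """Negative log-likelihood of ground truth SVG tokens given the image condition.
    Positions belonging to the condition or padding carry label -100 and are ignored."""
    outputs = model(
        input_ids=batch["input_ids"],
        attention_mask=batch["attention_mask"],
        pixel_values=batch["pixel_values"],
    )
    # Shift so that position t predicts token t+1.
    logits = outputs.logits[:, :-1, :]
    targets = batch["labels"][:, 1:]
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=-100,  # only SVG tokens contribute to the loss
    )
```

As noted in the implementation details later, the vision encoder is kept frozen, so only the language-model parameters are updated by this loss.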
The second stage, RLRF, refines the SFT-trained model using RL. The model's conditional distribution $p_\theta(\cdot \mid x_c)$ is treated as a stochastic policy. For a given input $x_c$, the model samples multiple SVG "rollouts" $o \sim p_\theta(\cdot \mid x_c)$. Each sampled SVG rollout is then rendered into a pixel image, which is compared against the input image (for Im2SVG) or evaluated against the text prompt (for Text2SVG) to compute a scalar reward $R(x_c, o)$. This reward signal provides rendering-aware feedback to guide the model's learning. The RL objective is to maximize the expected reward while maintaining proximity to the SFT policy to prevent catastrophic forgetting:
$$\mathcal{J}(\theta) = \mathbb{E}_{x_c \sim \mathcal{D},\, o \sim p_\theta(\cdot \mid x_c)}\!\left[R(x_c, o)\right] - \beta\, D_{\mathrm{KL}}\!\left(p_\theta(\cdot \mid x_c)\,\big\|\, p_{\theta_{\text{sft}}}(\cdot \mid x_c)\right)$$
where $p_{\theta_{\text{sft}}}$ is the frozen SFT model, and $\beta$ is a KL regularization coefficient. The paper adopts Group Relative Policy Optimization (GRPO) to optimize this objective, which uses group-centered advantages within a batch to reduce variance without requiring a separate value network.
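A minimal sketch of that group-relative advantage computation follows; the tensor layout and the stabilizing epsilon are assumptions rather than details from the paper:

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: [num_inputs, rollouts_per_input] scalar rewards, one per sampled SVG.
    Each rollout's advantage is its reward centered and scaled by the statistics of
    the other rollouts for the same input, so no learned value network is needed."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)
```

These advantages then weight a PPO-style clipped token-level policy-gradient loss; because the baseline comes from the rollout group itself, no critic has to be trained alongside the policy.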
A key contribution is the design of a composite reward function that integrates multiple signals:
Image Reconstruction Rewards: Measure pixel-level similarity between the rendered SVG and the input image using metrics like L2 distance. An edge-aware variant (L2 Canny) is also used to emphasize structural alignment.
$$R_{\text{img}} = \operatorname{clip}\!\left(1 - \frac{1}{N}\left\|I_{\text{in}}^{\text{norm}} - I_{\text{pred}}^{\text{norm}}\right\|_2^2,\; -1,\; 1\right)$$
where $I_{\text{in}}^{\text{norm}}$ and $I_{\text{pred}}^{\text{norm}}$ are the normalized input and rendered images, and $N$ is the number of pixels.
Semantic Similarity Rewards: Assess high-level perceptual and semantic alignment using models like DreamSim [fu2023dreamsim] or CLIP [radford2021learning]. For Text2SVG, CLIP similarity between the text prompt and rendered image is used, along with a VLM-based judge for more nuanced evaluation of text-image alignment and aesthetics.
Code Efficiency Rewards: Penalize excessively long or redundant SVG code relative to ground truth, encouraging compactness.
$$R_{\text{len}} = 1 - \left(\frac{\max\!\left(0,\; L_{\text{pred}} - \tfrac{L_{\text{gt}}}{2}\right)}{L_{\text{gt}}}\right)^2$$
This penalizes lengths beyond half the ground truth length.
The final reward is a weighted sum of these components: $R_{\text{total}} = \sum_{i=1}^{K} w_i R_i$. CairoSVG is used for rendering; a code sketch of this reward computation follows below.
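The sketch covers the rendering step plus the pixel and length terms, assuming CairoSVG's `svg2png` and a fixed 224-pixel canvas (helper names, the canvas size, and the weights are illustrative; DreamSim or CLIP scoring of the same rendered image would supply the semantic terms):

```python
from io import BytesIO

import cairosvg
import numpy as np
from PIL import Image

def render_svg(svg_code: str, size: int = 224) -> np.ndarray:
    """Rasterize SVG code at a fixed resolution, independent of its declared viewBox."""
    png = cairosvg.svg2png(bytestring=svg_code.encode(),
                           output_width=size, output_height=size)
    img = Image.open(BytesIO(png)).convert("RGB")
    return np.asarray(img, dtype=np.float32) / 255.0  # normalize to [0, 1]

def image_reward(target: np.ndarray, pred: np.ndarray) -> float:
    # R_img = clip(1 - mean squared pixel error, -1, 1)
    return float(np.clip(1.0 - np.mean((target - pred) ** 2), -1.0, 1.0))

def length_reward(len_pred: int, len_gt: int) -> float:
    # R_len penalizes code longer than half the ground truth length.
    overflow = max(0.0, len_pred - len_gt / 2)
    return 1.0 - (overflow / len_gt) ** 2

def total_reward(components: dict[str, float], weights: dict[str, float]) -> float:
    # R_total = sum_i w_i * R_i
    return sum(weights[name] * value for name, value in components.items())
```

In the Im2SVG setting, `target` is the input raster resized to the same canvas and `pred` is the rendered rollout; the edge-aware variant would apply the same L2 comparison to Canny edge maps of both images.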
Experiments were conducted primarily on the Im2SVG task using Qwen2.5-VL models (3B and 7B) and StarVector-1B, fine-tuned on subsets of the SVG-Stack dataset [rodriguez2025starvector]. A challenging test set, SVG-Stack-Hard, was curated for evaluation. Results demonstrate that RLRF significantly outperforms the SVG-SFT baseline and other open/closed-source VLMs on metrics like MSE, SSIM, DINO, and LPIPS, while also improving code efficiency (reducing token length). For instance, RLRF reduced MSE for Qwen2.5VL-7B from 8.60 (SFT) to 4.01, while improving SSIM, DINO, and LPIPS.
The paper also explores Text2SVG using Qwen3-8B and text-only caption datasets (Flickr30k, MM-Icons) with VLM-based rewards, showing RLRF enables the model to generalize to generating SVGs from text prompts without direct SVG supervision.
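For the CLIP text-image term used in this setting, here is a hedged sketch built on the Hugging Face transformers CLIP API; the checkpoint name is an assumed stand-in, and the paper's VLM-based judge reward is not reproduced here:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed checkpoint; any CLIP variant exposing the same interface would work.
CLIP_NAME = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(CLIP_NAME).eval()
processor = CLIPProcessor.from_pretrained(CLIP_NAME)

@torch.no_grad()
def clip_text_image_reward(prompt: str, rendered: Image.Image) -> float:
    """Cosine similarity between the text prompt and the rendered SVG rollout."""
    inputs = processor(text=[prompt], images=rendered, return_tensors="pt", padding=True)
    out = model(**inputs)
    text = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    image = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    return float((text * image).sum(dim=-1))
```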
Ablation studies highlight the importance of the two-stage training (SFT is crucial for building initial SVG proficiency), diverse rollouts (higher temperature and more rollouts improve exploration), and the composite reward function (combining pixel, semantic, and length rewards yields the best overall performance). The paper also found that removing the KL divergence term improved training stability and reward progression, suggesting that the rendering-based rewards provide sufficient regularization. Potential reward hacking instances (like exploiting small viewboxes or using the <text> primitive) were identified and mitigated by controlling rendering resolution and preprocessing SVGs before evaluation.
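A sketch of what such preprocessing might look like, assuming ElementTree-based filtering of `<text>` elements (the paper's exact filtering rules are not specified here); rendering at a fixed resolution, as in the `render_svg` sketch above, addresses the small-viewbox hack:

```python
import xml.etree.ElementTree as ET

# Keep the default SVG namespace unprefixed when re-serializing.
ET.register_namespace("", "http://www.w3.org/2000/svg")

def strip_text_elements(svg_code: str) -> str:
    """Drop <text> elements so the policy cannot satisfy the pixel reward with fonts."""
    root = ET.fromstring(svg_code)
    svg_text = "{http://www.w3.org/2000/svg}text"
    for parent in root.iter():
        for child in list(parent):
            if child.tag in (svg_text, "text"):
                parent.remove(child)
    return ET.tostring(root, encoding="unicode")
```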
RLRF models show strong generalization to out-of-distribution datasets like SVG-Emoji and SVG-Fonts, outperforming baselines trained only with SFT, indicating that rendering-aware training instills a deeper understanding of SVG structure and visual representation.
The implementation involves using standard VLM architectures, leveraging libraries like LLaMA-Factory [zheng2024llamafactory] for SFT, and building RL training on frameworks like EasyR1 [zheng2025easyr1] and VERL [sheng2024hybridflow]. vLLM [kwon2023efficient] is used for efficient rollout generation. Training uses AdamW, bf16 precision, gradient clipping, and FSDP. The vision encoder is frozen during SFT and RLRF. A dynamic maximum length schedule helps manage unproductive rollouts.
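On the rollout side, a minimal vLLM sketch looks roughly as follows; the checkpoint (standing in for the SFT model), rollout count, and sampling settings are illustrative, and the multimodal image-prompt plumbing is omitted:

```python
from vllm import LLM, SamplingParams

# Illustrative setup: 8 diverse rollouts per prompt at temperature 1.0.
llm = LLM(model="Qwen/Qwen2.5-VL-7B-Instruct", dtype="bfloat16")
sampling = SamplingParams(n=8, temperature=1.0, top_p=0.95, max_tokens=4096)

prompts = ["Generate SVG code that reproduces the attached image."]  # placeholder text prompt
outputs = llm.generate(prompts, sampling)
rollouts = [[cand.text for cand in req.outputs] for req in outputs]  # n candidate SVGs per input
```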
Limitations include dependence on model context length (though improving with recent models), potential loss of general instruction-following ability due to specialization, and the inefficiency of GRPO training bottlenecked by rollout generation.
Overall, RLRF provides a practical and effective method for integrating rendering feedback into the training of autoregressive VLMs for vector graphics generation, leading to models that produce more visually faithful, semantically coherent, and efficient SVG code. The approach is presented as potentially generalizable to other inverse rendering code generation tasks.