- The paper introduces a comprehensive benchmark, SGP-GenBench, to systematically evaluate LLMs' SVG generation capabilities.
- The paper demonstrates that a reinforcement learning framework with cross-modal verifiable rewards significantly improves compositional accuracy and semantic alignment.
- The paper highlights that RL-tuned open-source LLMs can achieve near state-of-the-art performance in generating detailed, compositional SVG programs.
Symbolic Graphics Programming with LLMs: Methods, Benchmarks, and Reinforcement Learning
Introduction
This work systematically investigates the capacity of LLMs to generate symbolic graphics programs (SGPs), specifically focusing on scalable vector graphics (SVG) as the target representation. The paper introduces a comprehensive benchmark (SGP-GenBench) for evaluating SGP generation, analyzes the performance gap between open-source and proprietary LLMs, and proposes a reinforcement learning (RL) framework with cross-modal verifiable rewards to enhance SVG generation quality. The research demonstrates that RL post-training, guided by vision-language foundation models, can substantially improve the SVG generation capabilities of open-source LLMs, closing much of the gap with state-of-the-art proprietary systems.
Figure 1: Qualitative results of symbolic graphics programming. RL with verifiable reward enables Qwen-2.5-7B to acquire compositional drawing ability and produce semantically accurate symbolic graphics programs.
Symbolic Graphics Programming: Motivation and Properties
Symbolic graphics programming is formulated as the task of generating a formal, structured graphics program (e.g., SVG code) from a natural language prompt. Unlike pixel-based text-to-image (T2I) generation, SGP generation is inherently interpretable and verifiable, as the output code can be deterministically rendered and analyzed for semantic correctness.
SGPs possess two critical properties:
- Parametric Precision: SVGs encode geometry and appearance via explicit parameters (coordinates, radii, colors, transforms), enabling fine-grained control over spatial and visual attributes. This allows LLMs with strong symbolic reasoning to generate graphics with high geometric fidelity, a capability that remains challenging for T2I models (see the sketch after Figure 2).
- Procedural Compositionality: SVGs are constructed hierarchically from primitives, supporting modular scene composition and manipulation. This procedural nature facilitates the decomposition of complex scenes into controllable elements, supporting compositional reasoning and structured scene synthesis.
Figure 2: (a) SGPs provide precise symbolic controllability, e.g., generating a regular octagon with exact geometry, outperforming T2I models. (b) Procedural generation enables hierarchical scene construction.
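To make the parametric-precision point above concrete, here is a minimal sketch (not from the paper) that computes exact vertex coordinates for a regular octagon and emits them as an SVG polygon; the sizes and styling are arbitrary illustrative choices.

```python
import math

def regular_polygon_svg(n_sides: int = 8, radius: float = 80.0,
                        center: float = 100.0, fill: str = "#4a90d9") -> str:
    """Emit an SVG document containing a regular n-gon with exact vertex coordinates."""
    # Rotate by half a step so a flat edge sits at the top, like a canonical octagon.
    offset = math.pi / n_sides
    points = " ".join(
        f"{center + radius * math.cos(2 * math.pi * k / n_sides + offset):.3f},"
        f"{center + radius * math.sin(2 * math.pi * k / n_sides + offset):.3f}"
        for k in range(n_sides)
    )
    return (
        '<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 200 200">'
        f'<polygon points="{points}" fill="{fill}" stroke="black" stroke-width="2"/>'
        "</svg>"
    )

print(regular_polygon_svg())  # exact geometry, unlike sampled pixels from a T2I model
```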
SGP-GenBench: Benchmarking SGP Generation
SGP-GenBench is introduced as a large-scale benchmark to systematically evaluate LLMs' SGP generation capabilities. It comprises three components:
- Scene Generation (COCO-val): 1,024 prompts from MS-COCO captions, targeting complex multi-object scenes.
- Object Generation (SGP-Object-val): 930 prompts for single-object fidelity.
- Compositional Generation (SGP-CompBench): 3,200 prompts assessing attribute binding, spatial relations, and numeracy.
Evaluation metrics include CLIP-Score, DINO-Score, VQA-Score, and HPS v2 for semantic fidelity, as well as judge-model-based compositionality scores.
Figure 3: Overview of SGP-GenBench and representative prompt examples.
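The benchmark's exact scoring code is not reproduced here; as a rough sketch of how a CLIP-Score-style metric could be computed for a rendered SVG, assuming `cairosvg` and Hugging Face `transformers` with a standard public CLIP checkpoint (the actual benchmark may use different models and scaling):

```python
import io

import cairosvg                      # renders SVG markup to PNG bytes
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumption: a standard public CLIP checkpoint; the benchmark may use a different scorer.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(svg_code: str, prompt: str) -> float:
    """Render the SVG and return cosine similarity between image and text embeddings."""
    png_bytes = cairosvg.svg2png(bytestring=svg_code.encode("utf-8"),
                                 output_width=224, output_height=224)
    image = Image.open(io.BytesIO(png_bytes)).convert("RGB")
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return float((img_emb * txt_emb).sum())
```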
Benchmark Results
- Proprietary LLMs (e.g., Claude 3.7, Gemini 2.5 Pro) consistently outperform open-source models across all metrics.
- Performance on SGP-GenBench correlates with general code generation ability.
- Open-source models exhibit significant deficits in both syntactic validity and semantic alignment of SVGs.
Reinforcement Learning with Cross-Modal Verifiable Rewards
To address the performance gap, the paper proposes an RL framework that leverages cross-modal alignment rewards derived from vision-language foundation models (e.g., SigLIP, DINOv2). The RL objective is to maximize the expected reward, which is a function of:
- Format Validity: Ensures syntactic correctness and renderability of SVG code.
- Text-Image Alignment: Measures semantic alignment between the prompt and the rendered image using contrastive models (e.g., SigLIP).
- Image-Image Alignment: When reference images are available, DINOv2-based similarity is used to assess visual fidelity.
The RL algorithm (GRPO, a critic-free PPO variant) samples multiple SVG candidates per prompt, scores them with these rewards, and updates the policy using group-relative advantages, as sketched below.
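The paper's exact reward weighting and GRPO hyperparameters are not given here; the following minimal sketch shows the general shape of the reward and the group-relative advantage computation, where `render_and_validate` and `alignment_score` are hypothetical helpers standing in for the SVG verifier/renderer and a SigLIP-style similarity model.

```python
import torch

def cross_modal_reward(svg_code: str, prompt: str) -> float:
    """Combine format validity with text-image alignment into a scalar reward.

    `render_and_validate` and `alignment_score` are hypothetical stand-ins for the
    SVG renderer/verifier and a SigLIP-style image-text similarity model.
    """
    image, is_valid = render_and_validate(svg_code)
    if not is_valid:
        return 0.0                         # invalid or unrenderable SVG earns no reward
    return alignment_score(image, prompt)  # e.g., SigLIP image-text similarity

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages: standardize each candidate's reward within its group.

    `rewards` has shape (num_prompts, group_size): several SVG samples per prompt.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)  # no learned critic, unlike standard PPO
```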
Figure 4: RL pipeline: SVGs are sampled, rendered, scored for alignment, and used to update the model via policy gradients.
Key Advantages
- No Ground Truth SGPs Required: The RL approach does not require paired image-program data, enabling scalable training using only captions and images.
- Implicit Distillation: The LLM internalizes visual priors and semantic grounding from vision foundation models, improving cross-modal alignment.
- Rule-Based Verifiability: Format and semantic correctness are enforced via explicit verifiers, reducing reward hacking.
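As an illustration of what a rule-based format verifier might look like (the paper does not specify its implementation; `cairosvg` is an assumed renderer), a minimal check could require both well-formed XML and successful rendering:

```python
import xml.etree.ElementTree as ET

import cairosvg

def svg_format_valid(svg_code: str) -> bool:
    """Return True only if the code parses as XML with an <svg> root and renders cleanly."""
    try:
        root = ET.fromstring(svg_code)
    except ET.ParseError:
        return False
    if not root.tag.endswith("svg"):      # the tag may carry the SVG namespace prefix
        return False
    try:
        cairosvg.svg2png(bytestring=svg_code.encode("utf-8"))
    except Exception:                     # any rendering failure counts as invalid
        return False
    return True
```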
Experimental Results
RL post-training on Qwen-2.5-7B yields substantial improvements across the scene, object, and compositional benchmarks, detailed below.
Compositional Generalization
- Attribute Binding: RL-tuned models excel at color and shape binding but lag on texture, reflecting SVG's limitations in encoding complex textures.
- Spatial Relations: 2D and implicit relations are handled well; 3D relations remain challenging due to SVG's rendering order constraints.
- Numeracy: Models achieve high accuracy in total object counts but struggle with precise per-item counts, indicating ongoing challenges in fine-grained compositional reasoning.
Analysis of Training Dynamics and Emergent Behaviors
Best-of-N Sampling vs. RL
RL training yields improvements that cannot be matched by naive Best-of-N sampling within practical compute budgets. The RL-trained model's Best-of-N curve is consistently higher, and matching its performance would require sampling orders of magnitude more candidates.
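For reference, the Best-of-N baseline being compared against can be sketched as follows, where `generate_svg` and `reward` are hypothetical stand-ins for the frozen base model's sampler and the verifiable reward:

```python
def best_of_n(prompt: str, n: int) -> str:
    """Sample n SVG candidates from the frozen base model and keep the highest-reward one."""
    candidates = [generate_svg(prompt) for _ in range(n)]
    return max(candidates, key=lambda svg: reward(svg, prompt))
```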

Figure 6: Best-of-N metrics vs. logN: RL-trained models outperform Best-of-N sampling baselines by a wide margin.
Evolution of Generation Strategies
- Increased Complexity: Both the number of SVG elements and code length increase during training, indicating richer scene decomposition and more detailed outputs.

Figure 7: Evolution of elements count: RL induces longer, more complex SVG programs.
- Finer Decomposition: The model learns to break down complex objects into more granular components, improving semantic accuracy and spatial precision.
Figure 8: Early-stage models use coarse decomposition; later-stage models achieve fine-grained, semantically accurate part segmentation.
- Contextual Optional Details: RL-trained models begin to add plausible, unprompted details (e.g., sprinkles on a cake, waves at a beach), enhancing scene naturalness and completeness.
Figure 9: Optional details (e.g., sprinkles, waves) are generated without explicit prompting, improving scene realism.
- Comment-to-Element Ratio: The ratio of explanatory comments to SVG elements increases, reflecting more explicit reasoning and decomposition (one way to measure this ratio is sketched after Figure 10).

Figure 10: Comment-to-element ratio increases with training, indicating finer-grained decomposition.
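One plausible way to measure the comment-to-element ratio discussed above (the exact counting rules used in the paper are not specified, so the tag list below is an assumption):

```python
import re

# Assumption: "elements" are counted as opening tags of common SVG drawing primitives;
# the paper's exact counting rules may differ.
ELEMENT_TAGS = ("path", "rect", "circle", "ellipse", "line", "polyline", "polygon", "g")

def comment_to_element_ratio(svg_code: str) -> float:
    """Ratio of XML comments to drawable SVG elements in a generated program."""
    comments = len(re.findall(r"<!--.*?-->", svg_code, flags=re.DOTALL))
    elements = sum(len(re.findall(rf"<{tag}\b", svg_code)) for tag in ELEMENT_TAGS)
    return comments / max(elements, 1)
```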
Figure 12: Model draws objects extending outside the viewBox; renderer clips excess, yielding neatly framed images.
Reward Model Effects
- Color Preferences: CLIP-based rewards bias the model toward canonical color names, while SigLIP-based rewards induce more nuanced, low-saturation color choices, suggesting that reward model selection can influence stylistic properties of generated graphics.
Ablation Studies
- Embedding Model Choice: SigLIP outperforms CLIP for text-image alignment, but larger encoders do not always yield better results.
- Chain-of-Thought Prompting: Explicit CoT is not essential for quantitative performance; SVG code itself serves as an implicit reasoning trace.
- RL Algorithm: GRPO outperforms standard PPO in alignment and semantic accuracy, with PPO yielding higher diversity.
- Training Data Mixture: Balanced mixtures of natural-image and SVG-caption data yield the best generalization; single-domain training leads to overfitting.
Implications and Future Directions
Practical Implications
- Open-Source Model Enhancement: RL with cross-modal verifiable rewards enables open-source LLMs to approach proprietary model performance in SGP generation, democratizing access to high-quality symbolic graphics synthesis.
- Data Efficiency: The approach obviates the need for large-scale paired program-image datasets, facilitating scalable training on internet-scale data.
- Interpretable Cross-Modal Grounding: SGP generation provides a transparent lens for analyzing and improving cross-modal alignment in LLMs.
Theoretical Implications
- Emergent Compositionality: RL induces emergent behaviors such as finer decomposition and contextual detail generation, suggesting that verifiable reward signals can elicit structured reasoning strategies in LLMs.
- Reward Model Influence: The choice of vision-language reward model can shape not only semantic alignment but also stylistic properties of generated outputs.
Future Directions
- Adaptive Curriculum Learning: Dynamic adjustment of prompt complexity and reward shaping could further enhance compositional generalization.
- Transfer to Broader Reasoning Tasks: Investigating whether improved SGP generation skills transfer to other domains requiring structured reasoning.
- Internal Process Analysis: Probing the evolution of internal representations and reasoning traces during RL training.
Conclusion
This paper establishes a rigorous framework for evaluating and improving symbolic graphics programming in LLMs. By introducing SGP-GenBench and leveraging RL with cross-modal verifiable rewards, the research demonstrates that open-source LLMs can be post-trained to generate semantically accurate, compositional, and detailed SVGs, rivaling proprietary systems. The findings highlight the value of interpretable, verifiable program synthesis as a testbed for cross-modal grounding and structured reasoning in LLMs, and open new avenues for scalable, data-efficient enhancement of generative models in vision-language domains.