Can LLMs Understand Symbolic Graphics Programs?
Introduction
The paper "Can LLMs Understand Symbolic Graphics Programs?" introduces a novel task for evaluating the capability of LLMs to interpret and reason about symbolic graphics programs. These programs, which procedurally generate visual data, pose a challenge distinct from conventional text or code because their semantics are inherently visual. The task requires an LLM to answer semantic questions about the rendered image using only the symbolic program as input, which demands a form of "visual imagination."
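To make the setup concrete, the sketch below shows a hypothetical instance of the task (not taken from the benchmark itself): the model receives only the program text and a question, never the rendered image.

```python
# Hypothetical example: the LLM sees only the program text below, never the
# rendering, and must answer a semantic question about the image it produces.
svg_program = """
<svg xmlns="http://www.w3.org/2000/svg" width="100" height="100">
  <circle cx="50" cy="40" r="20" fill="yellow"/>                <!-- head -->
  <circle cx="43" cy="35" r="3"  fill="black"/>                 <!-- left eye -->
  <circle cx="57" cy="35" r="3"  fill="black"/>                 <!-- right eye -->
  <path d="M 40 48 Q 50 58 60 48" stroke="black" fill="none"/>  <!-- smile -->
</svg>
"""

question = (
    "What does the rendered image depict?\n"
    "A) A house  B) A smiling face  C) A car  D) A tree"
)

# Answering correctly (here: B) requires "imagining" the rendered output
# purely from the symbolic description.
prompt = f"{svg_program}\n{question}\nAnswer with a single letter."
```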
Methodology
The authors propose a task that measures an LLM's ability to answer questions about the visual content generated by symbolic graphics programs. The task is realized through two main components: a benchmark called SGP-Bench and a novel finetuning method named Symbolic Instruction Tuning (SIT).
Benchmark Creation
SGP-Bench evaluates the semantic understanding of symbolic graphics programs across two program families:
- SVG Programs: Scalable Vector Graphics, which describe 2D vector images declaratively in XML.
- CAD Programs: Computer-aided design programs that construct 2D sketches and 3D objects.
Two types of evaluations are incorporated:
- Semantic Understanding: This involves answering semantic questions based on symbolic program input.
- Semantic Consistency: This measures whether answers remain stable when the symbolic programs undergo semantics-preserving spatial transformations (translations and rotations); a minimal sketch of such a check follows below.
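Below is a minimal sketch of how such a consistency check could be implemented, assuming a hypothetical `ask_llm(prompt) -> str` helper that queries the model; the benchmark's exact transformation and evaluation procedure may differ.

```python
import re

def translate_svg(svg_text: str, dx: float, dy: float) -> str:
    """Wrap the SVG body in a group that applies a translation (illustrative only)."""
    match = re.search(r"(<svg[^>]*>)(.*)(</svg>)", svg_text, re.S)
    head, body, tail = match.groups()
    return f'{head}<g transform="translate({dx},{dy})">{body}</g>{tail}'

def is_consistent(svg_text: str, question: str, ask_llm) -> bool:
    """The translation preserves semantics, so the answer should not change."""
    original = ask_llm(f"{svg_text}\n{question}")
    shifted = ask_llm(f"{translate_svg(svg_text, 10, -5)}\n{question}")
    return original.strip() == shifted.strip()
```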
Evaluation
The authors benchmark a range of commercial and open-source LLMs, analyzing their semantic understanding and the consistency of their answers under transformations. The benchmark comprises two datasets:
- SVG Dataset: Contains 1,085 programs with questions covering different semantic aspects such as shape, color, and reasoning.
- CAD Dataset: Comprises 2,400 programs from various source datasets, formatted as 3D, 3D_reconstruction, and 2D sketches; a schematic example of a command-sequence CAD program follows below.
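For intuition, the following is a purely hypothetical command-sequence CAD program in a sketch-and-extrude style; the benchmark's actual program format may differ.

```python
# Hypothetical CAD-style program, loosely modeled on sketch-and-extrude
# command sequences; not the benchmark's actual format.
cad_program = [
    {"command": "sketch_circle",    "center": (0.0, 0.0),   "radius": 10.0},
    {"command": "sketch_rectangle", "corner": (-2.0, -2.0), "width": 4.0, "height": 4.0},
    {"command": "extrude",          "distance": 5.0,        "operation": "new_body"},
]

# A benchmark-style question would then ask, for example, what 3D object this
# program produces, with the answer given as a multiple-choice letter.
```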
Results
The results showcase the following key points:
- Scaling Law Observance: Within the same model family, larger models perform better, consistent with expected scaling behavior.
- Performance Variability: Proprietary models like GPT-4 demonstrate superior performance compared to open-source counterparts.
- Understanding Capabilities: Even the leading models fall short, particularly on challenging subsets such as SGP-MNIST, where accuracy on SVG programs that render MNIST digits remains close to chance. This underscores the difficulty of interpreting symbolic programs.
Table: Summary of LLM Performance on SGP-Bench
| Model | SVG Avg | CAD Avg | SVG Invariance | CAD Invariance |
|---|---|---|---|---|
| GPT-3.5 Turbo | 0.498 | 0.576 | 0.897 | 0.870 |
| GPT-4 Turbo | 0.609 | 0.716 | 0.867 | 0.835 |
| GPT-4o Mini | 0.585 | 0.659 | 0.881 | 0.852 |
| GPT-4o | 0.633 | 0.733 | 0.878 | 0.844 |
| Claude 3.5 | 0.674 | 0.742 | 0.903 | 0.870 |
Symbolic Instruction Tuning (SIT)
To improve the semantic understanding of LLMs, the authors introduce SIT. This technique finetunes LLMs on an instruction-following dataset generated with vision-language models such as GPT-4o. The SIT dataset comprises 55K symbolic programs paired with detailed semantic descriptions of their rendered images.
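A minimal sketch of how one such training example could be assembled is shown below, assuming two hypothetical helpers: `render_program(program)`, which rasterizes the symbolic program, and `describe_image(image)`, which has a vision-language model such as GPT-4o caption the rendering. This is an illustration, not the authors' exact pipeline.

```python
def build_sit_example(program: str, render_program, describe_image) -> dict:
    """Pair a symbolic program with a semantic description of its rendering."""
    image = render_program(program)        # rasterize the symbolic program
    description = describe_image(image)    # vision-LLM writes a semantic description
    return {
        "instruction": "Describe the image rendered by this symbolic graphics program.",
        "input": program,
        "output": description,             # supervision target for instruction tuning
    }
```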
Table: Impact of SIT on Model Performance
| Dataset Size | Accuracy (Llama3-8B, %) | Accuracy (Gemma-7B, %) |
|---|---|---|
| Original | 43.20 | 39.33 |
| SIT-10k | 48.16 | 45.60 |
| SIT-25k | 51.43 | 46.87 |
| SIT-40k | 45.62 | 45.21 |
| SIT-55k | 40.99 | 47.28 |
Implications and Future Directions
The research opens several new avenues:
- Visual Reasoning: It highlights the potential for LLMs to perform visual reasoning tasks without direct visual inputs.
- Benchmark Rigor: SGP-Bench provides a robust means of separating models by their proficiency in symbolic program understanding.
- Future Research: Further exploration into combining different symbolic representations and improving SIT datasets could significantly enhance LLM capabilities.
While current models show promise, considerable improvements are needed to approach human-like understanding. Future work might focus on enriching training data, refining finetuning procedures, and investigating the mechanisms underlying LLM reasoning.
Conclusion
The paper underscores a critical evaluation area for LLMs by introducing a benchmark tailored to symbolic graphics programs. Despite the progress, the paper reveals that current LLMs have significant room for growth in understanding symbolic visual representations. Symbolic Instruction Tuning presents a promising direction, showing measurable improvements in LLM performance on this challenging task. As such, this research forms a foundational step towards enhancing the visual reasoning abilities of LLMs.