Can LLMs Understand Symbolic Graphics Programs?
Introduction
The paper "Can LLMs Understand Symbolic Graphics Programs?" introduces a novel task for evaluating the capability of LLMs to interpret and reason about symbolic graphics programs. These programs, which procedurally generate visual data, pose a challenge distinct from conventional text or code because their semantics are inherently visual. The task requires an LLM to answer semantic questions about the rendered image using only the symbolic program as input, which demands a form of "visual imagination."
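To make the setup concrete, the sketch below shows a hypothetical instance of the task (not taken from the benchmark itself): the model receives only the program text and a question, never the rendered image.

```python
# Hypothetical example: the LLM sees only the program text below, never the
# rendering, and must answer a semantic question about the image it produces.
svg_program = """
<svg xmlns="http://www.w3.org/2000/svg" width="100" height="100">
  <circle cx="50" cy="40" r="20" fill="yellow"/>                <!-- head -->
  <circle cx="43" cy="35" r="3"  fill="black"/>                 <!-- left eye -->
  <circle cx="57" cy="35" r="3"  fill="black"/>                 <!-- right eye -->
  <path d="M 40 48 Q 50 58 60 48" stroke="black" fill="none"/>  <!-- smile -->
</svg>
"""

question = (
    "What does the rendered image depict?\n"
    "A) A house  B) A smiling face  C) A car  D) A tree"
)

# Answering correctly (here: B) requires "imagining" the rendered output
# purely from the symbolic description.
prompt = f"{svg_program}\n{question}\nAnswer with a single letter."
```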
Methodology
The authors propose a task that measures an LLM's ability to answer questions about the visual content generated by symbolic graphics programs. The task is realized through two main components: a benchmark called SGP-Bench and a novel finetuning method named Symbolic Instruction Tuning (SIT).
Benchmark Creation
SGP-Bench evaluates the semantic understanding of symbolic graphics programs across two program families:
- SVG Programs: Scalable Vector Graphics, which describe 2D vector images declaratively in XML.
- CAD Programs: Computer-aided design programs that construct 2D sketches and 3D objects.
Two types of evaluations are incorporated:
- Semantic Understanding: This involves answering semantic questions based on symbolic program input.
- Semantic Consistency: This measures whether answers remain stable when the symbolic programs undergo semantics-preserving spatial transformations (translations and rotations); a minimal sketch of such a check follows below.
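Below is a minimal sketch of how such a consistency check could be implemented, assuming a hypothetical `ask_llm(prompt) -> str` helper that queries the model; the benchmark's exact transformation and evaluation procedure may differ.

```python
import re

def translate_svg(svg_text: str, dx: float, dy: float) -> str:
    """Wrap the SVG body in a group that applies a translation (illustrative only)."""
    match = re.search(r"(<svg[^>]*>)(.*)(</svg>)", svg_text, re.S)
    head, body, tail = match.groups()
    return f'{head}<g transform="translate({dx},{dy})">{body}</g>{tail}'

def is_consistent(svg_text: str, question: str, ask_llm) -> bool:
    """The translation preserves semantics, so the answer should not change."""
    original = ask_llm(f"{svg_text}\n{question}")
    shifted = ask_llm(f"{translate_svg(svg_text, 10, -5)}\n{question}")
    return original.strip() == shifted.strip()
```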
Evaluation
The authors benchmark a range of commercial and open-source LLMs, analyzing their semantic understanding and the consistency of their answers under transformations. The benchmark comprises two datasets:
- SVG Dataset: Contains 1,085 programs with questions covering different semantic aspects such as shape, color, and reasoning.
- CAD Dataset: Comprises 2,400 programs from various source datasets, formatted as 3D, 3D_reconstruction, and 2D sketches; a schematic example of a command-sequence CAD program follows below.
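For intuition, the following is a purely hypothetical command-sequence CAD program in a sketch-and-extrude style; the benchmark's actual program format may differ.

```python
# Hypothetical CAD-style program, loosely modeled on sketch-and-extrude
# command sequences; not the benchmark's actual format.
cad_program = [
    {"command": "sketch_circle",    "center": (0.0, 0.0),   "radius": 10.0},
    {"command": "sketch_rectangle", "corner": (-2.0, -2.0), "width": 4.0, "height": 4.0},
    {"command": "extrude",          "distance": 5.0,        "operation": "new_body"},
]

# A benchmark-style question would then ask, for example, what 3D object this
# program produces, with the answer given as a multiple-choice letter.
```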
Results
The results showcase the following key points:
- Scaling Law Observance: Within the same model family, larger models perform better, consistent with expected scaling behavior.
- Performance Variability: Proprietary models like GPT-4 demonstrate superior performance compared to open-source counterparts.
- Understanding Capabilities: Even the leading models fall short, particularly on challenging subsets such as SGP-MNIST, where accuracy on SVG programs that render MNIST digits remains close to chance. This underscores the difficulty of interpreting symbolic programs.
Table: Summary of LLM Performance on SGP-Bench
| Model | SVG Avg | CAD Avg | SVG Invariance | CAD Invariance |
|---|---|---|---|---|
| GPT-3.5 Turbo | 0.498 | 0.576 | 0.897 | 0.870 |
| GPT-4 Turbo | 0.609 | 0.716 | 0.867 | 0.835 |
| GPT-4o Mini | 0.585 | 0.659 | 0.881 | 0.852 |
| GPT-4o | 0.633 | 0.733 | 0.878 | 0.844 |
| Claude 3.5 | 0.674 | 0.742 | 0.903 | 0.870 |
Symbolic Instruction Tuning (SIT)
To improve the semantic understanding of LLMs, the authors introduce SIT. This technique finetunes LLMs on an instruction-following dataset generated with vision-language models such as GPT-4o. The SIT dataset comprises 55K symbolic programs paired with detailed semantic descriptions of their rendered images.
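A minimal sketch of how one such training example could be assembled is shown below, assuming two hypothetical helpers: `render_program(program)`, which rasterizes the symbolic program, and `describe_image(image)`, which has a vision-language model such as GPT-4o caption the rendering. This is an illustration, not the authors' exact pipeline.

```python
def build_sit_example(program: str, render_program, describe_image) -> dict:
    """Pair a symbolic program with a semantic description of its rendering."""
    image = render_program(program)        # rasterize the symbolic program
    description = describe_image(image)    # vision-LLM writes a semantic description
    return {
        "instruction": "Describe the image rendered by this symbolic graphics program.",
        "input": program,
        "output": description,             # supervision target for instruction tuning
    }
```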
Table: Impact of SIT on Model Performance
| Dataset Size | Accuracy (Llama3-8B, %) | Accuracy (Gemma-7B, %) |
|---|---|---|
| Original | 43.20 | 39.33 |
| SIT-10k | 48.16 | 45.60 |
| SIT-25k | 51.43 | 46.87 |
| SIT-40k | 45.62 | 45.21 |
| SIT-55k | 40.99 | 47.28 |
Implications and Future Directions
The research opens several new avenues:
- Visual Reasoning: It highlights the potential for LLMs to perform visual reasoning tasks without direct visual inputs.
- Benchmark Rigor: SGP-Bench provides a robust means of separating models by their proficiency in symbolic program understanding.
- Future Research: Further exploration into combining different symbolic representations and improving SIT datasets could significantly enhance LLM capabilities.
While current models show promise, considerable improvements are needed to approach human-like understanding. Future work might focus on enriching training data, refining finetuning procedures, and investigating the mechanisms underlying LLM reasoning.
Conclusion
The paper underscores a critical evaluation area for LLMs by introducing a benchmark tailored to symbolic graphics programs. Despite the progress, the paper reveals that current LLMs have significant room for growth in understanding symbolic visual representations. Symbolic Instruction Tuning presents a promising direction, showing measurable improvements in LLM performance on this challenging task. As such, this research forms a foundational step towards enhancing the visual reasoning abilities of LLMs.