SBS Figures Dataset

Updated 11 August 2025
  • SBS Figures Dataset is a synthetic dataset that provides densely annotated scientific charts and carefully crafted QA pairs for robust visual question answering.
  • It employs a three-stage pipeline combining LLM-driven topic generation with Python-based figure rendering to ensure extensive visual diversity and accuracy.
  • Pre-training with SBS Figures significantly enhances chart QA accuracy, bridging performance gaps in data-scarce scenarios.

The SBS Figures Dataset is a large-scale, synthetically generated resource designed for pre-training and evaluating visual question answering (QA) models on scientific figures and charts. It employs a stage-by-stage pipeline to generate chart images, complete programmatic annotations, and dense QA pairs, specifically addressing the limitations of earlier synthetic approaches that rely on end-to-end LLM code generation. The dataset’s principal objective is to enable effective pre-training of visual LLMs for figure QA, yielding models that generalize robustly even when real-world chart data is limited.

1. Synthetic Generation Pipeline

The SBS Figures Dataset is created through a structured, three-stage process:

  1. Data Generation: An LLM (GPT-3.5-turbo) is prompted to produce diverse data "topics" for targeted chart types (e.g., bar, line, and pie charts). For every topic, a JSON specification is generated representing key visual and textual attributes: titles, axis descriptions, data labels, numeric values, and color codes. Few-shot prompting with roughly ten in-domain examples per chart type is used to ensure consistent data formatting and maintain topic diversity (a minimal sketch of the pipeline appears below).
  2. Figure Rendering: Rather than using the LLM for code generation at every instantiation—a process prone to code errors and visual homogeneity—SBS Figures employs pre-defined, type-specific Python scripts (vetted and generated using GPT-4) to render the final images. The rendering process introduces extensive randomization across visual parameters, including but not limited to font style, title position, legend placement, marker style, and number display. Each chart type supports up to approximately 2,000 unique stylistic combinations.
  3. QA Pair Generation: With the data fully described in JSON, QA annotations are synthesized via a second LLM step. Few-shot prompting ensures coverage across a breadth of reasoning types, ranging from data extraction (e.g., retrieving the value of a data point) to computation (e.g., summing across categories) and color reference (e.g., mapping categories to colors). QA generation leverages the inherent completeness of the annotation—bypassing any need for optical character recognition (OCR).

This modular separation—LLM for semantic content and QA, deterministic code for rendering—minimizes code failure modes and enforces visual diversity.
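
To make this separation concrete, the following is a minimal, hypothetical sketch of the pipeline for bar charts. It is not the dataset's actual code: the JSON field names, the style options, and the template-based QA step (which in SBS Figures is a second LLM call) are illustrative assumptions. Only the overall structure, with the LLM-produced JSON driving both a parametric Python renderer and the QA stage, mirrors the description above.

```python
import json
import random
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

# Stage 1 (LLM, not shown) would produce a JSON spec like this one.
# The field names here are illustrative, not the dataset's actual schema.
spec = {
    "title": "Quarterly Solar Panel Installations",
    "x_label": "Quarter",
    "y_label": "Installations",
    "labels": ["Q1", "Q2", "Q3", "Q4"],
    "values": [120, 150, 90, 180],
    "colors": ["#4C72B0", "#DD8452", "#55A868", "#C44E52"],
}

# Stage 2: a pre-defined, type-specific renderer with randomized style
# parameters (font, title placement, value labels, grid, ...).
STYLE_SPACE = {
    "font_family": ["DejaVu Sans", "DejaVu Serif", "monospace"],
    "title_loc": ["left", "center", "right"],
    "show_values": [True, False],
    "grid": [True, False],
    "bar_width": [0.5, 0.7, 0.9],
}

def render_bar_chart(spec, out_path, rng):
    style = {k: rng.choice(v) for k, v in STYLE_SPACE.items()}
    plt.rcParams["font.family"] = style["font_family"]
    fig, ax = plt.subplots(figsize=(5, 3.5))
    bars = ax.bar(spec["labels"], spec["values"],
                  width=style["bar_width"], color=spec["colors"])
    ax.set_title(spec["title"], loc=style["title_loc"])
    ax.set_xlabel(spec["x_label"])
    ax.set_ylabel(spec["y_label"])
    if style["grid"]:
        ax.grid(axis="y", alpha=0.3)
    if style["show_values"]:
        ax.bar_label(bars)
    fig.tight_layout()
    fig.savefig(out_path)
    plt.close(fig)
    return style

# Stage 3: QA pairs derived purely from the JSON. In SBS Figures this is an
# LLM few-shot step; simple templates stand in for it here.
def make_qa_pairs(spec):
    max_i = max(range(len(spec["values"])), key=spec["values"].__getitem__)
    return [
        {"question": f"What is the value of {spec['labels'][0]}?",
         "answer": str(spec["values"][0])},
        {"question": "Which category has the highest value?",
         "answer": spec["labels"][max_i]},
        {"question": "What is the sum of all values?",
         "answer": str(sum(spec["values"]))},
    ]

rng = random.Random(0)
style_used = render_bar_chart(spec, "figure_000.png", rng)
record = {"figure": "figure_000.png", "annotation": spec,
          "style": style_used, "qa": make_qa_pairs(spec)}
print(json.dumps(record, indent=2))
```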

2. Dataset Annotation Structure

Each instance within SBS Figures contains:

  • JSON annotation: Encodes the full chart content.
    • Text: Titles, axis labels, categorical labels.
    • Numbers: Precise values for all data series or chart regions.
    • Colors: Explicit color codes for major chart elements, legends, and (if applicable) data marks.
  • Rendered chart image: Produced from the JSON and randomized visual layout parameters.
  • QA pairs: Dense and programmatic, spanning derived and direct questions across all chart components.

Examples of question types include:

  • Data extraction: “What is the value of the highest bar?”
  • Color association: “Which label is represented in blue?”
  • Statistical reasoning: “How many categories exceed a value threshold?”

This architecture ensures dense labeling without annotation gaps and supports a robust, direct mapping between chart visual elements and their semantic and quantitative meaning.
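
As a concrete, hypothetical illustration of such an instance (field names are assumptions; the released dataset's schema may differ):

```python
# Hypothetical SBS-Figures-style instance; field names are illustrative only.
instance = {
    "image": "bar_chart_01234.png",
    "annotation": {
        "text": {
            "title": "Annual Rainfall by City",
            "x_label": "City",
            "y_label": "Rainfall (mm)",
            "categories": ["Lyon", "Oslo", "Kyoto"],
        },
        "numbers": {"values": [830, 1010, 1560]},
        "colors": {"Lyon": "#4C72B0", "Oslo": "#DD8452", "Kyoto": "#55A868"},
    },
    "qa_pairs": [
        {"type": "data extraction",
         "question": "What is the value of the highest bar?",
         "answer": "1560"},
        {"type": "color association",
         "question": "Which city is represented in blue?",
         "answer": "Lyon"},
        {"type": "statistical reasoning",
         "question": "How many cities exceed 1000 mm of rainfall?",
         "answer": "2"},
    ],
}
```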

3. Pre-training Utility and Transfer Effects

SBS Figures demonstrates strong utility in transfer learning for chart question answering. When models such as Donut and Pix2Struct are pre-trained solely on SBS Figures and subsequently fine-tuned on downstream datasets such as ChartQA, QA accuracy rises from approximately 54% when training from scratch (no pre-training) to over 60% with SBS Figures pre-training. These gains stem from both the diversity of graphical appearances in SBS Figures and the exhaustiveness of its annotation, which together promote the acquisition of generalizable reasoning and data-extraction capabilities.

This effect is particularly salient when real-world QA data is limited, with the synthetic pre-training closing performance gaps that would otherwise require larger-scale manual annotation.
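
A minimal sketch of this two-phase recipe is given below. It assumes a Hugging Face-style seq2seq vision-language model whose forward pass returns a loss; `load_model`, `encode_example`, and the dataset iterables (`sbs_figures_qa`, `chartqa_train`) are hypothetical placeholders, not the authors' actual training code.

```python
import torch

def run_phase(model, encode_example, dataset, epochs, lr):
    """Generic training loop over (image, question, answer) triples."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for image, question, answer in dataset:
            batch = encode_example(image, question, answer)  # model inputs + labels
            loss = model(**batch).loss  # HF-style models expose .loss when labels are given
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    return model

# Hypothetical helper returning a vision-language QA model (e.g. Donut or
# Pix2Struct) together with an encoding function built from its processor.
model, encode_example = load_model("pix2struct-base")

model = run_phase(model, encode_example, sbs_figures_qa, epochs=1, lr=1e-5)  # phase 1: synthetic pre-training
model = run_phase(model, encode_example, chartqa_train, epochs=5, lr=1e-5)   # phase 2: fine-tune on ChartQA
```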

4. Addressing LLM-based Synthesis Challenges

Direct synthesis of chart visualizations and annotations using LLMs often results in two primary issues: code errors (incorrect or non-executable code for figure rendering) and low intra-dataset diversity (repetitive figure appearances). The SBS Figures pipeline mitigates these by:

  • Constraining LLM use to the semantic/topical level (not code or rendering).
  • Relying on a library of pre-tested, type-specific Python scripts for all figure creation.
  • Randomizing rendering parameters extensively, driving high visual variance.
  • Isolating QA generation to a stage informed only by the accurate and complete JSON.

This decomposition ensures that annotation and rendering errors are minimized and the figure corpus is both semantically rich and visually heterogeneous.
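
The scale of the rendering-side randomization follows from simple arithmetic over a style space; the parameters below are hypothetical, but a handful of independent choices quickly compounds to the roughly 2,000 per-type variants cited above.

```python
from math import prod

# Hypothetical per-chart-type style space (the real scripts' options differ).
style_space = {
    "font_family": 4,
    "title_position": 3,    # left / center / right
    "legend_placement": 4,
    "marker_style": 5,
    "number_display": 2,    # value labels on or off
    "grid": 2,
    "bar_width": 2,
}

print(prod(style_space.values()))  # 4*3*4*5*2*2*2 = 1920, roughly 2,000 combinations
```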

5. Dataset Scale and Composition

The final dataset comprises a large volume of synthetic chart images (the precise total is not specified, though the scale is consistent with large-scale synthetic dataset practice), each provided with the following:

| Component | Annotation Method | Coverage and Diversity |
| --- | --- | --- |
| Chart images | Parametric Python scripts | ~2,000 visual variants per chart type |
| Data JSON | LLM (few-shot prompting) | Full structural annotation |
| QA pairs | LLM, programmatic | Dense, multi-type QA per figure |

The generation pipeline is extensible and has been applied to major chart archetypes (e.g., bar, line, pie), with modularity facilitating future chart types or domain-specific figure extensions.
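
One way this modularity could look in practice is a renderer registry, sketched below under the same assumptions as the earlier pipeline sketch; `render_line_chart` and `render_pie_chart` are hypothetical analogues of `render_bar_chart`, and the data-generation and QA stages are untouched because they operate on the JSON spec alone.

```python
# Hypothetical renderer registry: adding a chart type means registering one
# more vetted, parametric rendering function; the other stages are unchanged.
RENDERERS = {
    "bar": render_bar_chart,    # from the earlier sketch
    "line": render_line_chart,  # hypothetical analogue
    "pie": render_pie_chart,    # hypothetical analogue
}

def render_figure(spec, chart_type, out_path, rng):
    return RENDERERS[chart_type](spec, out_path, rng)
```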

6. Applications and Broader Implications

SBS Figures is primarily intended for pre-training large visual-LLMs for figure QA—directly targeting tasks where deep reasoning over quantitative and textual chart content is required. Its dense, programmatically complete annotation enables:

  • Pre-training for high-fidelity chart question answering.
  • Integration into document parsing pipelines where scientific or business reporting figures play a critical role.
  • Potential adaptation for multimodal models that require document-level understanding incorporating figures.

A plausible implication is that the modular, stage-by-stage generation architecture could be adapted for other forms of synthetic data generation beyond charts—enabling the construction of highly annotated, diverse multisource datasets in other scientific visual genres.

7. Research Outlook and Continued Development

The stage-by-stage pipeline of SBS Figures substantially reduces manual annotation requirements, supports extensive visual diversity, solves key challenges present in naïve LLM-based generation, and delivers a powerful pre-training signal. The approach has demonstrated that synthetic data, when coupled with robust annotation and diversified rendering, can yield models with both strong generalization and rapid adaptability in data-constrained target domains. Extensions may include expansion to further figure types, integration with real-world hybrid datasets, and application in broader multimodal understanding scenarios.
