ChartComplete Dataset Overview

Updated 22 January 2026

ChartComplete is a benchmark dataset for chart-type classification that covers 30 distinct chart types defined by a modified visualization taxonomy.
The dataset comprises 1,500 images with a balanced distribution of 50 per chart type, collected through automated scraping and manual curation.
t-SNE clustering and a CKA heatmap of ViT embeddings indicate that the chart types have discriminative visual feature representations, supporting the dataset's use for multimodal model evaluation.

ChartComplete is a comprehensive benchmark dataset designed for inclusive chart-type classification, with a focus on supporting the development and evaluation of multimodal LLMs (MLLMs), specialized vision-LLMs, and chart understanding systems. Addressing the limitations of prior datasets, which typically encompass only a narrow subset of chart types, ChartComplete is constructed according to a taxonomy adapted from the visualization research community, comprising a broad spectrum of thirty distinct chart types. The dataset consists exclusively of chart images with explicit taxonomy labeling, does not provide direct learning signals, and is intended as a foundation for further research on chart understanding (Mustapha et al., 15 Jan 2026).

1. Taxonomy Structure and Chart Types

ChartComplete is organized using a modified version of Borkin et al.’s visualization taxonomy. The taxonomy is hierarchically grouped into twelve broad chart categories, each encompassing specific chart "terms" or types. In total, thirty chart types are represented.

Chart Type Taxonomy

Category	Example Types	Description (brief)
Area	Vanilla Area, Overlapped Area, Stacked Area, Proportional Area	Variants of area charts for trend and proportion visualization
Bar	Vanilla Bar, Grouped Bar, Waterfall	Bar charts for interval-based comparisons and additive contributions
Circle	Pie Chart, Donut Chart	Categorical proportion by angles or sector area
Diagram	Flow Diagram, Sankey, Timeline	Network flows, weighted paths, event ordering
Distribution	Curve Plot, Histogram, Box and Whisker	Statistical distribution visualizations
Matrix	Heat Map	Value-mapped color grids
Line	Vanilla Line, Stacked Line, Radar Chart, Surface Plot, Parallel Coordinates	Trends, radar/spider charts, high-dimensional visualization
Map	Choropleth Map, Contour Map	Geographical and spatial data depictions
Point	Scatter Plot, Stacked Scatter, Bubble Chart, Stacked Bubble	Point-based and size-encoded relations
Text	Word Cloud	Text frequency/weight visualization
Tree	Tree Map	Hierarchical data via nested rectangles
Combination	Bar and Line Chart	Composite visual encodings with shared axes

Each of the 30 chart types is precisely defined with standard visualization semantics. For instance, "Stacked Area" denotes multiple area plots stacked cumulatively, while "Sankey" specifically refers to flow diagrams where link width encodes flow quantity.

2. Dataset Construction

ChartComplete was constructed to ensure a uniform class distribution and maximized visual diversity. The collection strategy was strictly taxonomy-driven, with an explicit target of 50 high-quality images per chart type, totaling 1,500 images.

Collection Methodology

Image sources included:

Statista's "Chart of the Day": 12,635 images scraped on January 27, 2025.
Our World in Data: 4,113 images scraped through "Browse by Topic."
Manual Web Collection: Used to fill types insufficiently represented in automated scraping.

Automated pre-filtering involved extracting deep visual features using Google ViT (vit-base-patch16-224) and indexing them with FAISS for efficient nearest-neighbor retrieval. A random image was sampled, its 100 nearest neighbors were retrieved, and the results were manually inspected to select images of the correct chart type until the per-class target ($C = 50$) was reached. Manual collection completed the categories with insufficient scraped samples.
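The retrieval step can be illustrated with a brute-force nearest-neighbor search. This is a minimal sketch under stated assumptions, not the authors' pipeline: the toy 2-D vectors stand in for ViT embeddings, Euclidean distance is assumed (the source does not name the metric), and `nearest_neighbors` is a hypothetical helper. At the scale of the scraped pools, a FAISS index would replace the linear scan.

```python
import math
import random

def nearest_neighbors(embeddings, query_idx, k):
    """Return indices of the k embeddings closest (Euclidean) to the query, excluding the query itself."""
    query = embeddings[query_idx]
    distances = [
        (math.dist(query, vec), idx)
        for idx, vec in enumerate(embeddings)
        if idx != query_idx
    ]
    distances.sort()
    return [idx for _, idx in distances[:k]]

# Toy data: two well-separated groups standing in for two chart types.
rng = random.Random(0)
group_a = [[rng.gauss(0.0, 0.1), rng.gauss(0.0, 0.1)] for _ in range(5)]
group_b = [[rng.gauss(5.0, 0.1), rng.gauss(5.0, 0.1)] for _ in range(5)]
embeddings = group_a + group_b

seed_idx = rng.randrange(len(embeddings))  # sample a random seed image
candidates = nearest_neighbors(embeddings, seed_idx, k=4)
# The candidates would then be inspected manually and kept only if they
# show the target chart type.
```

Because the two toy groups are far apart, all four candidates come from the same group as the seed, which is the property that makes the neighbor list a useful shortlist for manual inspection.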

Provenance Breakdown

Method	Count	% of Total
Manual	951	63.4 %
Scraped	549	36.6 %

Class Balance and Metrics

Class balance is strictly enforced: $n_i = 50$ for all $i \in [1, 30]$, with $N = 1{,}500$. The Gini-Simpson diversity index is $D = 1 - \sum_{i=1}^{30} (n_i/N)^2 = 1 - 1/30 \approx 0.967$, which is the maximum attainable with 30 classes, as expected under perfect balance.
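The diversity figure can be checked with a few lines of exact arithmetic. The sketch below uses a hypothetical helper, `gini_simpson`, and an invented skewed distribution purely for contrast; only the balanced counts come from the dataset description.

```python
from fractions import Fraction

def gini_simpson(counts):
    """Gini-Simpson index: 1 minus the sum of squared class proportions."""
    total = sum(counts)
    return 1 - sum(Fraction(n, total) ** 2 for n in counts)

balanced = [50] * 30               # 50 images for each of the 30 chart types
d_balanced = gini_simpson(balanced)
print(d_balanced, round(float(d_balanced), 3))  # 29/30 0.967

# Illustrative contrast (not from the dataset): the same 1,500 images
# concentrated in only 15 of the 30 classes.
skewed = [100] * 15 + [0] * 15
d_skewed = gini_simpson(skewed)    # 14/15, lower than the balanced case
```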

Visual feature diversity was measured using t-SNE on ViT embeddings, producing well-delineated clusters by chart type. A Centered Kernel Alignment (CKA) heatmap quantified pairwise feature similarity, with

$\mathrm{CKA}(K, L) = \frac{\mathrm{HSIC}(K, L)}{\sqrt{\mathrm{HSIC}(K,K) \cdot \mathrm{HSIC}(L,L)}}$

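where $K$ and $L$ are the kernel (Gram) matrices computed from the two sets of features being compared, and HSIC denotes the Hilbert–Schmidt independence criterion. Most chart types show low inter-class visual feature similarity.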
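The CKA formula can be made concrete with a small pure-Python sketch. It assumes a linear kernel and the biased empirical HSIC estimator $\mathrm{tr}(KHLH)/(n-1)^2$; the source does not state which kernel or estimator the authors used, and the helper names and toy features are illustrative only.

```python
import math

def gram_linear(X):
    """Linear-kernel Gram matrix K[i][j] = <x_i, x_j>."""
    return [[sum(a * b for a, b in zip(xi, xj)) for xj in X] for xi in X]

def center(K):
    """Double-center a symmetric Gram matrix: H K H with H = I - (1/n) 11^T."""
    n = len(K)
    row_mean = [sum(row) / n for row in K]
    total_mean = sum(row_mean) / n
    return [
        [K[i][j] - row_mean[i] - row_mean[j] + total_mean for j in range(n)]
        for i in range(n)
    ]

def hsic(K, L):
    """Biased empirical HSIC: tr(K H L H) / (n - 1)^2."""
    n = len(K)
    Kc, Lc = center(K), center(L)
    trace = sum(Kc[i][j] * Lc[i][j] for i in range(n) for j in range(n))
    return trace / (n - 1) ** 2

def cka(K, L):
    """CKA(K, L) = HSIC(K, L) / sqrt(HSIC(K, K) * HSIC(L, L))."""
    return hsic(K, L) / math.sqrt(hsic(K, K) * hsic(L, L))

# Toy features: four samples with two dimensions each.
X = [[1.0, 2.0], [2.0, 0.5], [0.0, 1.5], [3.0, 3.0]]
Y = [[2 * a, 2 * b] for a, b in X]                   # same features, rescaled
Z = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]  # different features

cka_same = cka(gram_linear(X), gram_linear(X))    # equals 1 up to rounding
cka_scaled = cka(gram_linear(X), gram_linear(Y))  # also 1: CKA ignores isotropic scaling
cka_other = cka(gram_linear(X), gram_linear(Z))   # lies between 0 and 1
```

The normalization constant of HSIC cancels in the ratio, so the choice between $1/(n-1)^2$ and $1/n^2$ does not affect the CKA value.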

3. Data Format and Annotations

The dataset is organized as a directory hierarchy:

Root Level: 12 category folders (e.g., “Area,” “Bar,” “Circle”)
Category Folders: One subfolder per chart type (e.g., “VanillaArea,” “StackedArea”)
Chart-Type Folders: 50 image files, each prefixed by "collected" (manual) or "scraped" (automated) to indicate provenance

All files are in JPEG or PNG format, with a resolution of at least 300 px in each dimension (no upper bound). The chart-type label is implicit in the folder structure; provenance (scraped vs. collected) is encoded in the filename prefix. No JSON labels, bounding boxes, underlying data tables, or other annotations are provided.
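Because labels live in the folder path and provenance in the filename prefix, a loader only needs to walk the directory tree. The following is a minimal sketch assuming the category/chart-type/image layout described above; `index_dataset`, the record fields, and the example filenames are hypothetical, and the exact filename pattern after the prefix is not specified in the source.

```python
import tempfile
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}

def index_dataset(root):
    """Build one record per image: labels from the folder path, provenance from the filename prefix."""
    records = []
    for path in sorted(Path(root).glob("*/*/*")):
        if path.suffix.lower() not in IMAGE_SUFFIXES:
            continue  # skip anything that is not a JPEG or PNG image
        name = path.name.lower()
        if name.startswith("collected"):
            provenance = "collected"
        elif name.startswith("scraped"):
            provenance = "scraped"
        else:
            provenance = "unknown"
        records.append({
            "path": str(path),
            "category": path.parent.parent.name,
            "chart_type": path.parent.name,
            "provenance": provenance,
        })
    return records

# Demonstration on a throwaway directory that mimics the layout.
with tempfile.TemporaryDirectory() as tmp:
    type_dir = Path(tmp) / "Area" / "VanillaArea"
    type_dir.mkdir(parents=True)
    (type_dir / "collected_001.png").touch()
    (type_dir / "scraped_002.jpg").touch()
    (type_dir / "notes.txt").touch()  # ignored: not an image
    records = index_dataset(tmp)
```

On the real dataset, `index_dataset` would be pointed at the extracted root folder, and the resulting records could be grouped by `chart_type` to confirm the 50-per-class balance.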

4. Applications and Evaluation

ChartComplete is primarily intended as a benchmark for evaluating the chart-type classification capabilities of MLLMs, vision-LLMs, and specialized chart QA or summarization systems.

The dataset addresses a major limitation of prior chart benchmarks, which are restricted to a narrow selection of 5–7 types, by providing coverage of 30 types, including uncommon and composite charts. No end-to-end classification or question-answering baselines are reported by the authors. Instead, the paper details clustering experiments using t-SNE on ViT embeddings and reports a CKA heatmap analysis, demonstrating that the chart types have discriminatively distinct feature representations.

The dataset enables researchers to test model generalization across a broad range of visualization conventions and motivates the creation of more nuanced, chart-focused multimodal evaluation protocols.

5. Limitations and Future Extensions

ChartComplete currently includes only chart images with taxonomy labels and provenance. Key absences include:

No question–answer pairs, data tables, or other direct “learning signal.”
No vector/CSV underlying data, bounding boxes for chart elements, or OCR output.
Metadata is limited: only "collected" vs. "scraped" provenance; no publisher, timestamp, or additional attribution.

The authors propose several directions for dataset extension:

Chart–data alignment: Addition of underlying data tables (e.g., CSV, JSON) for each chart.
Semantic annotations: Inclusion of axis, tick, and legend bounding boxes; OCR transcripts.
Downstream benchmarks: Establishing ChartQA tasks, chart summarization, and chart-based fact-checking benchmarks.
Community contributions: Soliciting expansion of chart types, modalities, and QA pairs while retaining the current taxonomy structure.

ChartComplete is distributed under a CC BY license, and all code for collection and processing is available at https://github.com/AI-DSCHubAUB/ChartComplete-Dataset/, facilitating community adoption and further development (Mustapha et al., 15 Jan 2026).

References

ChartComplete: A Taxonomy-based Inclusive Chart Dataset (2026)
