
ChartComplete Dataset Overview

Updated 22 January 2026
  • ChartComplete is a benchmark dataset featuring 30 distinct chart types defined by a modified visualization taxonomy to enhance chart-type classification.
  • The dataset comprises 1,500 high-quality images with a balanced distribution, collected through automated scraping and manual curation for enhanced diversity.
  • It supports multimodal model evaluation; t-SNE clustering and CKA heatmap analyses of ViT embeddings indicate that the chart types have visually distinct feature representations.

ChartComplete is a comprehensive benchmark dataset designed for inclusive chart-type classification, with a focus on supporting the development and evaluation of multimodal LLMs (MLLMs), specialized vision-LLMs, and chart understanding systems. Addressing the limitations of prior datasets, which typically encompass only a narrow subset of chart types, ChartComplete is constructed according to a taxonomy adapted from the visualization research community, comprising a broad spectrum of thirty distinct chart types. The dataset consists exclusively of chart images with explicit taxonomy labeling, does not provide direct learning signals, and is intended as a foundation for further research on chart understanding (Mustapha et al., 15 Jan 2026).

1. Taxonomy Structure and Chart Types

ChartComplete is organized using a modified version of Borkin et al.’s visualization taxonomy. The taxonomy is hierarchically grouped into twelve broad chart categories, each encompassing specific chart "terms" or types. In total, thirty chart types are represented.

Chart Type Taxonomy

| Category | Example Types | Description (brief) |
|---|---|---|
| Area | Vanilla Area, Overlapped Area, Stacked Area, Proportional Area | Variants of area charts for trend and proportion visualization |
| Bar | Vanilla Bar, Grouped Bar, Waterfall | Bar charts for interval-based comparisons and additive contributions |
| Circle | Pie Chart, Donut Chart | Categorical proportion by angles or sector area |
| Diagram | Flow Diagram, Sankey, Timeline | Network flows, weighted paths, event ordering |
| Distribution | Curve Plot, Histogram, Box and Whisker | Statistical distribution visualizations |
| Matrix | Heat Map | Value-mapped color grids |
| Line | Vanilla Line, Stacked Line, Radar Chart, Surface Plot, Parallel Coordinates | Trends, radar/spider charts, high-dimensional visualization |
| Map | Choropleth Map, Contour Map | Geographical and spatial data depictions |
| Point | Scatter Plot, Stacked Scatter, Bubble Chart, Stacked Bubble | Point-based and size-encoded relations |
| Text | Word Cloud | Text frequency/weight visualization |
| Tree | Tree Map | Hierarchical data via nested rectangles |
| Combination | Bar and Line Chart | Composite visual encodings with shared axes |

Each of the 30 chart types is precisely defined with standard visualization semantics. For instance, "Stacked Area" denotes multiple area plots stacked cumulatively, while "Sankey" specifically refers to flow diagrams where link width encodes flow quantity.

2. Dataset Construction

ChartComplete was constructed to ensure a uniform class distribution and maximized visual diversity. The collection strategy was strictly taxonomy-driven, with an explicit target of 50 high-quality images per chart type, totaling 1,500 images.

Collection Methodology

Image sources included:

  • Statista's "Chart of the Day": 12,635 images scraped on January 27, 2025.
  • Our World in Data: 4,113 images scraped through "Browse by Topic."
  • Manual Web Collection: Used to fill types insufficiently represented in automated scraping.

Automated pre-filtering involved extracting deep visual features using Google ViT (vit-base-patch16-224) and indexing with FAISS for efficient nearest-neighbor retrieval. A random image was sampled, its 100 nearest neighbors were retrieved, and manual inspection was performed to select images of the correct chart type until the target per class ($C = 50$) was reached. Manual collection completed categories with insufficient scraped samples.
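The retrieval step can be illustrated with a minimal sketch. It is not the authors' pipeline: random vectors stand in for ViT embeddings (768 dimensions is the hidden size of ViT-Base), a brute-force NumPy search stands in for the FAISS index, and cosine similarity is an assumed metric, since the source does not state which distance was used. The function names are hypothetical.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def nearest_neighbors(embeddings, query_index, k=100):
    """Return indices of the k most similar rows to the query row, best first.

    Brute-force stand-in for an approximate index such as FAISS.
    The query itself is excluded from the result.
    """
    unit = l2_normalize(embeddings)
    sims = unit @ unit[query_index]
    sims[query_index] = -np.inf          # never return the query image
    k = min(k, len(sims) - 1)
    top = np.argpartition(-sims, k - 1)[:k]   # unordered top-k
    return top[np.argsort(-sims[top])]        # sort the top-k by similarity


# Placeholder features: in practice these would be ViT embeddings of scraped images.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 768)).astype(np.float32)

query = int(rng.integers(len(embeddings)))    # sample a random seed image
candidates = nearest_neighbors(embeddings, query, k=100)
# `candidates` is the shortlist a human annotator would inspect by chart type.
```

The manual inspection step has no code equivalent here; the sketch only produces the candidate shortlist that such an inspection would start from.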

Provenance Breakdown

| Method | Count | % of Total |
|---|---|---|
| Manual | 951 | 63.4 % |
| Scraped | 549 | 36.6 % |

Class Balance and Metrics

Class balance is strictly enforced: $n_i = 50$ for all $i \in [1, 30]$, giving $N = 1{,}500$. The Gini-Simpson diversity index is $D = 1 - \sum_{i=1}^{30} (n_i/N)^2 = 1 - 1/30 \approx 0.967$, indicating high class uniformity.
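As a quick check of the arithmetic, the Gini-Simpson index for 30 classes of 50 images each can be computed exactly. This is a small sketch; the function name is illustrative only.

```python
from fractions import Fraction


def gini_simpson(counts):
    """Gini-Simpson index: 1 minus the sum of squared class proportions."""
    total = sum(counts)
    return 1 - sum(Fraction(n, total) ** 2 for n in counts)


counts = [50] * 30            # 30 chart types, 50 images each
d = gini_simpson(counts)      # exact value 29/30, about 0.967
```

For a perfectly balanced set of k classes the index reduces to 1 - 1/k, so 29/30 is the largest value attainable with 30 classes.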

Visual feature diversity was measured using t-SNE on ViT embeddings, producing well-delineated clusters by chart type. A Centered Kernel Alignment (CKA) heatmap quantified pairwise feature similarity, with

$$\mathrm{CKA}(K, L) = \frac{\mathrm{HSIC}(K, L)}{\sqrt{\mathrm{HSIC}(K, K) \cdot \mathrm{HSIC}(L, L)}}$$

where HSIC denotes the Hilbert–Schmidt independence criterion. Most chart types show low inter-class visual feature similarity.
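The CKA definition can be sketched in a few lines of NumPy. The sketch assumes linear-kernel Gram matrices and the standard biased empirical HSIC estimator; the source does not say which kernel or estimator the authors used, nor how K and L were built per chart type, so the random features below are placeholders.

```python
import numpy as np


def center(K):
    """Double-center a Gram matrix: H K H with H = I - (1/n) 11^T."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H


def hsic(K, L):
    """Biased empirical HSIC estimate between two n x n Gram matrices."""
    n = K.shape[0]
    return np.trace(center(K) @ center(L)) / (n - 1) ** 2


def cka(K, L):
    """Centered Kernel Alignment: HSIC normalized by the self-HSIC terms."""
    return hsic(K, L) / np.sqrt(hsic(K, K) * hsic(L, L))


# Placeholder features for the same 50 examples under two representations.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 16))
Y = rng.normal(size=(50, 16))
K, L = X @ X.T, Y @ Y.T       # linear-kernel Gram matrices (an assumption)

self_similarity = cka(K, K)   # equals 1 up to floating-point error
cross_similarity = cka(K, L)  # lies in [0, 1] for positive semidefinite kernels
```

Because of the normalization, CKA is unaffected by rescaling either Gram matrix, which makes values comparable across pairs in a heatmap.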

3. Data Format and Annotations

The dataset is organized as a directory hierarchy:

  • Root Level: 12 category folders (e.g., “Area,” “Bar,” “Circle”)
  • Category Folders: One subfolder per chart type (e.g., “VanillaArea,” “StackedArea”)
  • Chart-Type Folders: 50 image files, each prefixed by "collected" (manual) or "scraped" (automated) to indicate provenance

All files are in JPEG or PNG format, with a resolution of at least 300 px in each dimension (no upper bound). The implicit chart-type label is furnished by the folder structure; provenance (scraped vs. collected) is encoded in the filename. No further JSON labels, bounding boxes, underlying data tables, or augmented annotations are provided.
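Since labels live in the folder structure and provenance in the filename prefix, an index of the dataset can be derived by walking the tree. The following is a minimal sketch under that description; the exact file naming beyond the "collected"/"scraped" prefix is not specified in the source, and `index_dataset` is a hypothetical helper, not part of the released code.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def index_dataset(root):
    """Build one record per image from a root/category/chart_type/file layout."""
    records = []
    for path in sorted(Path(root).glob("*/*/*")):
        if not path.is_file() or path.suffix.lower() not in IMAGE_SUFFIXES:
            continue                      # skip non-image files and stray folders
        name = path.name.lower()
        if name.startswith("collected"):
            provenance = "collected"      # manually gathered
        elif name.startswith("scraped"):
            provenance = "scraped"        # from automated scraping
        else:
            provenance = "unknown"        # prefix missing; flagged, not guessed
        records.append({
            "path": path,
            "category": path.parent.parent.name,   # e.g. "Area"
            "chart_type": path.parent.name,        # e.g. "StackedArea"
            "provenance": provenance,
        })
    return records
```

Calling `index_dataset` on a local copy of the dataset root would, if the layout matches the description above, yield 1,500 records whose `chart_type` field serves as the classification label.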

4. Applications and Evaluation

ChartComplete is primarily intended as a benchmark for evaluating the chart-type classification capabilities of MLLMs, vision-LLMs, and specialized chart QA or summarization systems.

The dataset addresses a major limitation of prior chart benchmarks, which are restricted to a narrow selection of 5–7 types, by providing coverage of 30 types, including uncommon and composite charts. No end-to-end classification or question-answering baselines are reported by the authors. Instead, the paper details clustering experiments using t-SNE on ViT embeddings and reports a CKA heatmap analysis, demonstrating that the chart types have discriminatively distinct feature representations.

The dataset enables researchers to test model generalization across a broad range of visualization conventions and motivates the creation of more nuanced, chart-focused multimodal evaluation protocols.

5. Limitations and Future Extensions

ChartComplete currently includes only chart images with taxonomy labels and provenance. Key absences include:

  • No question–answer pairs, data tables, or other direct “learning signal.”
  • No vector/CSV underlying data, bounding boxes for chart elements, or OCR output.
  • Metadata is limited: only "collected" vs. "scraped" provenance; no publisher, timestamp, or additional attribution.

The authors propose several directions for dataset extension:

  • Chart–data alignment: Addition of underlying data tables (e.g., CSV, JSON) for each chart.
  • Semantic annotations: Inclusion of axis, tick, and legend bounding boxes; OCR transcripts.
  • Downstream benchmarks: Establishing ChartQA tasks, chart summarization, and chart-based fact-checking benchmarks.
  • Community contributions: Soliciting expansion of chart types, modalities, and QA pairs while retaining the current taxonomy structure.

ChartComplete is distributed under a CC BY license, and all code for collection and processing is available at https://github.com/AI-DSCHubAUB/ChartComplete-Dataset/, facilitating community adoption and further development (Mustapha et al., 15 Jan 2026).
