ChartComplete Dataset Overview
- ChartComplete is a benchmark dataset featuring 30 distinct chart types defined by a modified visualization taxonomy to enhance chart-type classification.
- The dataset comprises 1,500 high-quality images with a balanced distribution, collected through automated scraping and manual curation for enhanced diversity.
- It supports multimodal model evaluation; t-SNE clustering and CKA heatmap analyses indicate that the chart types have visually distinct feature representations.
ChartComplete is a comprehensive benchmark dataset designed for inclusive chart-type classification, with a focus on supporting the development and evaluation of multimodal LLMs (MLLMs), specialized vision-LLMs, and chart understanding systems. Addressing the limitations of prior datasets, which typically encompass only a narrow subset of chart types, ChartComplete is constructed according to a taxonomy adapted from the visualization research community, comprising a broad spectrum of thirty distinct chart types. The dataset consists exclusively of chart images with explicit taxonomy labeling, does not provide direct learning signals, and is intended as a foundation for further research on chart understanding (Mustapha et al., 15 Jan 2026).
1. Taxonomy Structure and Chart Types
ChartComplete is organized using a modified version of Borkin et al.’s visualization taxonomy. The taxonomy is hierarchically grouped into twelve broad chart categories, each encompassing specific chart "terms" or types. In total, thirty chart types are represented.
Chart Type Taxonomy
| Category | Example Types | Description (brief) |
|---|---|---|
| Area | Vanilla Area, Overlapped Area, Stacked Area, Proportional Area | Variants of area charts for trend and proportion visualization |
| Bar | Vanilla Bar, Grouped Bar, Waterfall | Bar charts for interval-based comparisons and additive contributions |
| Circle | Pie Chart, Donut Chart | Categorical proportion by angles or sector area |
| Diagram | Flow Diagram, Sankey, Timeline | Network flows, weighted paths, event ordering |
| Distribution | Curve Plot, Histogram, Box and Whisker | Statistical distribution visualizations |
| Matrix | Heat Map | Value-mapped color grids |
| Line | Vanilla Line, Stacked Line, Radar Chart, Surface Plot, Parallel Coordinates | Trends, radar/spider charts, high-dimensional visualization |
| Map | Choropleth Map, Contour Map | Geographical and spatial data depictions |
| Point | Scatter Plot, Stacked Scatter, Bubble Chart, Stacked Bubble | Point-based and size-encoded relations |
| Text | Word Cloud | Text frequency/weight visualization |
| Tree | Tree Map | Hierarchical data via nested rectangles |
| Combination | Bar and Line Chart | Composite visual encodings with shared axes |
Each of the 30 chart types is precisely defined with standard visualization semantics. For instance, "Stacked Area" denotes multiple area plots stacked cumulatively, while "Sankey" specifically refers to flow diagrams where link width encodes flow quantity.
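The hierarchy can be written down as a plain mapping, which makes the category and type counts easy to check. The following is a minimal sketch, not part of the dataset's released code; the names `TAXONOMY` and `category_of` are hypothetical, and the type strings follow the table rather than the on-disk folder names.

```python
# Category -> chart types, transcribed from the taxonomy table in this section.
# "Stacked Area" is the Area type named in the prose of this section.
TAXONOMY = {
    "Area": ["Vanilla Area", "Overlapped Area", "Stacked Area", "Proportional Area"],
    "Bar": ["Vanilla Bar", "Grouped Bar", "Waterfall"],
    "Circle": ["Pie Chart", "Donut Chart"],
    "Diagram": ["Flow Diagram", "Sankey", "Timeline"],
    "Distribution": ["Curve Plot", "Histogram", "Box and Whisker"],
    "Matrix": ["Heat Map"],
    "Line": ["Vanilla Line", "Stacked Line", "Radar Chart", "Surface Plot",
             "Parallel Coordinates"],
    "Map": ["Choropleth Map", "Contour Map"],
    "Point": ["Scatter Plot", "Stacked Scatter", "Bubble Chart", "Stacked Bubble"],
    "Text": ["Word Cloud"],
    "Tree": ["Tree Map"],
    "Combination": ["Bar and Line Chart"],
}


def category_of(chart_type):
    """Return the category that contains chart_type, or raise KeyError."""
    for category, chart_types in TAXONOMY.items():
        if chart_type in chart_types:
            return category
    raise KeyError(chart_type)


num_categories = len(TAXONOMY)
num_chart_types = sum(len(types) for types in TAXONOMY.values())
```

With this mapping, `num_categories` and `num_chart_types` reproduce the twelve categories and thirty types stated above.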
2. Dataset Construction
ChartComplete was constructed to ensure a uniform class distribution and maximized visual diversity. The collection strategy was strictly taxonomy-driven, with an explicit target of 50 high-quality images per chart type, totaling 1,500 images.
Collection Methodology
Image sources included:
- Statista's "Chart of the Day": 12,635 images scraped on January 27, 2025.
- Our World in Data: 4,113 images scraped through "Browse by Topic."
- Manual Web Collection: Used to fill types insufficiently represented in automated scraping.
Automated pre-filtering involved extracting deep visual features using Google ViT (vit-base-patch16-224) and indexing with FAISS for efficient nearest-neighbor retrieval. A random image was sampled, its 100 nearest neighbors were retrieved, and manual inspection was performed to select images of the correct chart type until the target of 50 images per class was reached. Manual collection completed categories with insufficient scraped samples.
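The sample-then-retrieve step can be illustrated without the heavy dependencies. The sketch below is an assumption-laden stand-in, not the authors' pipeline: it uses brute-force cosine similarity over toy vectors in place of ViT embeddings and a FAISS index, the function names are hypothetical, and excluding the query image from its own neighbor list is a choice the source does not specify. The manual inspection step is left to the reader.

```python
import math
import random


def _unit(vector):
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


def retrieve_neighbors(embeddings, query_idx, k=100):
    """Indices of the k embeddings most cosine-similar to the query.

    Brute-force search; in a real pipeline an approximate or exact index
    (for example FAISS) would take the place of this loop. The query
    itself is excluded (an assumption).
    """
    query = _unit(embeddings[query_idx])
    scored = []
    for idx, emb in enumerate(embeddings):
        if idx == query_idx:
            continue
        similarity = sum(a * b for a, b in zip(query, _unit(emb)))
        scored.append((similarity, idx))
    # Highest similarity first; ties broken by lower index for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:k]]


def sample_and_retrieve(embeddings, k=100, rng=None):
    """Pick a random seed image and return (seed index, neighbor indices)."""
    rng = rng or random.Random()
    seed_idx = rng.randrange(len(embeddings))
    return seed_idx, retrieve_neighbors(embeddings, seed_idx, k)


# Toy 2-D "embeddings": the first three vectors point one way, the last two another.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0], [0.1, 0.9]]
closest_to_first = retrieve_neighbors(toy_embeddings, query_idx=0, k=2)
```

In the toy data the two neighbors of the first vector come from its own group, which is the property that lets a curator review one neighborhood and find many images of the same chart type.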
Provenance Breakdown
| Method | Count | % of Total |
|---|---|---|
| Manual | 951 | 63.4 % |
| Scraped | 549 | 36.6 % |
Class Balance and Metrics
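Class balance is strictly enforced: $N = 1{,}500$ images in total, $K = 30$ chart types, and $n_k = 50$ images per type. For this uniform distribution the Gini-Simpson diversity index is $1 - \sum_k p_k^2 = 1 - 1/30 \approx 0.967$, the maximum attainable with 30 classes, indicating high class uniformity.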
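The Gini-Simpson index follows directly from the per-class counts given in this section (30 chart types, 50 images each). A minimal sketch of the arithmetic, using the standard definition $1 - \sum_k p_k^2$; the function name and the skewed comparison counts are illustrative only:

```python
def gini_simpson(counts):
    """Gini-Simpson diversity: 1 minus the sum of squared class proportions."""
    total = sum(counts)
    return 1.0 - sum((count / total) ** 2 for count in counts)


# Balanced case described above: 30 chart types, 50 images each.
balanced_counts = [50] * 30
balanced_index = gini_simpson(balanced_counts)

# Hypothetical skewed split of the same 1,500 images, for contrast only.
skewed_counts = [1210] + [10] * 29
skewed_index = gini_simpson(skewed_counts)
```

For a uniform distribution over $K$ classes the index reduces to $1 - 1/K$, so the balanced case attains the upper bound for 30 classes, while any imbalance lowers it.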
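Visual feature diversity was measured using t-SNE on ViT embeddings, producing well-delineated clusters by chart type. A Centered Kernel Alignment (CKA) heatmap quantified pairwise feature similarity, with
$$\mathrm{CKA}(K, L) = \frac{\mathrm{HSIC}(K, L)}{\sqrt{\mathrm{HSIC}(K, K)\,\mathrm{HSIC}(L, L)}},$$
where $K$ and $L$ are the kernel (Gram) matrices of the two feature sets being compared and HSIC denotes the Hilbert–Schmidt independence criterion. Most chart types show low inter-class visual feature similarity.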
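For intuition, CKA with a linear kernel reduces to a ratio of Frobenius norms of centered feature products. The sketch below assumes a linear kernel and two feature matrices with the same number of rows; the source does not state which kernel was used or how samples were paired across chart types, so this illustrates the measure rather than reproducing the paper's heatmap. All function names are hypothetical.

```python
import math


def center_columns(matrix):
    """Subtract each column's mean so every feature has zero mean."""
    n_rows = len(matrix)
    means = [sum(column) / n_rows for column in zip(*matrix)]
    return [[value - mean for value, mean in zip(row, means)] for row in matrix]


def cross_product_sq_norm(a, b):
    """Squared Frobenius norm of a^T b for row-major matrices with equal row counts."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            dot = sum(x * y for x, y in zip(col_a, col_b))
            total += dot * dot
    return total


def linear_cka(x, y):
    """Linear-kernel CKA between feature matrices x and y (one row per sample).

    The HSIC normalization constants cancel in the ratio, so unnormalized
    quantities are used. Raises ZeroDivisionError if either matrix has no
    variance after centering.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of rows")
    xc, yc = center_columns(x), center_columns(y)
    hsic_xy = cross_product_sq_norm(xc, yc)
    hsic_xx = cross_product_sq_norm(xc, xc)
    hsic_yy = cross_product_sq_norm(yc, yc)
    return hsic_xy / math.sqrt(hsic_xx * hsic_yy)


features = [[1.0, 2.0], [3.0, 1.0], [0.0, 4.0], [2.0, 2.0]]
scaled = [[3.0 * v for v in row] for row in features]
rotated = [[-b, a] for a, b in features]  # 90-degree rotation of each row
self_similarity = linear_cka(features, features)
```

Linear CKA equals 1 for identical features and is unchanged by isotropic scaling or orthogonal transformation of either input, which is why it is convenient for comparing representations.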
3. Data Format and Annotations
The dataset is organized as a directory hierarchy:
- Root Level: 12 category folders (e.g., “Area,” “Bar,” “Circle”)
- Category Folders: One subfolder per chart type (e.g., “VanillaArea,” “StackedArea”)
- Chart-Type Folders: 50 image files, each prefixed by "collected" (manual) or "scraped" (automated) to indicate provenance
All files are in JPEG or PNG format, with a resolution of at least 300 px in each dimension (no upper bound). The chart-type label is implicit in the folder structure; provenance (scraped vs. collected) is encoded in the filename. No JSON labels, bounding boxes, underlying data tables, or augmented annotations are provided.
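Because labels live in the directory layout, a loader only needs to walk the tree. The following is a minimal sketch under stated assumptions: the three-level `category/chart_type/file` layout described above, filenames that begin with "collected" or "scraped", and the usual `.jpg`, `.jpeg`, and `.png` extensions. The function name `index_dataset` and the record fields are hypothetical, and the exact filename pattern in the released dataset may differ.

```python
from pathlib import Path

# Assumed extensions for the JPEG and PNG files mentioned above.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def index_dataset(root):
    """Build one record per image from a category/chart_type/file layout."""
    records = []
    for path in sorted(Path(root).glob("*/*/*")):
        if not path.is_file() or path.suffix.lower() not in IMAGE_SUFFIXES:
            continue  # skip subdirectories and non-image files
        name = path.name.lower()
        if name.startswith("collected"):
            provenance = "collected"  # gathered manually
        elif name.startswith("scraped"):
            provenance = "scraped"  # gathered by automated scraping
        else:
            provenance = "unknown"  # prefix not recognized
        records.append({
            "path": path,
            "category": path.parent.parent.name,
            "chart_type": path.parent.name,
            "provenance": provenance,
        })
    return records
```

A call such as `index_dataset("ChartComplete")` (the root folder name is a placeholder) would yield records that can be grouped by `chart_type` to confirm the 50-images-per-type balance.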
4. Applications and Evaluation
ChartComplete is primarily intended as a benchmark for evaluating the chart-type classification capabilities of MLLMs, vision-LLMs, and specialized chart QA or summarization systems.
The dataset addresses a major limitation of prior chart benchmarks, which are restricted to a narrow selection of 5–7 types, by providing coverage of 30 types, including uncommon and composite charts. No end-to-end classification or question-answering baselines are reported by the authors. Instead, the paper details clustering experiments using t-SNE on ViT embeddings and reports a CKA heatmap analysis, indicating that the chart types have distinct, separable feature representations.
The dataset enables researchers to test model generalization across a broad range of visualization conventions and motivates the creation of more nuanced, chart-focused multimodal evaluation protocols.
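Since no baselines are reported, how to score a model is left to the user. One plausible protocol, sketched here as an assumption rather than a method from the paper, is per-class accuracy averaged over chart types; the function names and the example labels are illustrative.

```python
from collections import Counter


def per_class_accuracy(y_true, y_pred):
    """Fraction of correctly predicted images for each true chart type."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


def macro_accuracy(y_true, y_pred):
    """Unweighted mean of the per-class accuracies."""
    scores = per_class_accuracy(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Illustrative labels only; these are not results from any model.
example_true = ["Sankey", "Sankey", "Heat Map", "Heat Map"]
example_pred = ["Sankey", "Flow Diagram", "Heat Map", "Heat Map"]
example_scores = per_class_accuracy(example_true, example_pred)
example_macro = macro_accuracy(example_true, example_pred)
```

Because every chart type has the same number of images, macro accuracy over the full dataset coincides with overall accuracy; the per-class breakdown is what reveals which uncommon types a model confuses.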
5. Limitations and Future Extensions
ChartComplete currently includes only chart images with taxonomy labels and provenance. Key absences include:
- No question–answer pairs, data tables, or other direct “learning signal.”
- No vector/CSV underlying data, bounding boxes for chart elements, or OCR output.
- Metadata is limited: only "collected" vs. "scraped" provenance; no publisher, timestamp, or additional attribution.
The authors propose several directions for dataset extension:
- Chart–data alignment: Addition of underlying data tables (e.g., CSV, JSON) for each chart.
- Semantic annotations: Inclusion of axis, tick, and legend bounding boxes; OCR transcripts.
- Downstream benchmarks: Establishing ChartQA tasks, chart summarization, and chart-based fact-checking benchmarks.
- Community contributions: Soliciting expansion of chart types, modalities, and QA pairs while retaining the current taxonomy structure.
ChartComplete is distributed under a CC BY license, and all code for collection and processing is available at https://github.com/AI-DSCHubAUB/ChartComplete-Dataset/, facilitating community adoption and further development (Mustapha et al., 15 Jan 2026).