Chart Grounding: Data & Visual Extraction
- Chart grounding is the process of converting chart images into machine-readable structured data and visual attributes, enabling accurate chart interpretation.
- It employs methods like data table extraction, color mapping, and legend localization to reconstruct chart semantics and support downstream tasks.
- Specialized architectures and benchmarks enhance grounding fidelity, facilitating tasks such as chart-to-code reconstruction and multi-chart comparison.
Chart grounding is the process of extracting both the underlying structured data and associated visual-semantic attributes from a chart image. The goal is to enable precise, interpretable, and automatable understanding of charts by decomposing their visual structure into machine-readable representations. Chart grounding is foundational for a range of downstream chart analysis tasks, including chart-to-table extraction, chart-to-code reconstruction, element localization, visually grounded question answering, and multi-chart comparison. The field has crystallized around benchmarks, evaluation protocols, and specialized modeling approaches that emphasize structural fidelity, attribute extraction, fine-grained alignment, and robust visual reasoning, especially in the face of complex chart designs and domain shifts (Bansal et al., 30 Oct 2025).
1. Formal Definition, Task Scope, and Benchmarks
Chart grounding requires converting a raster image of a chart into a structured output that typically includes:
- Tabular data: every row-column-value triplet reflecting the data depicted in the chart
- Visual attributes: categorical color encodings, legend spatial position (from a discrete grid), and region-level text-style parameters (font size in points, plus font family and weight as string labels)
Formally, chart grounding can be written as the mapping
$$G: I \mapsto (T, A),$$
where $I$ is the chart image, $T$ is the extracted data table, and $A$ is the set of visual/structural attributes (Bansal et al., 30 Oct 2025).
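A minimal sketch of this decomposition in Python (the container and function names below are illustrative assumptions, not the schema of any cited benchmark):

```python
from dataclasses import dataclass

@dataclass
class GroundingResult:
    """Illustrative container for the grounding output (T, A)."""
    table: list[dict]            # T: row-column-value records, e.g. {"Country": "USA", "Year": 2020, "Value": 5.4}
    color_map: dict[str, str]    # A: series/category -> hex color
    legend_position: str         # A: one cell of a discrete spatial grid, e.g. "upper right"
    text_style: dict[str, dict]  # A: text region -> {"size": ..., "weight": ..., "family": ...}

def ground_chart(image_path: str) -> GroundingResult:
    """Hypothetical grounding function G: chart image -> (T, A)."""
    raise NotImplementedError  # stands in for a VLM or a specialized extraction model
```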
Benchmarks such as ChartAB (Bansal et al., 30 Oct 2025), ChartAnchor (Li et al., 30 Nov 2025), and ChartGen (Kondic et al., 31 May 2025) extend this basic task to include chart-to-code mapping (i.e., from a chart image to executable plotting code), dense element localization (bounding box of every visual mark), and multi-chart dense alignment. Chart grounding is thus not only about element-wise labeling but also about holistic symbolic reconstruction—capturing semantics, layout, and perceptual consistency simultaneously.
2. Grounding Subtasks and Representation Schemes
Chart grounding is decomposed into several subtasks with precisely defined input-output schemas:
- Data Grounding: Given a chart image, extract the underlying CSV data table:
```
Country,Year,Value
USA,2020,5.4
China,2020,7.1
India,2020,2.8
```

- Color Grounding: JSON mapping from legend/category to hex or RGB color code:

```json
{"color_map": [{"series": "Product A", "color": "#FF5733"}, ...]}
```

- Legend Grounding: Detect legend position from a 9-cell spatial grid:

```json
{"legend_position": "upper right"}
```

- Text-Style Grounding: For chart textual regions (title, legend, axis labels/ticks), identify font size (int), weight (light, normal, bold), and family (Bansal et al., 30 Oct 2025):

```json
{
  "title": {"size": 14, "weight": "bold", "family": "Times New Roman"},
  ...
}
```
Several benchmarks require chart-to-code translation, where the output is an executable script whose rendering closely reproduces the original chart (Li et al., 30 Nov 2025, Kondic et al., 31 May 2025).
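For the bar chart behind the CSV example above, a chart-to-code output might look like the following minimal matplotlib sketch (written here for illustration; it is not output from any of the cited systems):

```python
import matplotlib.pyplot as plt

# Grounded data and attributes (mirroring the CSV and style examples above).
countries = ["USA", "China", "India"]
values = [5.4, 7.1, 2.8]

fig, ax = plt.subplots(figsize=(5, 3))
ax.bar(countries, values, color="#FF5733", label="Value")
ax.set_title("Value by Country, 2020", fontsize=14, fontweight="bold", fontfamily="Times New Roman")
ax.set_ylabel("Value")
ax.legend(loc="upper right")  # legend position taken from the grounding output
fig.savefig("reconstructed_chart.png", dpi=150)
```

Evaluation then checks that such a script executes and that its rendering matches the source chart in data, text components, layout, and color.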
Knowledge-graph-based representations such as ChartKG explicitly model visual elements, variable correspondences, visual encodings, and high-level insights (e.g., higherThan, trendOverTime) as directed triples in a semantic graph (Zhou et al., 13 Oct 2024). This graph-based schema anchors raw perceptual elements to semantic roles and supports retrieval as well as structured question answering.
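As a toy illustration of this idea, the triples below encode a few of the relations named above; the node naming and lookup helper are assumptions, not the ChartKG format:

```python
# Directed (subject, relation, object) triples anchoring visual marks to semantic roles.
triples = [
    ("bar:China_2020", "encodes", "value:7.1"),
    ("bar:China_2020", "higherThan", "bar:USA_2020"),
    ("series:Value", "trendOverTime", "increasing"),
]

def lookup(relation: str) -> list[tuple[str, str]]:
    """Retrieve all (subject, object) pairs for a relation, a lookup-style QA primitive."""
    return [(s, o) for s, r, o in triples if r == relation]

print(lookup("higherThan"))  # [('bar:China_2020', 'bar:USA_2020')]
```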
3. Evaluation Metrics and Fidelity Assessment
Evaluation of chart grounding is multi-dimensional, encompassing data fidelity, semantic and stylistic correctness, spatial accuracy, and perceptual similarity:
- Data Table Extraction: SCRM (structural precision/recall/F1 of cell matches) (Bansal et al., 30 Oct 2025)
- Color Grounding: Mean RGB error over all series
- Legend and Text-Style: Discrete classification accuracy
- Chart-to-Code: Blend of execution success, tuple-based F1 for data, text component match (legend, title, axes), layout alignment, and CIEDE2000 color difference (Li et al., 30 Nov 2025, Kondic et al., 31 May 2025)
- Bounding Box/Localization: Intersection over Union (IoU), $\mathrm{IoU} = \frac{|B_{\text{pred}} \cap B_{\text{gt}}|}{|B_{\text{pred}} \cup B_{\text{gt}}|}$; a computation sketch for this and the color metric follows this list
- Dense Alignment (Chart Pairs): Decomposition into key identification and value precision, followed by an overall alignment score on a 0-10 scale (Bansal et al., 30 Oct 2025).
- Semantic Graphs: Subgraph isomorphism and tuple matching, as well as downstream QA accuracy (often >90% for knowledge-graph lookups) (Zhou et al., 13 Oct 2024).
- Qualitative Fidelity: CLIPScore embedding similarity and visual inspection of rendered outputs (Li et al., 30 Nov 2025).
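A minimal computation sketch for the localization and color metrics referenced in this list (box and color conventions are assumptions; benchmark implementations may differ):

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection over Union for (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def mean_rgb_error(pred_colors, gt_colors):
    """Mean absolute per-channel RGB error across series (0-255 scale)."""
    pred, gt = np.asarray(pred_colors, dtype=float), np.asarray(gt_colors, dtype=float)
    return float(np.abs(pred - gt).mean())

# CIEDE2000 differences can be computed on Lab values, e.g. with skimage.color.deltaE_ciede2000.
print(iou((10, 10, 50, 50), (30, 30, 70, 70)))           # ~0.143
print(mean_rgb_error([[255, 87, 51]], [[240, 90, 60]]))  # 9.0
```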
4. Modeling Approaches and Architectures
Chart grounding models employ a spectrum of architectural paradigms:
- Encoder-Decoder with Chart-Specific Pretraining: Models such as UniChart (Masry et al., 2023) and ChartAssistant (Meng et al., 4 Jan 2024) use a Swin Transformer vision encoder fused with a BART or LLaMA-based decoder, trained on chart-to-table, chart-to-text, and chart QA tasks.
- Two-Stage Extraction-and-Alignment: A first stage grounds atomic elements, the second stage aligns and compares (for multi-chart tasks); this separation reduces hallucinations (Bansal et al., 30 Oct 2025).
- Chain-of-Thought with Visual Reflection: ChartPoint's PointCoT and ChartSketcher's Sketch-CoT interleave reasoning steps with explicit visual annotation (bounding boxes, sketches) fed back into the model for iterative refinement (Xu et al., 29 Nov 2025, Huang et al., 25 May 2025).
- Agentic Pipelines: Multi-agent and tool-augmented models (e.g., ChartAgent, ChartCitor, Socratic Chart) decompose grounding and reasoning into submodules or actions, invoking cropping, segmentation, code extraction, or SVG generation as needed (Kaur et al., 6 Oct 2025, Goswami et al., 3 Feb 2025, Ji et al., 14 Apr 2025).
- Code-driven Reconstruction: Models such as ChartGen and ChartReasoner translate chart images to executable code and/or structured ECharts specifications, often using post-hoc or back-propagated reasoning for supporting tasks (Kondic et al., 31 May 2025, Jia et al., 11 Jun 2025).
- Pixel-to-Sequence with Explicit Localization: RefChartQA and DOGE inject coordinate tokens or bounding box tags directly into the output stream, allowing autoregressive models to jointly emit answers and supporting regions (Vogel et al., 29 Mar 2025, Zhou et al., 26 Nov 2024); a parsing sketch of such an output stream follows this list.
- Component Segmentation and Deformable Attention: Approaches such as ChartFormer isolate fine-grained chart components (bars, axes, legends), fusing them with question representations using question-guided deformable co-attention for robust grounding (Zheng et al., 19 Jul 2024).
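To make the pixel-to-sequence idea above concrete, the snippet below parses a hypothetical output stream that interleaves an answer with a bounding-box tag; the tag syntax is purely illustrative and does not reproduce the token scheme of RefChartQA or DOGE:

```python
import re

# Hypothetical model output interleaving an answer with a grounded supporting region.
output = "The 2020 value for China is <answer>7.1</answer> <box>312,148,365,420</box>"

answer = re.search(r"<answer>(.*?)</answer>", output).group(1)
x1, y1, x2, y2 = map(int, re.search(r"<box>(\d+),(\d+),(\d+),(\d+)</box>", output).groups())
print(answer, (x1, y1, x2, y2))  # 7.1 (312, 148, 365, 420)
```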
5. Empirical Findings, Failure Modes, and Model Limitations
Experimental results across recent benchmarks highlight strengths and persistent challenges:
- Complexity Effect: Data and color grounding performance drops sharply on 3D, radar, box, and multi-axis charts (<3/10 alignment score) versus bar/line (>6/10) (Bansal et al., 30 Oct 2025).
- Fine-Grained Attributes: Text-style extraction is especially poor (<20% accuracy for font size/family), and models systematically misclassify legend positions depending on architecture and pretraining bias (Bansal et al., 30 Oct 2025).
- Color Discrimination Weakness: Median RGB errors >50 indicate weak color shade discrimination (Bansal et al., 30 Oct 2025); CIEDE2000 color fidelity for chart-to-code rarely exceeds 40% (Li et al., 30 Nov 2025).
- Scaling Laws: Model scaling improves alignment and grounding (except for text-style attributes), but architectural differences remain decisive (Li et al., 30 Nov 2025).
- Spatial Reasoning Gaps: Depth perception, legend/text perturbation, 3D alignment, and implicit visual cues (rose/polar charts) remain limiting factors (Bansal et al., 30 Oct 2025).
- QA Performance Tied to Grounding: Better grounding improves downstream QA accuracy by up to 20%; poor grounding increases hallucinations (Bansal et al., 30 Oct 2025, Vogel et al., 29 Mar 2025).
- Code Reconstruction: Even top open-weight VLMs plateau at ≈0.58 data fidelity and ≈7.5/10 image similarity on chart-to-code tasks (Kondic et al., 31 May 2025).
| Chart Type | Simple (Bar/Line) | Complex (3D/etc.) |
|-----------------|-------------------|-------------------|
| Alignment Score | >6/10 | <3/10 |
Failures typically stem from low-resolution or atypical charts, overlapping elements, mishandling of implicit labels, and underrepresentation of rare chart types. Grounding is more robust to color changes than to legend/text-style variation, and specialized module integration (e.g., chart-type-specific segmentation) can mitigate some weaknesses (Bansal et al., 30 Oct 2025, Zheng et al., 19 Jul 2024).
6. Recommendations and Emerging Directions
Contemporary literature recommends the following to advance chart grounding:
- Explicit Grounding Modules: Integrate high-accuracy OCR, visual element detectors, and JSON-based template outputs. Architectures should favor modular grounding primitives rather than end-to-end black boxes.
- Synthetic Data Augmentation: Use large-scale synthetic chart corpora to expand color palettes, spatial/legend/text-style variation, and chart-type diversity (Bansal et al., 30 Oct 2025, Kondic et al., 31 May 2025).
- Spatial Reasoning Priors: Incorporate priors for 3D geometry, polar coordinates, and complex layouts (Bansal et al., 30 Oct 2025, Ji et al., 14 Apr 2025).
- Instruction Tuning on Structure-Aware Outputs: Fine-tune models to reliably emit structured JSON or code representations suitable for downstream automation (Bansal et al., 30 Oct 2025, Li et al., 30 Nov 2025).
- Hybrid Objectives: Combine code, table, and perceptual targets to align symbolic, numerical, and stylistic fidelity (Li et al., 30 Nov 2025).
- Scene-Graph and Knowledge-Graph Representations: Adopt graph-based grounding for fine-grained retrieval, fact attribution, and interpretable QA over chart images (Zhou et al., 13 Oct 2024).
- Interactive and Dynamic Chart Support: Progress toward segmenting and grounding elements in interactive, animated, or dashboard-type charts remains open.
- Enhanced Localization: Move beyond bounding boxes toward segmentation masks and key-point grounding for non-rectilinear and dense layouts (Vogel et al., 29 Mar 2025).
- Agentic and Tool-Augmented Reasoning: Tool-callable MLLMs that reason iteratively and manipulate the visual input via cropping, segmentation, and SVG annotation exhibit superior robustness to missing labels and perturbations (Kaur et al., 6 Oct 2025, Ji et al., 14 Apr 2025).
Future work is expected to focus on increasing numerical and semantic precision, unifying chart-to-code and chart-to-table under shared optimization, and developing explicit alignment modules that bridge raw vision and symbolic output (Li et al., 30 Nov 2025, Kondic et al., 31 May 2025, Bansal et al., 30 Oct 2025).