Chart-to-Text Task: Advances and Challenges
- Chart-to-text is the process of generating natural language summaries from chart images by extracting and aligning visual and textual data.
- Modern methods employ transformer-based models, sophisticated OCR, and multimodal fusion to improve accuracy and minimize hallucinations.
- Key challenges include ensuring factual consistency, handling complex chart patterns, and mitigating biases while enhancing accessibility and real-time performance.
The chart-to-text task concerns the automatic generation of natural language summaries or alternative textual descriptions that faithfully represent the content and insights embedded in chart images. This domain spans the extraction of chart data and semantics, reasoning about visual and textual features, and the synthesis of concise, informative, and accessible textual output. Research progress is characterized by the evolution from rule-based systems to end-to-end deep learning and large vision–language models, with an increasing focus on data diversity, factual grounding, accessibility, and equitable narrative generation.
1. Task Definition and Problem Landscape
The principal goal of the chart-to-text task is to consume a chart—typically in the form of a raster image (PNG or similar) or structured visual representation (scene graph, data table)—and produce a natural language description summarizing its patterns, statistics, and insights. Output may range from alt-texts optimized for accessibility to semantically rich, multi-sentence narratives suitable for data journalism or scientific reporting. Challenges central to this task include:
- Extracting and aligning multimodal information from both graphical and textual elements (e.g., bar shapes, colors, axis labels, legends).
- Correctly identifying core patterns (trends, outliers, comparisons) rather than surface visual features.
- Handling a wide spectrum of chart types (e.g., bar, line, pie, stacked, area, scatter, and more specialized forms).
- Minimizing hallucinated content and ensuring factual consistency between input and generated text.
Corpus size, input format (image-only versus data table or both), and chart complexity (simple vs. multi-series, annotated or textless) further stratify the problem setup.
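As a concrete illustration of the problem setup, the following record pairs a chart input with its target summary; the file path and all values are invented for exposition and are not drawn from any cited dataset.

```python
# Illustrative chart-to-text instance; all paths and values are invented for exposition.
example = {
    "chart_image": "charts/smartphone_share_2023.png",  # hypothetical raster input
    "chart_type": "bar",
    "data_table": {
        "columns": ["Brand", "Market share (%)"],
        "rows": [["Brand A", 28.4], ["Brand B", 23.1], ["Brand C", 11.7]],
    },
    # Target output: a faithful, multi-sentence natural language summary.
    "summary": (
        "Brand A led the market in 2023 with a 28.4% share, roughly five points "
        "ahead of Brand B, while Brand C trailed at 11.7%."
    ),
}
print(example["summary"])
```

Depending on the benchmark, a system may receive only the image, only the table, or both, and is expected to produce the summary field.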
2. Algorithmic Approaches and System Architectures
Architectural innovation for chart-to-text has followed a clear trajectory:
- Early Pipeline Approaches: Systems such as Chart-Text utilize cascaded modules for chart type classification (MobileNet CNN), text detection (Faster R-CNN), OCR (Tesseract), and bespoke image processing for bar/pie chart region analysis. Extracted data is slotted into rigid text templates (Balaji et al., 2018).
- Encoder–Decoder Models: Data-to-text and table-to-text formulations use transformer-based encoder–decoder architectures. Encoder inputs are linearized records, typically tuples containing column headers, values, positional indices, and chart type (a minimal linearization sketch follows this list). Key advances include variable substitution to reduce hallucinations and explicit content selection modules that learn which records should be verbalized (Obeid et al., 2020).
- Vision–Language Multimodal Pretraining: Models such as ChartReader (Cheng et al., 2023), UniChart (Masry et al., 2023), and ChartT5 (Zhou et al., 2023) use pre-trained backbones (e.g., Swin Transformers for image encoding, BART or T5 for language decoding), frequently augmented with structural or positional embeddings derived from chart layouts. Cross-modal fusion is realized via concatenation of visual and textual features, transformer attention layers, or plug-in projector modules (as in ChartAdapter (Xu et al., 30 Dec 2024)).
- Advanced Alignments: Recent, large-scale systems (ChartAssistant (Meng et al., 4 Jan 2024), AskChart (Yang et al., 26 Dec 2024)) leverage mixture-of-experts (MoE) architectures, custom chart-to-table pretraining, and sophisticated chain-of-thought instruction-tuning. MoE layers dynamically select expert submodules for modality-specific reasoning, maintaining efficiency and high generalization.
- Instruction-tuned LLMs: ChartLlama (Han et al., 2023), utilizing extensive synthetic instruction–chart–summary datasets, adapts LLaVA-1.5 and CLIP-based encoders to handle arbitrarily diverse chart styles and instruction types.
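To make the record linearization used in the encoder–decoder line of work concrete, the sketch below flattens a data table into per-cell records carrying header, value, position, and chart type. The delimiters and field names are illustrative assumptions, not the exact Chart2Text format.

```python
def linearize_records(columns, rows, chart_type):
    """Flatten a chart's data table into a sequence of textual records.

    Each cell becomes a record carrying its column header, value, row/column
    indices, and the chart type, so a transformer encoder can attend over
    structure as well as content. Delimiters are illustrative only.
    """
    records = []
    for r, row in enumerate(rows):
        for c, value in enumerate(row):
            records.append(
                f"<rec> header={columns[c]} value={value} row={r} col={c} type={chart_type} </rec>"
            )
    return " ".join(records)


# Example usage with a toy bar-chart table.
print(linearize_records(
    columns=["Year", "Revenue"],
    rows=[[2021, 3.2], [2022, 4.1]],
    chart_type="bar",
))
```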
Techniques for explicit reasoning over visual features (mChartQA (Wei et al., 2 Apr 2024)) and plug-in scene graph or data table parsing (VisText (Tang et al., 2023)) further expand system capabilities to encompass color-sensitive, textless, and highly complex chart layouts.
3. Datasets and Benchmarking
Research progress is underpinned by increasingly large and diverse datasets:
| Dataset/Benchmark | Chart Types | Size | Distinctive Features / Use Cases |
|---|---|---|---|
| Statista / Pew (C2T) | Bar, line, pie | 44,096 | Table+image+summary triplets, annotation via OCR+human |
| ChartSumm | Bar, line, pie | 84,363 | Metadata-rich, multi-length summaries, Bengali extension |
| ChartSFT (ChartAst) | Broad (incl. radar, bubble) | 39M | Multitask, instruction-following, reasoning, large scale |
| ChartBank (AskChart) | Many | 7.5M | Visual prompts, OCR-aware, CoT for table translation |
| AutoChart | Bar, line, scatter | 10,232 | Analytical rhetorical moves, Brownian trend synthesis |
| VisText | Bar, line, area | 12,441 | Scene graph, data table, rasterized images, error taxonomy |
This expansion in scale and task diversity (including multi-sentence summaries, question answering, and visual reasoning) drives generalization and robustness. Chart-to-text evaluations increasingly employ BLEU, ROUGE, CIDEr, Content Selection, and human evaluation scores, sometimes augmented with custom content-alignment or semantically sensitive measures.
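For reference, a minimal scoring sketch for generated summaries, assuming the third-party sacrebleu and rouge_score packages (other metrics such as CIDEr and Content Selection require additional tooling):

```python
# Minimal automatic-metric sketch for chart summaries; requires `sacrebleu` and `rouge_score`.
import sacrebleu
from rouge_score import rouge_scorer

predictions = ["Sales rose steadily from 2019 to 2022, peaking at 4.1M."]
references = ["Sales increased every year between 2019 and 2022, reaching 4.1M."]

# Corpus-level BLEU over the prediction/reference pairs.
bleu = sacrebleu.corpus_bleu(predictions, [references])

# Sentence-level ROUGE-1 / ROUGE-L F-measures.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(references[0], predictions[0])

print(f"BLEU: {bleu.score:.2f}")
print(f"ROUGE-1 F1: {rouge['rouge1'].fmeasure:.3f}, ROUGE-L F1: {rouge['rougeL'].fmeasure:.3f}")
```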
4. Key Methodological Innovations
Advances in the chart-to-text domain rest on several critical methodological innovations:
- Variable Substitution: Replacing raw chart value mentions with data variables during training/inference, enabling models to ground numeric facts and reduce hallucinations (Obeid et al., 2020, Cheng et al., 2023); a toy substitution sketch appears after this list.
- Position and Type Embeddings: Encoders for visual elements embed positional, typological, and appearance features, ensuring sequential and spatial relationships are preserved (Cheng et al., 2023).
- Scene Text Copy Mechanisms: Models such as ChartT5 embed “OCR sentinels,” allowing the system to effectively constrain output tokens to those verified by scene text detection (Zhou et al., 2023).
- Chart-Specific Pretraining: Objectives include masked header/value prediction (MHP/MVP), chart-to-table translation, and low- and high-level reasoning tasks, improving both low-level extraction and high-level summarization (Masry et al., 2023, Zhou et al., 2023, Meng et al., 4 Jan 2024).
- Mixture of Experts (MoE) Routing: Expert selection per token or modality in AskChart enables efficient representation while preserving performance, outperforming monolithic model architectures (Yang et al., 26 Dec 2024); a generic routing sketch appears at the end of this section.
- Visual–Textual Alignment and Enhanced Prompting: Systems like ChartAdapter (Xu et al., 30 Dec 2024) leverage cross-modal alignment projectors and learnable query vectors to fuse and decode latent semantic features for downstream summarization.
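To illustrate the variable-substitution idea from the first bullet above, a toy sketch follows: the decoder emits placeholders that are resolved against the data table after generation, so numbers are copied rather than free-generated. The VAL[row][col] placeholder syntax is an assumption for exposition, not the notation used in the cited systems.

```python
import re

def substitute_variables(template_text, data_table):
    """Resolve placeholder variables emitted by the decoder against the table.

    Placeholders of the form VAL[row][col] are replaced with the corresponding
    table cell, so numeric facts are copied from the chart data instead of being
    free-generated (and potentially hallucinated) by the language model.
    """
    def resolve(match):
        row, col = int(match.group(1)), int(match.group(2))
        return str(data_table[row][col])

    return re.sub(r"VAL\[(\d+)\]\[(\d+)\]", resolve, template_text)


table = [["Germany", 83.2], ["France", 67.8]]
decoded = "Germany recorded VAL[0][1] million, ahead of France at VAL[1][1] million."
print(substitute_variables(decoded, table))
# -> "Germany recorded 83.2 million, ahead of France at 67.8 million."
```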
Formally, common training losses combine an auto-regressive sequence generation term with auxiliary expert-routing and alignment objectives; an illustrative form (the exact auxiliary terms and weights vary by system) is:
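$$
\mathcal{L} \;=\; \underbrace{-\sum_{t=1}^{T} \log p_\theta\!\left(y_t \mid y_{<t},\, \mathbf{x}\right)}_{\text{auto-regressive generation}} \;+\; \lambda_{\text{route}}\,\mathcal{L}_{\text{route}} \;+\; \lambda_{\text{align}}\,\mathcal{L}_{\text{align}},
$$

where $\mathbf{x}$ denotes the multimodal chart input (image, OCR text, and/or table), $\mathcal{L}_{\text{route}}$ is an auxiliary load-balancing term for MoE expert routing, $\mathcal{L}_{\text{align}}$ is a cross-modal (e.g., contrastive) alignment term, and the $\lambda$ weights are system-specific hyperparameters.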
Other key approaches include chain-of-thought (CoT) augmentation for enhanced stepwise reasoning, text and visual prompt pairing for multi-turn instruction following, and rigorous ablation on component efficacy.
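The MoE routing noted above can be sketched generically as a top-k gated feed-forward layer; the PyTorch module below is a simplified, generic illustration, not the AskChart implementation (layer sizes and the gating scheme are assumptions).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    """Generic top-k mixture-of-experts feed-forward layer (illustrative only)."""

    def __init__(self, d_model=512, d_hidden=1024, num_experts=4, k=2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, num_experts)  # router producing expert logits
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):  # x: (num_tokens, d_model)
        logits = self.gate(x)                           # (num_tokens, num_experts)
        weights, indices = logits.topk(self.k, dim=-1)  # each token routed to its top-k experts
        weights = F.softmax(weights, dim=-1)            # normalize the selected gate scores
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e            # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(8, 512)         # 8 token representations
print(TopKMoELayer()(tokens).shape)  # torch.Size([8, 512])
```

Only the selected experts are evaluated per token, which is what keeps inference cost roughly constant as expert count grows.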
5. Challenges, Limitations, and Open Issues
Despite notable improvements, several limitations persist:
- Factual Consistency: Even state-of-the-art systems exhibit hallucinations, mislabeling, or failure to align descriptions with chart content—especially under noisy OCR or unusual design variations (Kantharaj et al., 2022, Rahman et al., 2023).
- Complex Reasoning and Trends: Models often struggle to interpret and explain complex or non-obvious patterns (e.g., multi-series interactions, marginal outliers, causality, or trend stability) (Kantharaj et al., 2022, Tang et al., 2023).
- Robustness to Visual Diversity: The performance gap remains pronounced when transferring from synthetic/procedurally generated charts to visually enhanced or domain-specific real-world charts (Balaji et al., 2018).
- Systemic Bias: Recent work identifies systemic geo-economic sentiment bias in generated summaries, where VLMs assign more positive language to charts labeled with high-income versus low-/middle-income country designators, even for graphically identical inputs (Mahbub et al., 13 Aug 2025).
- Resource Requirements: Many top-performing systems require substantial computational resources, and large parameter counts remain a barrier to real-time or edge deployment, although techniques such as MoE routing are mitigating these concerns (Yang et al., 26 Dec 2024).
- Grounded Reasoning: While variable substitution and table-based grounding have reduced hallucinations, end-to-end systems (especially those relying on images/OCR alone) still find it difficult to maintain accurate, interpretable reasoning across long or multi-turn outputs.
Mitigation attempts—such as adversarial prompt “distractors” to nudge sentiment, or multi-agent system proposals—only partially resolve these limitations and highlight the need for robust, dataset-level or architectural bias control.
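As a crude illustration of surface-level numeric faithfulness checking (a heuristic sketch, not a method from the cited works), one can flag numbers in a generated summary that match no value in the source table; real factual-consistency evaluation additionally requires entity matching, units, and derived statistics.

```python
import re

def unsupported_numbers(summary, table_values, rel_tol=0.01):
    """Return numbers mentioned in a summary that do not match any table value.

    A crude hallucination probe: any numeric token in the generated text that is
    not within `rel_tol` of some ground-truth value is flagged.
    """
    mentioned = [float(m) for m in re.findall(r"-?\d+(?:\.\d+)?", summary)]
    flagged = []
    for num in mentioned:
        if not any(abs(num - v) <= rel_tol * max(abs(v), 1e-9) for v in table_values):
            flagged.append(num)
    return flagged

table_values = [28.4, 23.1, 11.7]
summary = "Brand A led with 28.4% share while Brand C held roughly 15% of the market."
print(unsupported_numbers(summary, table_values))  # -> [15.0] (an unsupported figure)
```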
6. Practical Impact, Applications, and Future Directions
Chart-to-text methodologies have direct implications for accessibility, data journalism, scientific publishing, and analytical tool development:
- Accessibility: Automated chart description systems (e.g., Chart-Text (Balaji et al., 2018), ChartAdapter (Xu et al., 30 Dec 2024)) support screen readers and enable visually impaired users to engage with data representations that are otherwise inaccessible.
- Interactive and Embedded Analytics: Descriptive text generation enables integration with business dashboards, research tools, and reporting pipelines. Instruction-following and multi-turn support foster conversational and exploratory data analysis (Han et al., 2023, Meng et al., 4 Jan 2024).
- Localization and Task-Driven Generation: Optimization for accessibility and regional bias correction (ChartOptimiser (Wang et al., 14 Apr 2025), ChartifyText (Zhang et al., 18 Oct 2024)) allows for context-sensitive chart design and summary narrative generation, covering linguistic, cognitive, and cultural factors.
- Explainability and Visualization Design: Simulation of human-like scanpaths and memory decay (Chartist (Shi et al., 5 Feb 2025)) can inform explainable AI and visualization design, enabling feedback on which chart components are most influential in driving interpretation.
Emerging research directions include:
- Bias Mitigation and Equity: Systematic, data- and model-level interventions to guard against geo-economic and other societal biases (Mahbub et al., 13 Aug 2025).
- Unified Multimodal Reasoning: Systems that align textual, layout, and graphical cues in a domain-agnostic fashion—especially via joint vision–language–table modeling (Meng et al., 4 Jan 2024, Cheng et al., 2023).
- Multilingual and Cross-Domain Generalization: Expansion to new languages (ChartSumm Bengali evaluation (Rahman et al., 2023)) and chart types, robust generalization to domain-specific, graphically diverse, or highly compositional visualizations.
- Human-in-the-Loop and Mixed-Initiative Systems: Leveraging human input for semantic customization and interactive refinement of generated captions to reduce residual error and unwanted bias (Tang et al., 2023).
- Data-Efficient and Low-Resource Approaches: Research on reducing pretraining requirements, improving model compactness, and maintaining accuracy across resource-constrained deployments (Yang et al., 26 Dec 2024).
7. Summary Table of Recent Chart-to-Text Systems
| System / Paper | Core Technique | Notable Features / Innovations |
|---|---|---|
| Chart-Text (Balaji et al., 2018) | CNN + Faster R-CNN, OCR, templates | End-to-end alt-text, image processing, supports 5 chart types |
| Chart2Text (Obeid et al., 2020) | Transformer encoder–decoder | Variable substitution, explicit content selection, new dataset |
| ChartReader (Cheng et al., 2023) | Transformer, positional embeddings | Center/keypoint grouping, rule-free detection, data vars |
| ChartT5 (Zhou et al., 2023) | Mask R-CNN, multi-modal encoder–decoder | Masked header/value prediction, Scene Text Copy |
| ChartAdapter (Xu et al., 30 Dec 2024) | Query vectors, cross-modal projection | Contrastive alignment, hierarchical training, joint tuning |
| ChartAst (Meng et al., 4 Jan 2024) | BART, Donut/SPHINX backbone | Chart-to-table pre-training, multi-task instruction tuning |
| AskChart (Yang et al., 26 Dec 2024) | LLM (MoE), vision encoder, OCR | MoE routing, explicit textual cue enhancement, ChartBank data |
| ChartLlama (Han et al., 2023) | LLaVA-1.5, ViT-L/14, LoRA | Synthetic GPT-4 data, long summary support, task diversity |
| mChartQA (Wei et al., 2 Apr 2024) | Vision encoder, cross-attention, LLM | Dual-phase alignment/reasoning, handling of color/textless charts |
| ChartifyText (Zhang et al., 18 Oct 2024) | LLM-based data inference/generation | Uncertainty, sentiment, missing data encoding, text-to-chart |
| VisText (Tang et al., 2023) | Scene graph/data table/ByT5/VL-T5 | Semantic prefix-tuning, error taxonomy, multi-level captions |
This table elucidates the distinctive contributions and technical strategies of each system, grounding advances in their architectural and data innovations.
The chart-to-text landscape continues to evolve rapidly, driven by advances in multimodal modeling, dataset scale, and unified task formulations. The integration of bias analysis and mitigation, cross-modal fusion, efficient model scaling, and human-in-the-loop refinement represents the next frontier in generating fair, faithful, and accessible natural language summaries from visual data representations.