EvoChart: A Benchmark and a Self-Training Approach Towards Real-World Chart Understanding (2409.01577v2)
Abstract: Chart understanding enables automated data analysis for humans, but it requires models to achieve highly accurate visual comprehension. While existing Visual LLMs (VLMs) have shown progress in chart understanding, the lack of high-quality training data and comprehensive evaluation benchmarks hinders further improvement. In this paper, we introduce EvoChart, a novel self-training method for generating synthetic chart data to enhance VLMs' capabilities in real-world chart comprehension. We also propose EvoChart-QA, a novel benchmark for measuring models' chart comprehension abilities in real-world scenarios. Specifically, EvoChart is a self-training data synthesis approach that simultaneously produces a high-quality training corpus and a high-performance chart understanding model. EvoChart-QA consists of 650 distinct real-world charts collected from 140 different websites and 1,250 expert-curated questions focused on chart understanding. Experiments with various open-source and proprietary VLMs on EvoChart-QA show that even the best proprietary model, GPT-4o, achieves only 49.8% accuracy, whereas the EvoChart method significantly boosts the performance of open-source VLMs on real-world chart understanding, reaching 54.2% accuracy on EvoChart-QA.
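The abstract describes EvoChart as an iterative self-training pipeline that jointly yields a synthetic training corpus and a stronger chart model, but it does not spell out the loop. The following is a minimal sketch of such a self-training loop under our own assumptions; every function name (synthesize_charts, generate_qa, filter_by_consistency, finetune) is a hypothetical placeholder, not the paper's actual API, and the verification rule is illustrative only.

```python
# Hypothetical sketch of an iterative self-training loop for synthetic chart QA data.
# Each round: synthesize chart specs, let the current model propose QA pairs,
# keep only pairs that can be verified against the spec, then retrain on the corpus.

import random
from dataclasses import dataclass


@dataclass
class ChartSample:
    spec: dict       # chart specification (type, values, labels)
    question: str
    answer: str


def synthesize_charts(n: int, seed: int = 0) -> list[dict]:
    """Placeholder: produce n random chart specifications."""
    rng = random.Random(seed)
    return [{"type": rng.choice(["bar", "line", "pie"]),
             "values": [rng.randint(1, 100) for _ in range(5)]}
            for _ in range(n)]


def generate_qa(model, spec: dict) -> ChartSample:
    """Placeholder: have the current model propose a QA pair for a chart spec."""
    answer = str(max(spec["values"]))
    return ChartSample(spec, "What is the maximum value shown?", answer)


def filter_by_consistency(samples: list[ChartSample]) -> list[ChartSample]:
    """Placeholder: keep only samples whose answer is verifiable from the spec."""
    return [s for s in samples if s.answer == str(max(s.spec["values"]))]


def finetune(model, corpus: list[ChartSample]):
    """Placeholder: fine-tune the chart-understanding model on the corpus."""
    return model  # actual training is elided in this sketch


def self_train(model, rounds: int = 3, charts_per_round: int = 100):
    corpus: list[ChartSample] = []
    for r in range(rounds):
        specs = synthesize_charts(charts_per_round, seed=r)
        candidates = [generate_qa(model, s) for s in specs]
        corpus.extend(filter_by_consistency(candidates))  # keep verified QA only
        model = finetune(model, corpus)                    # model improves each round
    return model, corpus


if __name__ == "__main__":
    trained_model, corpus = self_train(model=None)
    print(f"Collected {len(corpus)} verified chart QA samples.")
```

The point of the sketch is the coupling the abstract emphasizes: the model that is being trained is also the generator of its next round of data, so corpus quality and model quality improve together across rounds.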