
ChartX & ChartVLM: A Versatile Benchmark and Foundation Model for Complicated Chart Reasoning (2402.12185v3)

Published 19 Feb 2024 in cs.CV

Abstract: Many versatile Multi-modal LLMs (MLLMs) have emerged recently. However, their capacity to query information depicted in visual charts and to reason over the queried contents remains under-explored. In this paper, to comprehensively and rigorously benchmark the abilities of off-the-shelf MLLMs in the chart domain, we construct ChartX, a multi-modal evaluation set covering 18 chart types, 7 chart tasks, 22 disciplinary topics, and high-quality chart data. In addition, we develop ChartVLM to offer a new perspective on handling multi-modal tasks that strongly depend on interpretable patterns, such as reasoning tasks over charts or geometric images. We evaluate the chart-related abilities of mainstream MLLMs and our ChartVLM on the proposed ChartX evaluation set. Extensive experiments demonstrate that ChartVLM surpasses both versatile and chart-specific large models, achieving results comparable to GPT-4V. We believe our study can pave the way for further exploration toward a more comprehensive chart evaluation set and more interpretable multi-modal models. Both ChartX and ChartVLM are available at: https://github.com/UniModal4Reasoning/ChartVLM

Authors (11)
  1. Renqiu Xia
  2. Bo Zhang
  3. Hancheng Ye
  4. Xiangchao Yan
  5. Qi Liu
  6. Hongbin Zhou
  7. Zijun Chen
  8. Min Dou
  9. Botian Shi
  10. Junchi Yan
  11. Yu Qiao
Citations (30)