
ChartInstruct: Instruction Tuning for Chart Comprehension and Reasoning (2403.09028v1)

Published 14 Mar 2024 in cs.CL

Abstract: Charts provide visual representations of data and are widely used for analyzing information, answering queries, and conveying insights. Various chart-related downstream tasks have emerged recently, such as question answering and summarization. A common strategy for solving these tasks is to fine-tune models originally trained on vision-language tasks. However, such task-specific models are not capable of solving a wide range of chart-related tasks, constraining their real-world applicability. To overcome these challenges, we introduce ChartInstruct: a novel chart-specific vision-language instruction-following dataset comprising 191K instructions generated over 71K charts. We then present two distinct systems for instruction tuning on such datasets: (1) an end-to-end model that connects a vision encoder for chart understanding with a LLM; and (2) a pipeline model that employs a two-step approach, first extracting chart data tables and then feeding them into the LLM. In experiments on four downstream tasks, we first show the effectiveness of our models, achieving a new set of state-of-the-art results. Further evaluation shows that our instruction-tuning approach supports a wide array of real-world chart comprehension and reasoning scenarios, thereby expanding the scope and applicability of our models to new kinds of tasks.
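
The two-step "pipeline" system described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function names, the linearized table format, and the toy data are all assumptions, and the chart-to-table step is stubbed out where a real system would run a chart derendering model.

```python
# Hypothetical sketch of the pipeline system: (1) extract the chart's
# underlying data table, (2) combine the table with the user instruction
# into a text prompt for a text-only LLM. All names are illustrative.

def extract_data_table(chart_image_path: str) -> list[dict]:
    """Stand-in for a chart-to-table extraction model.
    Here we simply return a fixed toy table."""
    return [
        {"year": 2020, "sales": 120},
        {"year": 2021, "sales": 150},
        {"year": 2022, "sales": 180},
    ]

def table_to_text(table: list[dict]) -> str:
    """Linearize the table into flat text an LLM can consume."""
    header = " | ".join(table[0].keys())
    rows = [" | ".join(str(v) for v in row.values()) for row in table]
    return "\n".join([header] + rows)

def build_prompt(instruction: str, table: list[dict]) -> str:
    """Combine the instruction and the extracted table into one prompt."""
    return f"Data table:\n{table_to_text(table)}\n\nInstruction: {instruction}"

prompt = build_prompt(
    "In which year were sales highest?",
    extract_data_table("example_chart.png"),
)
print(prompt)
```

Separating extraction from reasoning lets the second stage reuse any off-the-shelf LLM, at the cost of propagating any errors made in the table-extraction step.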

Authors (5)
  1. Ahmed Masry
  2. Mehrad Shahmohammadi
  3. Md Rizwan Parvez
  4. Enamul Hoque
  5. Shafiq Joty
Citations (22)