ChartAssistant: A Universal Chart Multimodal Language Model via Chart-to-Table Pre-training and Multitask Instruction Tuning (2401.02384v3)

Published 4 Jan 2024 in cs.CV

Abstract: Charts play a vital role in data visualization, understanding data patterns, and informed decision-making. However, their unique combination of graphical elements (e.g., bars, lines) and textual components (e.g., labels, legends) poses challenges for general-purpose multimodal models. While vision-language models trained on chart data excel at comprehension, they struggle to generalize. To address these challenges, we propose ChartAssistant, a chart-based vision-language model for universal chart comprehension and reasoning. ChartAssistant leverages ChartSFT, a comprehensive dataset covering diverse chart-related tasks over both basic (e.g., bar and pie) and specialized (e.g., radar and bubble) chart types. It undergoes a two-stage training process: pre-training on chart-to-table parsing to align charts with text, followed by multitask instruction-following fine-tuning. This approach enables ChartAssistant to achieve competitive performance across a variety of chart tasks. Experimental results demonstrate significant gains over the state-of-the-art UniChart and ChartLlama methods, particularly on real-world chart data in the zero-shot setting. The code and data are available at https://github.com/OpenGVLab/ChartAst.
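
To make the two-stage recipe concrete, the sketch below illustrates, in plain Python with made-up data, what each stage's training target might look like: stage 1 pairs a chart image with its linearized underlying table, and stage 2 pairs the same image and an instruction with a task-specific answer. The "|"-separated table format and the field names are assumptions for exposition, not the exact ChartSFT serialization.

```python
# Illustrative training targets for the two-stage recipe in the abstract.
# The "|"-separated table format and the dict keys below are assumptions,
# not the exact serialization used by ChartSFT or ChartAssistant.

# Underlying data of a hypothetical bar chart.
chart_data = {
    "title": "Quarterly Revenue",
    "x": ["Q1", "Q2", "Q3", "Q4"],
    "y": [12.0, 15.5, 14.2, 18.3],
}

def linearize_table(data: dict) -> str:
    """Stage-1 target: flatten the chart's data table into text, so the
    model learns to align graphical marks (bars) with their values."""
    rows = [f"{x} | {y}" for x, y in zip(data["x"], data["y"])]
    return "\n".join([data["title"], "Quarter | Revenue"] + rows)

# Stage 1: (chart image) -> linearized table.
print(linearize_table(chart_data))

# Stage 2: (chart image, instruction) -> answer, with samples drawn from
# many task types (QA, summarization, numerical reasoning, ...).
stage2_sample = {
    "instruction": "Which quarter had the highest revenue?",
    "answer": "Q4",
}
print(stage2_sample["instruction"], "->", stage2_sample["answer"])
```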

References (50)
  1. arXiv. https://arxiv.org/.
  2. Nougat: Neural optical understanding for academic documents. arXiv preprint arXiv:2308.13418, 2023.
  3. Sparks of artificial general intelligence: Early experiments with GPT-4. arXiv preprint arXiv:2303.12712, 2023.
  4. Shikra: Unleashing multimodal LLM’s referential dialogue magic. arXiv preprint arXiv:2306.15195, 2023.
  5. Recall and learn: Fine-tuning deep pretrained language models with less forgetting. arXiv preprint arXiv:2004.12651, 2020.
  6. Google. Bard. https://bard.google.com/, 2023.
  7. ChartLlama: A multimodal LLM for chart understanding and generation. arXiv preprint arXiv:2311.16483, 2023.
  8. Image captioning: Transforming objects into words. Advances in Neural Information Processing Systems, 32, 2019.
  9. Applying pragmatics principles for interaction with visual analytics. IEEE Transactions on Visualization and Computer Graphics, 24(1):309–318, 2017.
  10. Chart question answering: State of the art and future directions. In Computer Graphics Forum, pages 555–572. Wiley Online Library, 2022.
  11. Robert E Horn. Visual language. MacroVu Inc. Washington, 1998.
  12. SciCap: Generating captions for scientific figures. arXiv preprint arXiv:2110.11624, 2021.
  13. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2901–2910, 2017.
  14. OpenCQA: Open-ended question answering with charts. arXiv preprint arXiv:2210.06628, 2022a.
  15. Chart-to-text: A large-scale benchmark for chart summarization. arXiv preprint arXiv:2203.06486, 2022b.
  16. Donut: Document understanding transformer without OCR. arXiv preprint arXiv:2111.15664, 2021.
  17. OCR-free document understanding transformer. In European Conference on Computer Vision, pages 498–517. Springer, 2022.
  18. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  19. Pix2Struct: Screenshot parsing as pretraining for visual language understanding. In International Conference on Machine Learning, pages 18893–18912. PMLR, 2023.
  20. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461, 2019.
  21. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023.
  22. SciGraphQA: A large-scale synthetic multi-turn question-answering dataset for scientific graphs. arXiv preprint arXiv:2308.03349, 2023.
  23. SPHINX: The joint mixing of weights, tasks, and visual embeddings for multi-modal large language models. arXiv preprint arXiv:2311.07575, 2023.
  24. DePlot: One-shot visual language reasoning by plot-to-table translation. arXiv preprint arXiv:2212.10505, 2022a.
  25. MatCha: Enhancing visual language pretraining with math reasoning and chart derendering. arXiv preprint arXiv:2212.09662, 2022b.
  26. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023.
  27. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012–10022, 2021.
  28. Kosmos-2.5: A multimodal literate model. arXiv preprint arXiv:2309.11419, 2023.
  29. ChartQA: A benchmark for question answering about charts with visual and logical reasoning. arXiv preprint arXiv:2203.10244, 2022.
  30. UniChart: A universal vision-language pretrained model for chart comprehension and reasoning. arXiv preprint arXiv:2305.14761, 2023.
  31. PlotQA: Reasoning over scientific plots. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1527–1536, 2020.
  32. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
  33. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
  34. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
  35. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020.
  36. ChartSumm: A large-scale benchmark for chart-to-text summarization. PhD thesis, Department of Computer Science and Engineering (CSE), Islamic University of …, 2022.
  37. Tiny LVLM-eHub: Early multimodal experiments with Bard. arXiv preprint arXiv:2308.03729, 2023.
  38. VisText: A benchmark for semantically rich chart captioning. arXiv preprint arXiv:2307.05356, 2023.
  39. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
  40. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3156–3164, 2015.
  41. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
  42. ConvNeXt V2: Co-designing and scaling ConvNets with masked autoencoders. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16133–16142, 2023.
  43. LVLM-eHub: A comprehensive evaluation benchmark for large vision-language models. arXiv preprint arXiv:2306.09265, 2023.
  44. Forget me not: Reducing catastrophic forgetting for domain adaptation in reading comprehension. In 2020 International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE, 2020.
  45. Set-of-Mark prompting unleashes extraordinary visual grounding in GPT-4V. arXiv preprint arXiv:2310.11441, 2023.
  46. Transfer visual prompt generator across LLMs. arXiv preprint arXiv:2305.01278, 2023a.
  47. LLaMA-Adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:2303.16199, 2023b.
  48. GPT4RoI: Instruction tuning large language model on region-of-interest. arXiv preprint arXiv:2307.03601, 2023c.
  49. Enhanced chart understanding in vision and language task via cross-modal pre-training on plot table pairs. arXiv preprint arXiv:2305.18641, 2023.
  50. MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.
Authors (7)
  1. Fanqing Meng (14 papers)
  2. Wenqi Shao (89 papers)
  3. Quanfeng Lu (10 papers)
  4. Peng Gao (402 papers)
  5. Kaipeng Zhang (73 papers)
  6. Yu Qiao (563 papers)
  7. Ping Luo (340 papers)
Citations (35)