ChartAssistant: A Universal Chart Multimodal Language Model via Chart-to-Table Pre-training and Multitask Instruction Tuning (2401.02384v3)
Abstract: Charts play a vital role in data visualization, understanding data patterns, and informed decision-making. However, their unique combination of graphical elements (e.g., bars, lines) and textual components (e.g., labels, legends) poses challenges for general-purpose multimodal models. While vision-language models trained on chart data excel in comprehension, they struggle with generalization. To address these challenges, we propose ChartAssistant, a chart-based vision-language model for universal chart comprehension and reasoning. ChartAssistant leverages ChartSFT, a comprehensive dataset covering diverse chart-related tasks with basic (e.g., bars and pies) and specialized (e.g., radars and bubbles) chart types. It undergoes a two-stage training process, starting with pre-training on chart-to-table parsing to align chart and text, followed by multitask instruction-following fine-tuning. This approach enables ChartAssistant to achieve competitive performance across various chart tasks. Experimental results demonstrate significant performance gains over the state-of-the-art UniChart and ChartLlama methods, especially on real-world chart data in the zero-shot setting. The code and data are available at https://github.com/OpenGVLab/ChartAst.
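The two-stage recipe described in the abstract can be read as a simple training curriculum: first supervise the model on chart-to-table parsing so it learns to ground chart pixels in textual structure, then fine-tune it across many chart instruction tasks. The sketch below is a minimal, hypothetical Python outline of that curriculum; the `ChartVLM` protocol, the `training_step` method, the `chartsft` dataset keys, and the epoch counts are all illustrative assumptions and not the authors' implementation (see the repository linked above for the actual code).

```python
from typing import Protocol, Sequence, Tuple

# Illustrative hyperparameters; the abstract does not specify these.
PRETRAIN_EPOCHS = 1
SFT_EPOCHS = 3

class ChartVLM(Protocol):
    """Any vision-language model exposing a supervised training step
    on an (image, prompt, target-text) triple."""
    def training_step(self, image: bytes, prompt: str, target: str) -> float: ...

def run_stage(model: ChartVLM,
              examples: Sequence[Tuple[bytes, str, str]],
              epochs: int) -> None:
    """One curriculum stage: plain supervised training over the given triples."""
    for _ in range(epochs):
        for image, prompt, target in examples:
            # Assumed to compute a next-token loss and apply an optimizer update.
            model.training_step(image, prompt, target)

def train_chartassistant(model: ChartVLM, chartsft: dict) -> None:
    # Stage 1: chart-to-table parsing pre-training aligns graphical elements
    # (bars, lines, legends) with the textual structure of the underlying table.
    run_stage(model, chartsft["chart_to_table"], PRETRAIN_EPOCHS)

    # Stage 2: multitask instruction-following fine-tuning (QA, summarization,
    # numerical reasoning, ...) over basic and specialized chart types.
    run_stage(model, chartsft["instruction_tasks"], SFT_EPOCHS)
```

The ordering is the point of the design: per the abstract, parsing charts into tables first aligns the chart and text modalities, so the subsequent instruction tuning can build reasoning tasks on top of a model that can already read every element of the chart.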
- Nougat: Neural optical understanding for academic documents. arXiv preprint arXiv:2308.13418, 2023.
- Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712, 2023.
- Shikra: Unleashing multimodal llm’s referential dialogue magic. arXiv preprint arXiv:2306.15195, 2023.
- Recall and learn: Fine-tuning deep pretrained language models with less forgetting. arXiv preprint arXiv:2004.12651, 2020.
- Google. Bard. https://bard.google.com/, 2023.
- Chartllama: A multimodal llm for chart understanding and generation. arXiv preprint arXiv:2311.16483, 2023.
- Image captioning: Transforming objects into words. Advances in neural information processing systems, 32, 2019.
- Applying pragmatics principles for interaction with visual analytics. IEEE transactions on visualization and computer graphics, 24(1):309–318, 2017.
- Chart question answering: State of the art and future directions. In Computer Graphics Forum, pages 555–572. Wiley Online Library, 2022.
- Robert E Horn. Visual language. MacroVu Inc., Washington, 1998.
- Scicap: Generating captions for scientific figures. arXiv preprint arXiv:2110.11624, 2021.
- Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2901–2910, 2017.
- Opencqa: Open-ended question answering with charts. arXiv preprint arXiv:2210.06628, 2022a.
- Chart-to-text: A large-scale benchmark for chart summarization. arXiv preprint arXiv:2203.06486, 2022b.
- Donut: Document understanding transformer without ocr. arXiv preprint arXiv:2111.15664, 2021.
- Ocr-free document understanding transformer. In European Conference on Computer Vision, pages 498–517. Springer, 2022.
- Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Pix2struct: Screenshot parsing as pretraining for visual language understanding. In International Conference on Machine Learning, pages 18893–18912. PMLR, 2023.
- Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461, 2019.
- Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023.
- Scigraphqa: A large-scale synthetic multi-turn question-answering dataset for scientific graphs. arXiv preprint arXiv:2308.03349, 2023.
- Sphinx: The joint mixing of weights, tasks, and visual embeddings for multi-modal large language models. arXiv preprint arXiv:2311.07575, 2023.
- Deplot: One-shot visual language reasoning by plot-to-table translation. arXiv preprint arXiv:2212.10505, 2022a.
- Matcha: Enhancing visual language pretraining with math reasoning and chart derendering. arXiv preprint arXiv:2212.09662, 2022b.
- Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023.
- Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021.
- Kosmos-2.5: A multimodal literate model. arXiv preprint arXiv:2309.11419, 2023.
- Chartqa: A benchmark for question answering about charts with visual and logical reasoning. arXiv preprint arXiv:2203.10244, 2022.
- Unichart: A universal vision-language pretrained model for chart comprehension and reasoning. arXiv preprint arXiv:2305.14761, 2023.
- Plotqa: Reasoning over scientific plots. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1527–1536, 2020.
- Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
- Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
- Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
- Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020.
- ChartSumm: A large-scale benchmark for chart-to-text summarization. PhD thesis, Department of Computer Science and Engineering (CSE), Islamic University of …, 2022.
- Tiny lvlm-ehub: Early multimodal experiments with bard. arXiv preprint arXiv:2308.03729, 2023.
- Vistext: A benchmark for semantically rich chart captioning. arXiv preprint arXiv:2307.05356, 2023.
- Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
- Show and tell: A neural image caption generator. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3156–3164, 2015.
- Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
- Convnext v2: Co-designing and scaling convnets with masked autoencoders. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16133–16142, 2023.
- Lvlm-ehub: A comprehensive evaluation benchmark for large vision-language models. arXiv preprint arXiv:2306.09265, 2023.
- Forget me not: Reducing catastrophic forgetting for domain adaptation in reading comprehension. In 2020 International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE, 2020.
- Set-of-mark prompting unleashes extraordinary visual grounding in gpt-4v. arXiv preprint arXiv:2310.11441, 2023.
- Transfer visual prompt generator across llms. arXiv preprint arXiv:2305.01278, 2023a.
- Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:2303.16199, 2023b.
- Gpt4roi: Instruction tuning large language model on region-of-interest. arXiv preprint arXiv:2307.03601, 2023c.
- Enhanced chart understanding in vision and language task via cross-modal pre-training on plot table pairs. arXiv preprint arXiv:2305.18641, 2023.
- Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.
Authors: Fanqing Meng, Wenqi Shao, Quanfeng Lu, Peng Gao, Kaipeng Zhang, Yu Qiao, Ping Luo