TinyChart: Efficient Chart Understanding with Visual Token Merging and Program-of-Thoughts Learning (2404.16635v1)

Published 25 Apr 2024 in cs.CV

Abstract: Charts are important for presenting and explaining complex data relationships. Recently, multimodal LLMs (MLLMs) have shown remarkable capabilities in various chart understanding tasks. However, the sheer size of these models in terms of parameters and computational requirements limits their use in resource-constrained environments. In this paper, we present TinyChart, an efficient MLLM for chart understanding with only 3B parameters. TinyChart overcomes two key challenges in efficient chart understanding: (1) reduce the burden of learning numerical computations through a Program-of-Thoughts (PoT) learning strategy, which trains the model to generate Python programs for numerical calculations, and (2) reduce lengthy vision feature sequences produced by the vision transformer for high-resolution images through a Vision Token Merging module, which gradually merges most similar vision tokens. Extensive experiments demonstrate that our 3B TinyChart achieves SOTA performance on a variety of chart understanding benchmarks including ChartQA, Chart-to-Text, Chart-to-Table, OpenCQA, and ChartX. It outperforms several chart understanding MLLM with up to 13B parameters such as ChartLlama and ChartAst, and close-sourced general-purpose MLLM GPT-4V on ChartQA. It also demonstrates its superior efficiency with higher throughput during inference due to a smaller model scale and more efficient vision encoding. Our code and model are available at https://github.com/X-PLUG/mPLUG-DocOwl/tree/main/TinyChart.

Efficient Multimodal LLM for Chart Understanding with Visual Token Merging and Program-of-Thoughts Learning

Overview

In this paper, Zhang et al. introduce TinyChart, a model targeting efficient multimodal chart understanding with a significantly more compact architecture than existing solutions, at only 3 billion parameters. Its efficiency rests on two techniques: Visual Token Merging and Program-of-Thoughts (PoT) learning. Extensive experiments show that TinyChart not only reduces computational demands but also surpasses models with up to 13 billion parameters across a suite of benchmarks.

Core Contributions

  1. TinyChart:
    • A streamlined model achieving state-of-the-art performance on various chart understanding benchmarks.
    • Demonstrates higher inference throughput thanks to its reduced scale and efficient encoding methods.
  2. Program-of-Thoughts Learning:
    • Enhances numerical computation abilities in chart understanding tasks.
    • A new dataset, ChartQA-PoT, supports PoT learning with both template and GPT-based generated programs.
  3. Visual Token Merging:
    • Proposes an efficient mechanism for handling high-resolution chart images by merging similar visual tokens, thereby keeping computational overhead under control (a minimal sketch of the merging step follows this list).
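
To make the merging idea concrete, below is a minimal NumPy sketch of bipartite token merging in the spirit of Token Merging (ToMe), which the paper's layer-wise visual token merging resembles. The function name, token counts, dimensions, and the simple unweighted averaging rule are illustrative assumptions, not taken from the released TinyChart code.

```python
# Illustrative sketch of ToMe-style bipartite token merging.
# Shapes and the averaging rule are simplifications of the actual method.
import numpy as np

def merge_tokens(tokens: np.ndarray, r: int) -> np.ndarray:
    """Merge the r most similar token pairs in a (num_tokens, dim) array."""
    # Split tokens into two alternating sets A and B (bipartite matching).
    a, b = tokens[0::2], tokens[1::2]

    # Cosine similarity between every token in A and every token in B.
    a_norm = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = b / np.linalg.norm(b, axis=1, keepdims=True)
    sim = a_norm @ b_norm.T                      # (|A|, |B|)

    # For each A token keep its best match in B, then take the
    # r highest-scoring edges overall.
    best_b = sim.argmax(axis=1)
    best_score = sim.max(axis=1)
    merged_a = np.argsort(-best_score)[:r]       # A tokens merged away
    kept_a = np.setdiff1d(np.arange(len(a)), merged_a)

    # Merge each selected A token into its matched B token by averaging.
    b = b.copy()
    for i in merged_a:
        j = best_b[i]
        b[j] = (b[j] + a[i]) / 2.0

    # The sequence shrinks by r tokens at this layer.
    return np.concatenate([a[kept_a], b], axis=0)

# Example: 64 vision tokens of dim 32, merge 16 per layer -> 48 remain.
x = np.random.randn(64, 32)
print(merge_tokens(x, r=16).shape)               # (48, 32)
```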

Technical Insights

  • Model Architecture:
    • TinyChart integrates a vision transformer encoder with token merging capabilities, a vision-language connector, and an LLM for text generation.
    • The visual encoder adopts a visual token merging strategy within each transformer layer, effectively reducing the length of feature sequences.
  • Program-of-Thoughts Learning:
    • Facilitates the generation of Python programs that carry out the numerical calculations a question requires, effectively teaching the model computational reasoning (a toy PoT-style program is sketched after this list).
    • The ChartQA-PoT dataset enriches training material with both manually curated templates and generative approaches using GPT models.
  • Efficiency and Performance:
    • The visual token merging significantly enhances the handling of high-resolution inputs without proportional increases in computation, a key factor in maintaining model efficiency.
    • TinyChart showcases superior performance metrics across several benchmarks, including ChartQA, Chart-to-Text, Chart-to-Table, and OpenCQA, consistently outperforming larger models.
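
To make the Program-of-Thoughts idea concrete, here is a toy example of the kind of short Python program such a model is trained to emit instead of predicting a numeric answer directly; the final answer comes from executing the program. The question, chart values, and variable names below are hypothetical, and the exact program format used in ChartQA-PoT may differ.

```python
# Toy PoT-style program (hypothetical question and values).
# Question: "How much higher are 2021 sales than the 2019-2020 average?"

# Values the model reads off the bar chart:
sales_2019 = 42.0
sales_2020 = 48.0
sales_2021 = 61.0

# Program the model generates to compute the answer:
average_2019_2020 = (sales_2019 + sales_2020) / 2
answer = sales_2021 - average_2019_2020

print(answer)  # 16.0 -> executed externally to produce the final answer
```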

Practical Implications and Future Research

The introduction of TinyChart and its underlying methodologies presents multiple avenues for further exploration and potential improvements in the field of multimodal machine learning:

  • Extended Applications:
    • The strategies employed can be adapted to other forms of visual data processing beyond chart understanding, potentially benefiting tasks in image recognition and visual media analysis.
  • Optimization of Token Merging:
    • Future models might explore dynamic token merging strategies that adjust based on context or content complexity, potentially offering even greater efficiency gains.
  • Advanced Program-of-Thoughts Learning:
    • Investigating more sophisticated program generation techniques and expanding the training data might improve the handling of complex numerical reasoning tasks and reduce dependency on predefined templates.

Overall, the results from this paper point not only toward more efficient multimodal LLMs but also toward significantly better applicability and performance in practical, resource-constrained scenarios. Such advancements could lead to broader deployment of advanced AI technologies in everyday applications, making them accessible to a wider range of devices and platforms.

Authors (8)
  1. Liang Zhang (357 papers)
  2. Anwen Hu (22 papers)
  3. Haiyang Xu (67 papers)
  4. Ming Yan (190 papers)
  5. Yichen Xu (40 papers)
  6. Qin Jin (94 papers)
  7. Ji Zhang (176 papers)
  8. Fei Huang (408 papers)
Citations (13)