TinyChart: Efficient Chart Understanding with Visual Token Merging and Program-of-Thoughts Learning (2404.16635v1)

Published 25 Apr 2024 in cs.CV

Abstract: Charts are important for presenting and explaining complex data relationships. Recently, multimodal LLMs (MLLMs) have shown remarkable capabilities in various chart understanding tasks. However, the sheer size of these models in terms of parameters and computational requirements limits their use in resource-constrained environments. In this paper, we present TinyChart, an efficient MLLM for chart understanding with only 3B parameters. TinyChart overcomes two key challenges in efficient chart understanding: (1) reduce the burden of learning numerical computations through a Program-of-Thoughts (PoT) learning strategy, which trains the model to generate Python programs for numerical calculations, and (2) reduce lengthy vision feature sequences produced by the vision transformer for high-resolution images through a Vision Token Merging module, which gradually merges most similar vision tokens. Extensive experiments demonstrate that our 3B TinyChart achieves SOTA performance on a variety of chart understanding benchmarks including ChartQA, Chart-to-Text, Chart-to-Table, OpenCQA, and ChartX. It outperforms several chart understanding MLLM with up to 13B parameters such as ChartLlama and ChartAst, and close-sourced general-purpose MLLM GPT-4V on ChartQA. It also demonstrates its superior efficiency with higher throughput during inference due to a smaller model scale and more efficient vision encoding. Our code and model are available at https://github.com/X-PLUG/mPLUG-DocOwl/tree/main/TinyChart.

Efficient Multimodal LLM for Chart Understanding with Visual Token Merging and Program-of-Thoughts Learning

Overview

In this paper, Zhang et al. introduce TinyChart, a model targeting efficient multimodal chart understanding with a significantly more compact architecture than existing solutions, at only 3 billion parameters. Its efficiency rests on two techniques: Visual Token Merging and Program-of-Thoughts (PoT) learning. Extensive experiments show that TinyChart not only reduces computational demands but also surpasses models with up to 13 billion parameters across a suite of benchmarks.

Core Contributions

  1. TinyChart:
    • A streamlined model achieving state-of-the-art performance on various chart understanding benchmarks.
    • Demonstrates higher inference throughput thanks to its reduced scale and efficient encoding methods.
  2. Program-of-Thoughts Learning:
    • Enhances numerical computation abilities in chart understanding tasks.
    • A new dataset, ChartQA-PoT, supports PoT learning with both template and GPT-based generated programs.
  3. Visual Token Merging:
    • Proposes an efficient mechanism for handling high-resolution chart images by merging similar visual tokens, thereby keeping computational overhead under control (a minimal sketch of the merging step follows this list).
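
To make the merging idea concrete, below is a minimal NumPy sketch of bipartite token merging in the spirit of Token Merging (ToMe), which the paper's layer-wise visual token merging resembles. The function name, token counts, dimensions, and the simple unweighted averaging rule are illustrative assumptions, not taken from the released TinyChart code.

```python
# Illustrative sketch of ToMe-style bipartite token merging.
# Shapes and the averaging rule are simplifications of the actual method.
import numpy as np

def merge_tokens(tokens: np.ndarray, r: int) -> np.ndarray:
    """Merge the r most similar token pairs in a (num_tokens, dim) array."""
    # Split tokens into two alternating sets A and B (bipartite matching).
    a, b = tokens[0::2], tokens[1::2]

    # Cosine similarity between every token in A and every token in B.
    a_norm = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = b / np.linalg.norm(b, axis=1, keepdims=True)
    sim = a_norm @ b_norm.T                      # (|A|, |B|)

    # For each A token keep its best match in B, then take the
    # r highest-scoring edges overall.
    best_b = sim.argmax(axis=1)
    best_score = sim.max(axis=1)
    merged_a = np.argsort(-best_score)[:r]       # A tokens merged away
    kept_a = np.setdiff1d(np.arange(len(a)), merged_a)

    # Merge each selected A token into its matched B token by averaging.
    b = b.copy()
    for i in merged_a:
        j = best_b[i]
        b[j] = (b[j] + a[i]) / 2.0

    # The sequence shrinks by r tokens at this layer.
    return np.concatenate([a[kept_a], b], axis=0)

# Example: 64 vision tokens of dim 32, merge 16 per layer -> 48 remain.
x = np.random.randn(64, 32)
print(merge_tokens(x, r=16).shape)               # (48, 32)
```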

Technical Insights

  • Model Architecture:
    • TinyChart integrates a vision transformer encoder with token merging capabilities, a vision-language connector, and an LLM for text generation.
    • The visual encoder adopts a visual token merging strategy within each transformer layer, effectively reducing the length of feature sequences.
  • Program-of-Thoughts Learning:
    • Facilitates the generation of Python programs that carry out the numerical calculations a question requires, effectively teaching the model computational reasoning (a toy PoT-style program is sketched after this list).
    • The ChartQA-PoT dataset enriches training material with both manually curated templates and generative approaches using GPT models.
  • Efficiency and Performance:
    • The visual token merging significantly enhances the handling of high-resolution inputs without proportional increases in computation, a key factor in maintaining model efficiency.
    • TinyChart showcases superior performance metrics across several benchmarks, including ChartQA, Chart-to-Text, Chart-to-Table, and OpenCQA, consistently outperforming larger models.
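
To make the Program-of-Thoughts idea concrete, here is a toy example of the kind of short Python program such a model is trained to emit instead of predicting a numeric answer directly; the final answer comes from executing the program. The question, chart values, and variable names below are hypothetical, and the exact program format used in ChartQA-PoT may differ.

```python
# Toy PoT-style program (hypothetical question and values).
# Question: "How much higher are 2021 sales than the 2019-2020 average?"

# Values the model reads off the bar chart:
sales_2019 = 42.0
sales_2020 = 48.0
sales_2021 = 61.0

# Program the model generates to compute the answer:
average_2019_2020 = (sales_2019 + sales_2020) / 2
answer = sales_2021 - average_2019_2020

print(answer)  # 16.0 -> executed externally to produce the final answer
```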

Practical Implications and Future Research

The introduction of TinyChart and its underlying methodologies presents multiple avenues for further exploration and potential improvements in the field of multimodal machine learning:

  • Extended Applications:
    • The strategies employed can be adapted to other forms of visual data processing beyond chart understanding, potentially benefiting tasks in image recognition and visual media analysis.
  • Optimization of Token Merging:
    • Future models might explore dynamic token merging strategies that adjust based on context or content complexity, potentially offering even greater efficiency gains.
  • Advanced Program-of-Thoughts Learning:
    • Investigating more sophisticated program generation techniques and expanding the training data might improve the handling of complex numerical reasoning tasks and reduce dependency on predefined templates.

Overall, the results from this paper point not only toward more efficient multimodal LLMs but also toward significantly better applicability and performance in practical, resource-constrained scenarios. Such advancements could lead to broader deployment of advanced AI technologies in everyday applications, making them accessible to a wider range of devices and platforms.

Authors (8)
  1. Liang Zhang (357 papers)
  2. Anwen Hu (22 papers)
  3. Haiyang Xu (67 papers)
  4. Ming Yan (190 papers)
  5. Yichen Xu (40 papers)
  6. Qin Jin (94 papers)
  7. Ji Zhang (176 papers)
  8. Fei Huang (408 papers)
Citations (13)