OneChart: Purify the Chart Structural Extraction via One Auxiliary Token (2404.09987v2)
Abstract: Chart parsing poses a significant challenge due to the diversity of styles, values, texts, and so forth. Even advanced large vision-language models (LVLMs) with billions of parameters struggle to handle such tasks satisfactorily. To address this, we propose OneChart: a reliable agent specifically devised for the structural extraction of chart information. Like popular LVLMs, OneChart incorporates an autoregressive main body. Uniquely, to enhance the reliability of the numerical parts of the output, we introduce an auxiliary token placed at the beginning of the token sequence, together with an additional decoder. This numerically optimized auxiliary token allows the subsequent chart-parsing tokens to capture enhanced numerical features through causal attention. Furthermore, with the aid of the auxiliary token, we devise a self-evaluation mechanism that lets the model gauge the reliability of its chart parsing results by assigning confidence scores to the generated content. Compared with current state-of-the-art (SOTA) chart parsing models, e.g., DePlot, ChartVLM, and ChartAst, OneChart significantly outperforms them in Average Precision (AP) for chart structural extraction across multiple public benchmarks, despite having only 0.2 billion parameters. Moreover, as a chart parsing agent, it also brings 10%+ accuracy gains to a popular LVLM (LLaVA-1.6) on the downstream ChartQA benchmark.
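The auxiliary-token mechanism described above (one extra token prepended to the sequence, whose hidden state feeds a separate numeric decoder, with later tokens attending to it causally) can be sketched as follows. This is a minimal, hypothetical illustration, not the paper's implementation: the dimensions, the toy single-head attention body, the MLP decoder, and all names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16          # hidden size (hypothetical)
MAX_VALUES = 4  # numbers regressed by the auxiliary head (hypothetical)

# Learned auxiliary embedding prepended to the sequence
# (assumption: a single extra token, as described in the abstract).
aux_embed = rng.normal(size=(1, D))

def causal_self_attention(x):
    """Toy single-head causal attention standing in for the autoregressive body."""
    scores = x @ x.T / np.sqrt(D)
    mask = np.tril(np.ones((len(x), len(x)), dtype=bool))
    scores = np.where(mask, scores, -np.inf)       # later tokens see earlier ones only
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

# Hypothetical numeric decoder: a small MLP on the auxiliary token's hidden state.
W1, W2 = rng.normal(size=(D, D)), rng.normal(size=(D, MAX_VALUES))
def decode_numbers(aux_hidden):
    return np.maximum(aux_hidden @ W1, 0.0) @ W2

chart_tokens = rng.normal(size=(5, D))           # stand-in vision/text embeddings
seq = np.concatenate([aux_embed, chart_tokens])  # auxiliary token goes first
hidden = causal_self_attention(seq)
values = decode_numbers(hidden[0])               # position 0 = auxiliary token
print(values.shape)  # (4,)
```

Because the auxiliary token sits at position 0, every subsequent chart-parsing token can attend to it under the causal mask, which is how its numeric features propagate into the generated output.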
- Human vs. machine eye for chart interpretation. In 2022 IEEE Region 10 Symposium (TENSYMP), pages 1–6. IEEE, 2022.
- Automatic chart understanding: a review. IEEE Access, 2023.
- ChartQA: A benchmark for question answering about charts with visual and logical reasoning. arXiv preprint arXiv:2203.10244, 2022.
- PlotQA: Reasoning over scientific plots. In 2020 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1516–1525, 2020.
- What is wrong with scene text recognition model comparisons? dataset and model analysis. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4715–4723, 2019.
- ChartOCR: Data extraction from chart images via a deep hybrid framework. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1917–1925, 2021.
- CornerNet: Detecting objects as paired keypoints. In Proceedings of the European Conference on Computer Vision (ECCV), pages 734–750, 2018.
- FeTaQA: Free-form table question answering. Transactions of the Association for Computational Linguistics, 10:35–49, 2022.
- TaPas: Weakly supervised table parsing via pre-training. arXiv preprint arXiv:2004.02349, 2020.
- Visual instruction tuning, 2023.
- Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023.
- NExT-Chat: An LMM for chat, detection and segmentation, 2023.
- MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.
- Vary: Scaling up the vision vocabulary for large vision-language models. arXiv preprint arXiv:2312.06109, 2023.
- Small language model meets with reinforced vision vocabulary. arXiv preprint arXiv:2401.12503, 2024.
- ChatSpot: Bootstrapping multimodal LLMs via precise referring instruction tuning. arXiv preprint arXiv:2307.09474, 2023.
- DreamLLM: Synergistic multimodal comprehension and creation. arXiv preprint arXiv:2309.11499, 2023.
- Merlin: Empowering multimodal llms with foresight minds. arXiv preprint arXiv:2312.00589, 2023.
- MatCha: Enhancing visual language pretraining with math reasoning and chart derendering. arXiv preprint arXiv:2212.09662, 2022.
- ChartAssistant: A universal chart multimodal language model via chart-to-table pre-training and multitask instruction tuning. arXiv preprint arXiv:2401.02384, 2024.
- ChartX & ChartVLM: A versatile benchmark and foundation model for complicated chart reasoning, 2024.
- Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
- StructChart: Perception, structuring, reasoning for visual chart understanding. arXiv preprint arXiv:2309.11268, 2023.
- LLaVA-NeXT: Improved reasoning, OCR, and world knowledge, January 2024.
- Data extraction from charts via single deep neural network. arXiv preprint arXiv:1906.11906, 2019.
- ChartReader: Automatic parsing of bar-plots. In 2021 IEEE 22nd International Conference on Information Reuse and Integration for Data Science (IRI), pages 318–325. IEEE, 2021.
- Reading and reasoning over chart images for evidence-based automated fact-checking. arXiv preprint arXiv:2301.11843, 2023.
- DePlot: One-shot visual language reasoning by plot-to-table translation. In Findings of the 61st Annual Meeting of the Association for Computational Linguistics, 2023.
- UniChart: A universal vision-language pretrained model for chart comprehension and reasoning. arXiv preprint arXiv:2305.14761, 2023.
- ChartLlama: A multimodal LLM for chart understanding and generation. arXiv preprint arXiv:2311.16483, 2023.
- Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140):1–67, 2020.
- Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
- Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022.
- Chart-to-Text: Generating natural language descriptions for charts by adapting the transformer model. arXiv preprint arXiv:2010.09142, 2020.
- MMC: Advancing multimodal chart understanding with large-scale instruction tuning. arXiv preprint arXiv:2311.10774, 2023.
- Liang Xu. NLPCC2019: Large-scale Chinese datasets for NLP. http://github.com/brightmart/nlp_chinese_corpus.
- OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.
- Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality. https://lmsys.org/blog/2023-03-30-vicuna/, 2023.
- Nougat: Neural optical understanding for academic documents. arXiv preprint arXiv:2308.13418, 2023.
- Vladimir I Levenshtein et al. Binary codes capable of correcting deletions, insertions, and reversals. In Soviet physics doklady, volume 10, pages 707–710. Soviet Union, 1966.
- Pix2Struct: Screenshot parsing as pretraining for visual language understanding. In International Conference on Machine Learning, pages 18893–18912. PMLR, 2023.
- mPLUG-DocOwl: Modularized multimodal large language model for document understanding. arXiv preprint arXiv:2307.02499, 2023.
- Qwen-VL: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023.
- Jinyue Chen
- Lingyu Kong
- Haoran Wei
- Chenglong Liu
- Zheng Ge
- Liang Zhao
- Jianjian Sun
- Chunrui Han
- Xiangyu Zhang