StructChart: On the Schema, Metric, and Augmentation for Visual Chart Understanding (2309.11268v5)
Abstract: Charts are common in the literature across scientific fields, conveying rich information in a form that is easily accessible to readers. Current chart-related tasks focus on either chart perception, which extracts information from chart images, or chart reasoning over the extracted data, e.g., in a tabular form. In this paper, we introduce StructChart, a novel framework that leverages Structured Triplet Representations (STR) to achieve a unified and label-efficient approach to chart perception and reasoning, one that generalizes to different downstream tasks beyond the question-answering task studied in prior work. Specifically, StructChart first reformulates chart data from the tabular form (linearized CSV) into STR, which effectively narrows the task gap between chart perception and reasoning. We then propose a Structuring Chart-oriented Representation Metric (SCRM) to quantitatively evaluate performance on the chart perception task. To augment training, we further explore the potential of LLMs to enhance diversity in both chart visual style and statistical content. Extensive experiments on various chart-related tasks demonstrate the effectiveness and potential of a unified chart perception-reasoning paradigm to push the frontier of chart understanding.
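To make the STR reformulation concrete, the sketch below shows one plausible way to convert a linearized CSV table (the output of chart perception) into triplets. This is a minimal sketch, assuming each data cell maps to an (entity, attribute, value) triplet, which is one common reading of "structured triplet representation"; the paper defines its own STR schema, and the function name and separators here are illustrative only.

```python
# Minimal sketch: linearized CSV -> triplets, assuming an
# (entity, attribute, value) schema. Illustrative, not the paper's exact STR.

def linearized_csv_to_triplets(linearized_csv: str,
                               row_sep: str = "\n",
                               col_sep: str = ",") -> list[tuple[str, str, str]]:
    """Turn a linearized CSV table extracted from a chart into triplets.

    Example input (one header row, then one row per data point):
        "Year,Sales,Profit\n2021,10,3\n2022,12,4"
    """
    rows = [r.split(col_sep) for r in linearized_csv.strip().split(row_sep)]
    header, body = rows[0], rows[1:]
    triplets = []
    for row in body:
        entity = row[0]  # row label, e.g. "2021"
        # Pair each remaining cell with its column header.
        for attr, value in zip(header[1:], row[1:]):
            triplets.append((entity, attr, value))
    return triplets


if __name__ == "__main__":
    csv_text = "Year,Sales,Profit\n2021,10,3\n2022,12,4"
    for t in linearized_csv_to_triplets(csv_text):
        print(t)
    # ('2021', 'Sales', '10'), ('2021', 'Profit', '3'),
    # ('2022', 'Sales', '12'), ('2022', 'Profit', '4')
```

One motivation for such a triplet form is that it is order-invariant at the cell level: a reasoning model (or a matching-based metric like SCRM) can compare predicted and ground-truth data as sets of triplets rather than as a single brittle linearized string.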
Authors: Renqiu Xia, Bo Zhang, Haoyang Peng, Hancheng Ye, Xiangchao Yan, Peng Ye, Botian Shi, Yu Qiao, Junchi Yan, Mingsheng Li