UniChart: A Universal Vision-language Pretrained Model for Chart Comprehension and Reasoning (2305.14761v3)
Abstract: Charts are very popular for analyzing data, visualizing key insights and answering complex reasoning questions about data. To facilitate chart-based data analysis using natural language, several downstream tasks have been introduced recently such as chart question answering and chart summarization. However, most of the methods that solve these tasks use pretraining on language or vision-language tasks that do not attempt to explicitly model the structure of the charts (e.g., how data is visually encoded and how chart elements are related to each other). To address this, we first build a large corpus of charts covering a wide variety of topics and visual styles. We then present UniChart, a pretrained model for chart comprehension and reasoning. UniChart encodes the relevant text, data, and visual elements of charts and then uses a chart-grounded text decoder to generate the expected output in natural language. We propose several chart-specific pretraining tasks that include: (i) low-level tasks to extract the visual elements (e.g., bars, lines) and data from charts, and (ii) high-level tasks to acquire chart understanding and reasoning skills. We find that pretraining the model on a large corpus with chart-specific low- and high-level tasks, followed by finetuning on three downstream tasks, results in state-of-the-art performance on all three.
- Character region awareness for text detection. CoRR, abs/1904.01941.
- Beagle: Automated extraction and interpretation of visualizations from the web. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, pages 1–8.
- Darwin Bautista and Rowel Atienza. 2022. Scene text recognition with permuted autoregressive sequence models. In European Conference on Computer Vision, pages 178–196, Cham. Springer Nature Switzerland.
- Anja Belz and Ehud Reiter. 2006. Comparing automatic and human evaluation of NLG systems. In 11th Conference of the European Chapter of the Association for Computational Linguistics, pages 313–320, Trento, Italy. Association for Computational Linguistics.
- D3: Data-driven documents. IEEE Transactions on Visualization & Computer Graphics (Proc. InfoVis).
- ChartInfo. 2022. Competition on harvesting raw tables from infographics.
- Vis30k: A collection of figures and tables from IEEE visualization conference publications. IEEE Transactions on Visualization and Computer Graphics.
- Uniter: Universal image-text representation learning. In European conference on computer vision, pages 104–120. Springer.
- Unifying vision-and-language tasks via text generation. In ICML.
- Visualizing for the non-visual: Enabling the visually impaired to use visualization. Computer Graphics Forum, 38.
- Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416.
- All that’s ‘human’ is not gold: Evaluating human evaluation of generated text. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 7282–7296, Online. Association for Computational Linguistics.
- Visimages: a corpus of visualizations in the images of visualization publications. arXiv preprint arXiv:2007.04584.
- Is gpt-3 a good data annotator? In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, ACL’23, Toronto, Canada. ACL.
- An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations.
- A survey of vision-language pre-trained models. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22, pages 5436–5443. International Joint Conferences on Artificial Intelligence Organization. Survey Track.
- Gptscore: Evaluate as you desire.
- Pal: Program-aided language models. arXiv preprint arXiv:2211.10435.
- Human-like summarization evaluation with chatgpt.
- News summarization and evaluation in the era of gpt-3.
- Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778.
- Image captioning: Transforming objects into words. Advances in Neural Information Processing Systems, 32.
- TaPas: Weakly supervised table parsing via pre-training. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4320–4333, Online. Association for Computational Linguistics.
- Chart question answering: State of the art and future directions. Journal of Computer Graphics Forum (Proc. EuroVis), pages 555–572.
- Vivo: Visual vocabulary pre-training for novel object captioning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 1575–1583.
- Layoutlmv3: Pre-training for document ai with unified text and image masking. In Proceedings of the 30th ACM International Conference on Multimedia, MM ’22, page 4083–4091, New York, NY, USA. Association for Computing Machinery.
- Seeing out of the box: End-to-end pre-training for vision-language representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12976–12985.
- Pixel-bert: Aligning image pixels with text by deep multi-modal transformers. arXiv preprint arXiv:2004.00849.
- Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning, pages 4904–4916. PMLR.
- Dvqa: Understanding data visualizations via question answering. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 5648–5656.
- Figureqa: An annotated figure dataset for visual reasoning. 6th International Conference on Learning Representations, ICLR 2018 - Workshop Track Proceedings, pages 1–20.
- Opencqa: Open-ended question answering with charts. In Proceedings of EMNLP (to appear).
- A diagram is worth a dozen images. In European conference on computer vision, pages 235–251. Springer.
- Ocr-free document understanding transformer. In European Conference on Computer Vision, pages 498–517. Springer.
- Vilt: Vision-and-language transformer without convolution or region supervision. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 5583–5594. PMLR.
- Klaus Krippendorff. 2011. Computing krippendorff’s alpha-reliability.
- Neural text summarization: A critical evaluation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 540–551, Hong Kong, China. Association for Computational Linguistics.
- Pix2struct: Screenshot parsing as pretraining for visual language understanding. arXiv preprint arXiv:2210.03347.
- BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. CoRR, abs/1910.13461.
- Unicoder-vl: A universal encoder for vision and language by cross-modal pre-training. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 11336–11344.
- Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In ICML.
- Align before fuse: Vision and language representation learning with momentum distillation. In NeurIPS.
- Visualbert: A simple and performant baseline for vision and language.
- What does BERT with vision look at? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5265–5275, Online. Association for Computational Linguistics.
- TruthfulQA: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3214–3252, Dublin, Ireland. Association for Computational Linguistics.
- Deplot: One-shot visual language reasoning by plot-to-table translation. arXiv preprint arXiv:2212.10505.
- Matcha: Enhancing visual language pretraining with math reasoning and chart derendering. arXiv preprint arXiv:2212.09662.
- G-eval: Nlg evaluation using gpt-4 with better human alignment.
- Revisiting the gold standard: Grounding summarization evaluation with robust human evaluation. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, ACL’23, Toronto, Canada. ACL.
- Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022.
- Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in neural information processing systems, 32.
- Alan Lundgard and Arvind Satyanarayan. 2021. Accessible visualization via natural language descriptions: A four-level model of semantic content. IEEE transactions on visualization and computer graphics, 28(1):1073–1083.
- Chartocr: Data extraction from charts images via a deep hybrid framework. 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1916–1924.
- Chatgpt as a factual inconsistency evaluator for text summarization.
- Show me: Automatic presentation for visual analysis. IEEE transactions on visualization and computer graphics, 13(6):1137–1144.
- Linecap: Line charts for data visualization captioning models. In 2022 IEEE Visualization and Visual Analytics (VIS), pages 35–39. IEEE.
- ChartQA: A benchmark for question answering about charts with visual and logical reasoning. In Findings of the Association for Computational Linguistics: ACL 2022, pages 2263–2279, Dublin, Ireland. Association for Computational Linguistics.
- Infographicvqa. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1697–1706.
- Plotqa: Reasoning over scientific plots. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV).
- Tamara Munzner. 2014. Visualization Analysis and Design. CRC Press.
- Jason Obeid and Enamul Hoque. 2020. Chart-to-text: Generating natural language descriptions for charts by adapting the transformer model. In Proceedings of the 13th International Conference on Natural Language Generation, pages 138–147, Dublin, Ireland. Association for Computational Linguistics.
- OpenAI. 2022. Chatgpt: Optimizing language models for dialogue.
- Training language models to follow instructions with human feedback.
- Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics.
- Is chatgpt a general-purpose natural language processing task solver?
- Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR.
- Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.
- Squad: 100,000+ questions for machine comprehension of text. CoRR, abs/1606.05250.
- Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc.
- Vega-lite: A grammar of interactive graphics. IEEE transactions on visualization and computer graphics, 23(1):341–350.
- Chart-to-text: A large-scale benchmark for chart summarization. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 2022.
- Calliope: Automatic visual data story generation from a spreadsheet. IEEE Transactions on Visualization and Computer Graphics, 27(2):453–463.
- Generation-focused table-based intermediate pre-training for free-form question answering. Proceedings of the AAAI Conference on Artificial Intelligence, 36(10):11312–11320.
- Climbing mont BLEU: The strange world of reachable high-BLEU translations. In Proceedings of the 19th Annual Conference of the European Association for Machine Translation, pages 269–281.
- Andrea Spreafico and Giuseppe Carenini. 2020. Neural data-driven captioning of time-series line charts. In Proceedings of the International Conference on Advanced Visual Interfaces, AVI ’20, New York, NY, USA. Association for Computing Machinery.
- Striking a balance: Reader takeaways and preferences when integrating text and charts. IEEE Transactions on Visualization and Computer Graphics, 29(1):1233–1243.
- Vl-bert: Pre-training of generic visual-linguistic representations. In International Conference on Learning Representations.
- Hao Tan and Mohit Bansal. 2019. Lxmert: Learning cross-modality encoder representations from transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing.
- An awkward disparity between BLEU / RIBES scores and human judgements in machine translation. In Proceedings of the 2nd Workshop on Asian Translation (WAT2015), pages 74–81, Kyoto, Japan. Workshop on Asian Translation.
- Unifying vision, text, and layout for universal document processing. arXiv preprint arXiv:2212.02623.
- Show and tell: A neural image caption generator. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3156–3164.
- Lilt: A simple yet effective language-independent layout transformer for structured document understanding. arXiv preprint arXiv:2202.13669.
- Simvlm: Simple visual language model pretraining with weak supervision. arXiv preprint arXiv:2108.10904.
- WDC. 2022. Web data commons, extracting structured data from the common crawl.
- Xgpt: Cross-modal generative pre-training for image captioning. In CCF International Conference on Natural Language Processing and Chinese Computing, pages 786–797. Springer.
- Layoutlmv2: Multi-modal pre-training for visually-rich document understanding. arXiv preprint arXiv:2012.14740.
- Layoutlm: Pre-training of text and layout for document image understanding. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1192–1200.
- Verify-and-edit: A knowledge-enhanced chain-of-thought framework. arXiv preprint arXiv:2305.03268.
Authors: Ahmed Masry, Parsa Kavehzadeh, Xuan Long Do, Enamul Hoque, Shafiq Joty