
Do LVLMs Understand Charts? Analyzing and Correcting Factual Errors in Chart Captioning (2312.10160v2)

Published 15 Dec 2023 in cs.CL

Abstract: Recent advancements in large vision-language models (LVLMs) have led to significant progress in generating natural language descriptions for visual content and thus enhancing various applications. One issue with these powerful models is that they sometimes produce texts that are factually inconsistent with the visual input. While there has been some effort to mitigate such inconsistencies in natural image captioning, the factuality of generated captions for structured document images, such as charts, has not received as much scrutiny, posing a potential threat to information reliability in critical applications. This work delves into the factuality aspect by introducing a comprehensive typology of factual errors in generated chart captions. A large-scale human annotation effort provides insight into the error patterns and frequencies in captions crafted by various chart captioning models, ultimately forming the foundation of a novel dataset, CHOCOLATE. Our analysis reveals that even state-of-the-art models, including GPT-4V, frequently produce captions laced with factual inaccuracies. In response to this challenge, we establish the new task of Chart Caption Factual Error Correction and introduce CHARTVE, a model for visual entailment that outperforms proprietary and open-source LVLMs in evaluating factual consistency. Furthermore, we propose C2TFEC, an interpretable two-stage framework that excels at correcting factual errors. This work inaugurates a new domain in factual error correction for chart captions, presenting a novel evaluation mechanism, and demonstrating an effective approach to ensuring the factuality of generated chart captions. The code and data as well as the continuously updated benchmark can be found at: https://khuangaf.github.io/CHOCOLATE/.

Understanding Factual Errors in AI-Generated Chart Captions

Introduction to Chart Captioning Models

Chart captioning models have become increasingly proficient at generating natural language descriptions of visual content, including charts. This capability matters to data and business analysts, journalists, and others who depend on clear, accurate chart interpretations for reporting and decision-making. Despite the critical need for factual consistency, research has yet to thoroughly examine the factuality of AI-generated chart captions, which is essential for reliability in downstream applications.

Evaluating Factual Errors

To tackle this issue, the authors introduce CHOCOLATE, a dataset focused on identifying and categorizing factual errors in chart captions. A large-scale human annotation effort produced a broad typology of errors, ranging from incorrect numeric values and mislabeled axes to entirely out-of-context information. Analysis of the dataset reveals a high rate of factual errors across state-of-the-art captioning systems, spanning both task-specific models and large vision-language models (LVLMs), the latter including proprietary models (such as GPT-4V) as well as open-source ones.
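To make the sentence-level error labeling concrete, the sketch below shows one way such annotations could be represented and aggregated into a per-caption error rate. The field names and error labels (value_error, label_error, out_of_context) are illustrative placeholders, not the actual CHOCOLATE schema or error taxonomy.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical error labels, loosely mirroring the kinds of errors described
# above (wrong numbers, wrong axis/series labels, out-of-context content).
ERROR_TYPES = {"value_error", "label_error", "out_of_context", "no_error"}


@dataclass
class SentenceAnnotation:
    sentence: str
    error_type: str  # one of ERROR_TYPES


@dataclass
class CaptionAnnotation:
    chart_id: str
    model_name: str  # which captioning model produced the caption
    sentences: List[SentenceAnnotation] = field(default_factory=list)

    def error_rate(self) -> float:
        """Fraction of caption sentences flagged with any factual error."""
        if not self.sentences:
            return 0.0
        flagged = sum(s.error_type != "no_error" for s in self.sentences)
        return flagged / len(self.sentences)


# Example usage with made-up values.
example = CaptionAnnotation(
    chart_id="chart_0001",
    model_name="gpt-4v",
    sentences=[
        SentenceAnnotation("Revenue peaked at 40M in 2019.", "value_error"),
        SentenceAnnotation("The y-axis reports revenue in USD.", "no_error"),
    ],
)
print(f"Sentence-level error rate: {example.error_rate():.2f}")
```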

Progressing Towards Factual Correctness

These factual inaccuracies motivate the new task of Chart Caption Factual Error Correction, which requires producing a corrected caption that is factually consistent with the chart while making minimal edits to the original. To evaluate factual consistency, the authors introduce CHARTVE, a visual entailment model that outperforms both proprietary and open-source LVLMs. They also propose C2TFEC, an interpretable two-stage framework that improves factual accuracy: it first translates the visual content of a chart into a structured data table, then leverages the reasoning capabilities of LLMs (such as GPT-4) to review the caption and amend any inaccuracies against the table. In both automatic and human evaluations, C2TFEC outperforms leading LVLMs.
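A minimal sketch of the two-stage idea follows. It assumes two hypothetical callables, chart_to_table (standing in for a chart-to-table converter) and llm_correct (standing in for an LLM call); neither reflects the paper's actual implementation, and the prompt wording is only illustrative of the pipeline's structure.

```python
from typing import Callable


def correct_chart_caption(
    chart_image_path: str,
    caption: str,
    chart_to_table: Callable[[str], str],  # hypothetical chart-to-table model
    llm_correct: Callable[[str], str],     # hypothetical LLM completion call
) -> str:
    # Stage 1: derender the chart into a structured data table.
    table = chart_to_table(chart_image_path)

    # Stage 2: ask the LLM to fix factual errors in the caption, grounding
    # every change in the extracted table and keeping edits minimal.
    prompt = (
        "Below is a data table extracted from a chart and a caption that may "
        "contain factual errors.\n\n"
        f"Table:\n{table}\n\n"
        f"Caption:\n{caption}\n\n"
        "Rewrite the caption so that every stated fact is consistent with the "
        "table, changing as little of the original wording as possible."
    )
    return llm_correct(prompt)
```

Keeping the intermediate table explicit is what makes this kind of framework interpretable: each corrected statement can be checked directly against the extracted data.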

Conclusions and Future Directions

The paper makes a substantive contribution to the accuracy and trustworthiness of AI-generated content. Reliable generated content is crucial to maintaining trust in automated systems, and this work marks a significant step toward improving the veracity of AI-generated chart captions. Future work may extend these factual error correction techniques to other forms of visual information and refine detection and correction methods for even greater accuracy.

Authors (8)
  1. Kung-Hsiang Huang (22 papers)
  2. Mingyang Zhou (27 papers)
  3. Hou Pong Chan (36 papers)
  4. Yi R. Fung (31 papers)
  5. Zhenhailong Wang (17 papers)
  6. Lingyu Zhang (21 papers)
  7. Shih-Fu Chang (131 papers)
  8. Heng Ji (266 papers)
Citations (27)
