mChartQA: A universal benchmark for multimodal Chart Question Answer based on Vision-Language Alignment and Reasoning (2404.01548v1)

Published 2 Apr 2024 in cs.CV and cs.AI

Abstract: In the fields of computer vision and natural language processing, multimodal chart question-answering, especially involving color, structure, and textless charts, poses significant challenges. Traditional methods, which typically involve either direct multimodal processing or a table-to-text conversion followed by LLM analysis, have limitations in effectively handling these complex scenarios. This paper introduces a novel multimodal chart question-answering model, specifically designed to address these intricate tasks. Our model integrates visual and linguistic processing, overcoming the constraints of existing methods. We adopt a dual-phase training approach: the initial phase focuses on aligning image and text representations, while the subsequent phase concentrates on optimizing the model's interpretative and analytical abilities in chart-related queries. This approach has demonstrated superior performance on multiple public datasets, particularly in handling color, structure, and textless chart questions, indicating its effectiveness in complex multimodal tasks.
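The abstract describes the dual-phase recipe but not its concrete objectives. Below is a minimal, self-contained PyTorch sketch of one plausible reading: phase 1 aligns chart images and text with a symmetric contrastive loss, and phase 2 trains a question-answering head on top of the aligned encoders. The model, dimensions, and losses (ToyChartQAModel, an InfoNCE-style alignment loss, a classification-style answer head) are illustrative assumptions, not the authors' actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyChartQAModel(nn.Module):
    """Toy stand-ins for a vision encoder, a text encoder, and an answer head.
    Shapes and components are assumptions for illustration only."""

    def __init__(self, dim=256, vocab_size=1000, num_answers=100):
        super().__init__()
        self.vision_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, dim))
        self.token_embed = nn.Embedding(vocab_size, dim)
        self.answer_head = nn.Linear(2 * dim, num_answers)

    def encode_image(self, images):
        return F.normalize(self.vision_encoder(images), dim=-1)

    def encode_text(self, token_ids):
        # Mean-pooled token embeddings as a crude text representation.
        return F.normalize(self.token_embed(token_ids).mean(dim=1), dim=-1)

    def answer(self, images, token_ids):
        fused = torch.cat([self.encode_image(images), self.encode_text(token_ids)], dim=-1)
        return self.answer_head(fused)

def contrastive_alignment_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE: matched image-text pairs lie on the diagonal."""
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(len(logits))
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

model = ToyChartQAModel()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
images = torch.randn(8, 3, 64, 64)           # dummy chart images
token_ids = torch.randint(0, 1000, (8, 16))  # dummy question/caption tokens

# Phase 1: align image and text representations.
loss = contrastive_alignment_loss(model.encode_image(images), model.encode_text(token_ids))
loss.backward(); opt.step(); opt.zero_grad()

# Phase 2: optimize chart question answering on top of the aligned encoders.
answers = torch.randint(0, 100, (8,))        # dummy answer labels
loss = F.cross_entropy(model.answer(images, token_ids), answers)
loss.backward(); opt.step(); opt.zero_grad()
```

In this reading, phase 1 pulls matched chart-text pairs together in a shared embedding space, so the phase 2 answer head starts from features in which visual and linguistic evidence are already directly comparable.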

Authors (6)
  1. Jingxuan Wei (21 papers)
  2. Nan Xu (83 papers)
  3. Guiyong Chang (2 papers)
  4. Yin Luo (5 papers)
  5. Ruifeng Guo (10 papers)
  6. Bihui Yu (16 papers)
Citations (2)
