mChartQA: A universal benchmark for multimodal Chart Question Answer based on Vision-Language Alignment and Reasoning (2404.01548v1)
Abstract: In computer vision and natural language processing, multimodal chart question answering poses significant challenges, especially for questions about color and structure and for charts that contain little or no text. Traditional methods, which typically either process the chart image directly in a multimodal model or first convert the chart to a table and analyze the resulting text with an LLM, fall short in these complex scenarios. This paper introduces a novel multimodal chart question-answering model designed specifically for such tasks. The model integrates visual and linguistic processing, overcoming the constraints of existing methods, and adopts a dual-phase training approach: the initial phase aligns image and text representations, while the subsequent phase optimizes the model's interpretative and analytical abilities on chart-related queries. This approach demonstrates superior performance on multiple public datasets, particularly on color, structure, and textless chart questions, indicating its effectiveness in complex multimodal tasks.
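To make the dual-phase recipe concrete, the following PyTorch sketch shows what such a two-phase schedule can look like. Every name, dimension, and objective here (the stand-in encoder and LLM modules, the cosine-alignment loss, the toy random data) is an assumption for exposition, not the paper's actual implementation; in practice the frozen components would be a pretrained vision encoder and a pretrained language model, trained with caption/alignment and chart-QA objectives.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative stand-ins (assumptions, not the paper's modules).
VISION_DIM, TEXT_DIM = 512, 768
vision_encoder = nn.Linear(1024, VISION_DIM)    # stand-in for a frozen ViT
language_model = nn.Linear(TEXT_DIM, TEXT_DIM)  # stand-in for a frozen LLM
projector = nn.Sequential(                      # trainable alignment layer
    nn.Linear(VISION_DIM, TEXT_DIM), nn.GELU(), nn.Linear(TEXT_DIM, TEXT_DIM)
)

def run_phase(trainable_modules, steps):
    """Freeze everything, unfreeze only `trainable_modules`, then optimize
    a toy alignment objective (cosine distance to target text embeddings)."""
    for m in (vision_encoder, language_model, projector):
        m.requires_grad_(False)
    params = []
    for m in trainable_modules:
        m.requires_grad_(True)
        params += list(m.parameters())
    opt = torch.optim.AdamW(params, lr=1e-4)
    for _ in range(steps):
        pixels = torch.randn(8, 1024)        # fake image batch
        target = torch.randn(8, TEXT_DIM)    # fake text embeddings
        out = language_model(projector(vision_encoder(pixels)))
        loss = 1 - F.cosine_similarity(out, target, dim=-1).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

# Phase 1 (alignment): train only the projector; encoder and LLM stay frozen.
run_phase([projector], steps=10)

# Phase 2 (chart reasoning): also fine-tune the language model on
# chart-QA-style supervision (approximated here by the same toy loss).
run_phase([projector, language_model], steps=10)
```

The point of the split is that the first phase only has to learn a mapping between the two modalities, so the heavy pretrained components stay fixed; the second phase then adapts the language side to chart-specific reasoning without relearning the alignment from scratch.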
Authors: Jingxuan Wei, Nan Xu, Guiyong Chang, Yin Luo, Ruifeng Guo, Bihui Yu