
ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation (2406.09961v1)

Published 14 Jun 2024 in cs.SE, cs.CL, and cs.CV

Abstract: We introduce a new benchmark, ChartMimic, aimed at assessing the visually-grounded code generation capabilities of large multimodal models (LMMs). ChartMimic utilizes information-intensive visual charts and textual instructions as inputs, requiring LMMs to generate the corresponding code for chart rendering. ChartMimic includes 1,000 human-curated (figure, instruction, code) triplets, which represent the authentic chart use cases found in scientific papers across various domains (e.g., Physics, Computer Science, Economics, etc.). These charts span 18 regular types and 4 advanced types, diversifying into 191 subcategories. Furthermore, we propose multi-level evaluation metrics to provide an automatic and thorough assessment of the output code and the rendered charts. Unlike existing code generation benchmarks, ChartMimic places emphasis on evaluating LMMs' capacity to harmonize a blend of cognitive capabilities, encompassing visual understanding, code generation, and cross-modal reasoning. The evaluation of 3 proprietary models and 11 open-weight models highlights the substantial challenges posed by ChartMimic. Even the advanced GPT-4V and Claude-3-opus only achieve average scores of 73.2 and 53.7, respectively, indicating significant room for improvement. We anticipate that ChartMimic will inspire the development of LMMs, advancing the pursuit of artificial general intelligence.

ChartMimic: Evaluating LMMs' Cross-Modal Reasoning Capability via Chart-to-Code Generation

The paper introduces ChartMimic, a new benchmark designed to evaluate the visually-grounded code generation capabilities of large multimodal models (LMMs). Unlike the majority of existing benchmarks, which rely solely on textual inputs, ChartMimic pairs information-intensive visual charts with textual instructions. The benchmark challenges LMMs to generate accurate code for rendering these charts, requiring models to combine visual understanding, code generation, and cross-modal reasoning.

Benchmark Overview

ChartMimic is composed of 1,000 human-curated triplets of figures, instructions, and corresponding code. These data points are extracted from scientific papers encompassing various domains such as Physics, Computer Science, and Economics. The charts span 18 regular types and 4 advanced types, which are further diversified into 191 subcategories. This extensive diversity ensures the benchmark provides a comprehensive evaluation of LMM capabilities in generating code from complex and varied visual inputs.
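For concreteness, a single benchmark example can be thought of as a (figure, instruction, code) triplet. The sketch below is a minimal, hypothetical representation of such a data point; the field names and file layout are illustrative assumptions, not the benchmark's actual schema.

```python
from dataclasses import dataclass


@dataclass
class ChartMimicExample:
    """Illustrative container for one (figure, instruction, code) triplet.

    Field names are assumptions for exposition; the released dataset
    may organize these differently.
    """
    figure_path: str   # rendered chart image shown to the LMM
    instruction: str   # accompanying textual instruction
    code: str          # ground-truth matplotlib code that reproduces the chart
    chart_type: str    # one of the 18 regular or 4 advanced chart types


# Hypothetical usage: the LMM receives the image and instruction, and its
# generated code is later compared against the ground-truth `code`.
example = ChartMimicExample(
    figure_path="figures/line_chart_001.png",
    instruction="Redraw this chart, preserving the legend and axis labels.",
    code="import matplotlib.pyplot as plt\n# ... ground-truth plotting code ...",
    chart_type="line",
)
```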

Evaluation Metrics

To thoroughly assess the performance of LMMs on ChartMimic, the authors propose multi-level evaluation metrics. These metrics include both high-level and low-level assessments. The high-level metric (GPT-4V Score) relies on GPT-4V to evaluate the visual similarity between the rendered and ground-truth figures, while low-level metrics encompass text, layout, type, and color scores. These multi-faceted metrics allow for a detailed evaluation of code accuracy and visual fidelity, providing insights into different aspects of the models' cross-modal reasoning abilities.
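As a rough illustration of how an element-level low-level metric and an overall score could be combined, consider the sketch below. The extraction procedure, matching rules, and weighting used here are assumptions for exposition; the paper's actual metric definitions may differ.

```python
def text_f1(pred_texts: set[str], gt_texts: set[str]) -> float:
    """Illustrative F1 over text elements (titles, labels, tick text)
    extracted from the generated and ground-truth charts."""
    if not pred_texts or not gt_texts:
        return 0.0
    matched = pred_texts & gt_texts
    precision = len(matched) / len(pred_texts)
    recall = len(matched) / len(gt_texts)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def overall_score(low_level: dict[str, float], gpt4v_score: float) -> float:
    """Hypothetical aggregation: average the low-level scores (scaled to
    0-100) with the high-level GPT-4V similarity score."""
    low = 100 * sum(low_level.values()) / len(low_level)
    return (low + gpt4v_score) / 2


# Example with made-up component scores on the text/layout/type/color axes.
score = overall_score(
    {"text": 0.82, "layout": 0.75, "type": 1.0, "color": 0.60},
    gpt4v_score=78.0,
)
```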

Model Performance

The paper benchmarks 14 LMMs, including 3 proprietary models (GPT-4V, Claude-3-opus, GeminiProVision) and 11 open-weight models (e.g., LLaVA-Next-Vicuna-7B, Phi-3-Vision). The evaluation reveals a substantial performance disparity between open-weight and proprietary models. Specifically, GPT-4V outperforms all other models, achieving an average overall score of 71.4 for the Direct Mimic task and 72.33 for the Customized Mimic task. In contrast, the best-performing open-weight model, Phi-3-Vision, scores significantly lower (31.9 and 40.18, respectively). This highlights the substantial challenges posed by ChartMimic and indicates significant room for improvement in the open-source LMM community.

Error Analysis

The paper includes a comprehensive error analysis, categorizing errors into code-related, text-related, type-related, and color-related issues. The most prevalent errors stem from dimension issues in code (e.g., incorrect data dimensions), missing text elements, and misinterpreted chart types. These insights emphasize the need for improved model capabilities in understanding and accurately reproducing the nuanced visual elements and data relationships within the charts.
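To make the "dimension issue" failure mode concrete, the following minimal matplotlib snippet (a constructed example, not taken from the paper) shows the kind of mismatch between data lengths that generated code commonly exhibits, and the one-line fix that lets the chart render.

```python
import matplotlib.pyplot as plt

x = [2019, 2020, 2021, 2022]
y = [3.1, 3.4, 2.9]        # one value short: a typical dimension slip

# plt.plot(x, y) here would raise a ValueError because x and y do not
# have the same first dimension.
y = y + [3.6]              # align the lengths so the chart can be drawn

plt.plot(x, y)
plt.xlabel("Year")
plt.ylabel("Score")
plt.title("Example line chart")
plt.savefig("chart.png")
```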

Implications and Future Directions

The introduction of ChartMimic has several implications for the development of LMMs and the pursuit of artificial general intelligence (AGI). By emphasizing the necessity for advanced cross-modal reasoning, ChartMimic pushes the boundaries of current model capabilities, highlighting both strengths and areas for improvement. The benchmark’s comprehensive evaluation framework not only offers a robust tool for researchers to assess and enhance their models but also encourages the exploration of innovative techniques to bridge the performance gap between open-weight and proprietary models.

Future research may focus on various aspects such as refining prompt strategies for multimodal reasoning, enhancing data pre-processing and augmentation techniques, and developing more sophisticated model architectures. Additionally, expanding the benchmark to include more diverse and complex visual inputs could further challenge and advance the field of LMM development.

Conclusion

ChartMimic provides a rigorous and multifaceted benchmark for evaluating the cross-modal reasoning capabilities of LMMs in the context of chart-to-code generation. By incorporating diverse and information-intensive visual inputs, along with a robust evaluation framework, ChartMimic sets a high bar for future advancements in the field. The benchmark’s insights and detailed error analysis present valuable opportunities for researchers to innovate and improve large multimodal models, driving forward the quest for AGI.

Authors (14)
  1. Chufan Shi (15 papers)
  2. Cheng Yang (168 papers)
  3. Yaxin Liu (17 papers)
  4. Bo Shui (4 papers)
  5. Junjie Wang (164 papers)
  6. Mohan Jing (4 papers)
  7. Linran Xu (2 papers)
  8. Xinyu Zhu (28 papers)
  9. Siheng Li (20 papers)
  10. Yuxiang Zhang (104 papers)
  11. Gongye Liu (7 papers)
  12. Xiaomei Nie (2 papers)
  13. Deng Cai (181 papers)
  14. Yujiu Yang (155 papers)
Citations (8)