
Plot2Code: A Comprehensive Benchmark for Evaluating Multi-modal Large Language Models in Code Generation from Scientific Plots (2405.07990v1)

Published 13 May 2024 in cs.CL and cs.CV

Abstract: The remarkable progress of Multi-modal LLMs (MLLMs) has attracted significant attention due to their superior performance in visual contexts. However, their capabilities in turning visual figures into executable code have not been evaluated thoroughly. To address this, we introduce Plot2Code, a comprehensive visual coding benchmark designed for a fair and in-depth assessment of MLLMs. We carefully collect 132 manually selected high-quality matplotlib plots across six plot types from publicly available matplotlib galleries. For each plot, we provide its source code and a descriptive instruction summarized by GPT-4. This approach enables Plot2Code to extensively evaluate MLLMs' code capabilities across various input modalities. Furthermore, we propose three automatic evaluation metrics, including code pass rate, text-match ratio, and GPT-4V overall rating, for a fine-grained assessment of the output code and rendered images. Instead of simply judging pass or fail, we employ GPT-4V to make an overall judgement between the generated and reference images, which has been shown to be consistent with human evaluation. The evaluation results, which include analyses of 14 MLLMs such as the proprietary GPT-4V, Gemini-Pro, and the open-sourced Mini-Gemini, highlight the substantial challenges presented by Plot2Code. With Plot2Code, we reveal that most existing MLLMs struggle with visual coding for text-dense plots and rely heavily on textual instruction. We hope that the evaluation results from Plot2Code on visual coding will guide the future development of MLLMs. All data involved with Plot2Code are available at https://huggingface.co/datasets/TencentARC/Plot2Code.

Understanding "Plot2Code": Evaluating MLLMs in Code Generation from Visual Inputs

Introduction to the Study

In recent years, the fusion of visual processing and LLMs has given rise to Multi-modal LLMs (MLLMs). These advanced AI models can understand and generate responses based on both text and image inputs. However, one challenging capability remains relatively underexplored: turning complex visual data, such as graphs or plots, into executable code. The paper introduces "Plot2Code," a benchmark designed specifically to evaluate how well MLLMs convert matplotlib plot images into source code.

What is Plot2Code?

"Plot2Code" is not just another dataset. It's a meticulously crafted benchmark containing 132 high-quality matplotlib plots, selected to specifically challenge the MLLMs in diverse visual scenarios. Each plot in the dataset is paired with its source code and a descriptive instruction created by GPT-4, allowing comprehensive testing across various plot types and complexities.

How Does Plot2Code Work?

The authors of the paper designed Plot2Code with two main evaluation settings:

  1. Direct Asking: The model receives only the image of the plot and must generate the source code to recreate it.
  2. Conditional Asking: The model is given the plot image along with textual instructions, which detail specifics about the plot that must be reflected in the generated code.

These settings examine how well models generate accurate, executable code from visual input alone, and how much additional textual description helps. Outputs are scored with three automatic metrics: code pass rate, text-match ratio, and a GPT-4V overall rating that compares the rendered image against the reference and has been shown to be consistent with human evaluation.
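
To make the two settings concrete, the following sketch shows how the prompts might differ. The prompt wording and the query_mllm helper are hypothetical stand-ins, not the authors' exact prompts or API.

    # Illustrative prompt construction for the two evaluation settings.
    # DIRECT_PROMPT wording and query_mllm() are hypothetical, not from the paper.
    from typing import Optional

    DIRECT_PROMPT = (
        "Here is an image of a matplotlib plot. "
        "Write Python code that reproduces this plot as closely as possible."
    )

    def build_prompt(instruction: Optional[str] = None) -> str:
        """Direct Asking sends only the image with a generic request;
        Conditional Asking appends the GPT-4-written plot description."""
        if instruction is None:
            return DIRECT_PROMPT                                    # Direct Asking
        return f"{DIRECT_PROMPT}\n\nPlot description:\n{instruction}"  # Conditional Asking

    # Hypothetical usage, assuming some query_mllm(image_bytes, prompt) client exists:
    # code_direct      = query_mllm(image_bytes, build_prompt())
    # code_conditional = query_mllm(image_bytes, build_prompt(instruction_text))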

Key Findings from the Study

The evaluation of 14 different MLLMs using Plot2Code revealed several fascinating insights:

  • The top-performing models were GPT-4V and Claude-3, with GPT-4V achieving an overall score of 7.68 out of 10 in the Conditional Asking setting.
  • Across the board, MLLMs struggled more with Direct Asking compared to Conditional Asking. This suggests that textual instructions play a significant role in guiding the models toward correct code generation.
  • Text-dense plots (plots with a lot of textual information) posed a significant challenge for most models, indicating a potential area for future improvement.
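
These comparisons rest partly on the benchmark's simplest metric, code pass rate: a generation only counts if the code executes and renders a figure. The sketch below shows one plausible way to implement such a check; it is an assumption about the harness, not the paper's actual evaluation code, and it simplifies sandboxing and image comparison.

    # Rough sketch of a code-pass-rate style check: run each generated snippet
    # in a subprocess with a non-interactive matplotlib backend and count how
    # many execute without error. Not the authors' exact harness.
    import subprocess
    import sys
    import tempfile

    HEADER = "import matplotlib\nmatplotlib.use('Agg')\n"  # avoid GUI backends

    def passes(code: str, timeout: float = 30.0) -> bool:
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(HEADER + code)
            path = f.name
        try:
            result = subprocess.run([sys.executable, path],
                                    capture_output=True, timeout=timeout)
            return result.returncode == 0
        except subprocess.TimeoutExpired:
            return False

    def pass_rate(snippets):
        """Fraction of generated code snippets that run to completion."""
        return sum(passes(c) for c in snippets) / max(len(snippets), 1)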

Practical Implications

The results from Plot2Code provide several practical implications for the development of MLLMs:

  • Accuracy in Code Generation: The ability to generate executable code from visual inputs can significantly streamline tasks such as automated report generation and data analysis, particularly in data-driven fields like statistics and data science.
  • Model Training and Improvement: Insights from the Plot2Code assessments can help researchers and developers understand current limitations and enhance model training procedures, potentially leading to more robust MLLMs.

Speculations on Future Developments

Looking forward, Plot2Code could drive several advancements in AI:

  • Enhanced Multi-modal Understanding: This benchmark could spur further research into improving the multi-modal capabilities of AI models, ensuring they understand and process combined data forms (textual, visual) more effectively.
  • Development of Specialized Models: We might see the rise of specialized MLLMs that excel in specific domains like scientific visualization or technical diagrams.

Conclusion

Plot2Code represents a significant step in testing and enhancing the capabilities of multi-modal LLMs in a practical, challenging area of AI: generating code from visual data. While the results indicate room for improvement, particularly in handling plots with dense textual data without supplemental text instructions, they also highlight the considerable potential of current models and set a pathway for future advancements.

Authors (8)
  1. Chengyue Wu
  2. Yixiao Ge
  3. Qiushan Guo
  4. Jiahao Wang
  5. Zhixuan Liang
  6. Zeyu Lu
  7. Ying Shan
  8. Ping Luo