
Visualization Generation with Large Language Models: An Evaluation (2401.11255v1)

Published 20 Jan 2024 in cs.HC

Abstract: Analysts frequently need to create visualizations in the data analysis process to obtain and communicate insights. To reduce the burden of creating visualizations, previous research has developed various approaches for analysts to create visualizations from natural language queries. Recent studies have demonstrated the capabilities of LLMs in natural language understanding and code generation tasks. The capabilities imply the potential of using LLMs to generate visualization specifications from natural language queries. In this paper, we evaluate the capability of a LLM to generate visualization specifications on the task of natural language to visualization (NL2VIS). More specifically, we have opted for GPT-3.5 and Vega-Lite to represent LLMs and visualization specifications, respectively. The evaluation is conducted on the nvBench dataset. In the evaluation, we utilize both zero-shot and few-shot prompt strategies. The results demonstrate that GPT-3.5 surpasses previous NL2VIS approaches. Additionally, the performance of few-shot prompts is higher than that of zero-shot prompts. We discuss the limitations of GPT-3.5 on NL2VIS, such as misunderstanding the data attributes and grammar errors in generated specifications. We also summarized several directions, such as correcting the ground truth and reducing the ambiguities in natural language queries, to improve the NL2VIS benchmark.

The paper "Visualization Generation with LLMs: An Evaluation" explores the potential of using LLMs, specifically GPT-3.5, to generate visualization specifications from natural language queries. This topic is important because data visualization is a key part of data analysis, and automating this process can significantly streamline analytical workflows for researchers and professionals who may not be experts in visualization design but need to communicate insights effectively.

Background and Relevance

Data visualization helps in uncovering patterns and communicating insights from data analysis. Creating effective visualizations is traditionally a skill-intensive task, requiring knowledge of visualization design principles. Automating this process using natural language queries can save time and effort, allowing analysts to focus on insights rather than the mechanics of visualization. This paper evaluates the capability of LLMs to automate this process, using natural language processing to produce visualization specifications.

Explanation of Key Concepts

  1. Natural Language to Visualization (NL2VIS): This task involves converting plain language descriptions into graphical data representations. The evaluation focuses on how well GPT-3.5, an advanced LLM, can handle this conversion using Vega-Lite, a popular high-level grammar for specifying visualizations declaratively.
  2. Prompt Strategies: The paper examines different strategies for prompting the LLM. Two key strategies are compared:
    • Zero-shot prompts: The LLM is given only the task description and the query, with no examples, relying solely on its pre-trained capabilities.
    • Few-shot prompts: The LLM is given a few example queries and their corresponding visualization specifications to guide its responses (a minimal prompting sketch follows this list).
  3. nvBench Dataset: This benchmark dataset is used to evaluate the LLM's performance. It contains a large collection of natural language queries paired with corresponding ground-truth visualizations.
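
The difference between the two prompt strategies can be made concrete with a short sketch. The snippet below contrasts zero-shot and few-shot prompt construction using the OpenAI chat completions client; the model name, table schema, demonstration pair, and prompt wording are illustrative assumptions, not the paper's exact setup.

```python
# Illustrative sketch of zero-shot vs. few-shot NL2VIS prompting (not the paper's exact prompts).
# Assumes the OpenAI Python client (openai>=1.0); the schema, query, and example below are
# hypothetical placeholders.
from openai import OpenAI

client = OpenAI()

SCHEMA = "Table cars(name TEXT, horsepower INT, origin TEXT, year INT)"
QUERY = "Show the average horsepower for each origin as a bar chart."

# One hand-written (query -> Vega-Lite) pair used as a demonstration in the few-shot prompt.
EXAMPLE = (
    "Query: Show the number of cars per year as a line chart.\n"
    "Vega-Lite: {\"mark\": \"line\", \"encoding\": {"
    "\"x\": {\"field\": \"year\", \"type\": \"ordinal\"}, "
    "\"y\": {\"aggregate\": \"count\", \"type\": \"quantitative\"}}}"
)

def build_prompt(few_shot: bool) -> str:
    """Assemble a zero-shot or few-shot NL2VIS prompt for the given query."""
    parts = ["Translate the natural language query into a Vega-Lite specification.", SCHEMA]
    if few_shot:
        parts.append(EXAMPLE)  # few-shot: include demonstration pairs before the target query
    parts.append(f"Query: {QUERY}\nVega-Lite:")
    return "\n\n".join(parts)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": build_prompt(few_shot=True)}],
)
print(response.choices[0].message.content)
```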

Evaluation Process

The evaluation assesses the accuracy of visualizations generated by GPT-3.5 by comparing them against the ground-truth visualizations in nvBench, based on both their visual appearance and their underlying data.

  • Matching Accuracy: This assesses whether the generated visualizations match the expected output. Two methods were used:
    • Pixel-based method: Compares rendered visualizations pixel by pixel, a very strict measure.
    • SVG/JSON-based method: Compares the chart type and the underlying data encoded in the specification, so that minor graphical inconsistencies do not count as mismatches (a specification-level comparison is sketched below).
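
As a rough illustration of the second, less strict comparison, the sketch below reduces a Vega-Lite specification to its chart type and encoded fields before checking equality. This is an assumed simplification of specification-level matching, not the paper's actual implementation.

```python
# Minimal sketch of a specification-level (JSON-based) match check: compare the chart type and
# the encoded fields/aggregates of a generated Vega-Lite spec against the ground truth,
# ignoring purely cosmetic properties such as colors or titles.
import json

def canonical(spec: dict) -> dict:
    """Reduce a Vega-Lite spec to the parts that determine its logical content."""
    encoding = spec.get("encoding", {})
    return {
        "mark": spec.get("mark"),
        "encoding": {
            channel: {k: v for k, v in defn.items() if k in ("field", "type", "aggregate")}
            for channel, defn in sorted(encoding.items())
        },
    }

def specs_match(generated: str, ground_truth: str) -> bool:
    """True if the two specifications encode the same chart type and data fields."""
    return canonical(json.loads(generated)) == canonical(json.loads(ground_truth))
```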

Findings and Recommendations

  1. Performance of LLM: The few-shot prompting strategy significantly improved performance over the zero-shot approach, indicating that example-based learning enables the LLM to handle complex queries better.
  2. Common Errors: Despite promising results, the LLM sometimes misinterprets data attributes or produces specifications that violate the Vega-Lite grammar. Clearer guidance on these areas could further improve performance.
  3. Improving Benchmarks: Some inconsistencies were found in the nvBench dataset itself, such as queries with ambiguous chart types or unstated time units. The authors suggest improving the benchmark by correcting erroneous ground truth and reducing ambiguity in the natural language queries.
  4. Potential for Linting Tools: Developing tools that check and correct Vega-Lite syntax could further refine LLM outputs, offering a practical pathway to reduce errors in specification generation (a sketch of such a check follows this list).
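
As one possible illustration of such a linting step, the sketch below validates a generated specification against the public Vega-Lite v5 JSON schema using the `jsonschema` package. The approach and schema URL are assumptions about how a linter could be built, not the paper's tooling.

```python
# Validate a generated Vega-Lite specification against the published JSON schema before
# rendering, reporting grammar violations instead of failing silently.
import json
import urllib.request

import jsonschema

SCHEMA_URL = "https://vega.github.io/schema/vega-lite/v5.json"

with urllib.request.urlopen(SCHEMA_URL) as resp:
    vl_schema = json.load(resp)

def lint(spec_text: str) -> list[str]:
    """Return a list of schema violations for a generated Vega-Lite spec (empty if valid)."""
    try:
        spec = json.loads(spec_text)
    except json.JSONDecodeError as err:
        return [f"invalid JSON: {err}"]
    validator = jsonschema.Draft7Validator(vl_schema)
    return [error.message for error in validator.iter_errors(spec)]
```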

Overall, the evaluation highlights both the potential and the current limitations of using LLMs for visualization automation. The findings point to opportunities for improving both the models, through better prompting and training data, and the benchmarks used to evaluate them.

Authors (8)
  1. Guozheng Li
  2. Xinyu Wang
  3. Gerile Aodeng
  4. Shunyuan Zheng
  5. Yu Zhang
  6. Chuangxin Ou
  7. Song Wang
  8. Chi Harold Liu