
Natural Language Dataset Generation Framework for Visualizations Powered by Large Language Models (2309.10245v4)

Published 19 Sep 2023 in cs.HC

Abstract: We introduce VL2NL, an LLM framework that generates rich and diverse NL datasets using only Vega-Lite specifications as input, thereby streamlining the development of Natural Language Interfaces (NLIs) for data visualization. To synthesize relevant chart semantics accurately and enhance syntactic diversity in each NL dataset, we leverage 1) a guided discovery incorporated into prompting so that LLMs can steer themselves to create faithful NL datasets in a self-directed manner; 2) a score-based paraphrasing to augment NL syntax along four language axes. We also present a new collection of 1,981 real-world Vega-Lite specifications that are more diverse and complex than existing chart collections. When tested on our chart collection, VL2NL extracted chart semantics and generated L1/L2 captions with 89.4% and 76.0% accuracy, respectively. It also generated and paraphrased utterances and questions with greater diversity than the benchmarks. Last, we discuss how our NL datasets and framework can be utilized in real-world scenarios. The code and chart collection are available at https://github.com/hyungkwonko/chart-LLM.

Authors (7)
  1. Hyung-Kwon Ko (7 papers)
  2. Hyeon Jeon (26 papers)
  3. Gwanmo Park (3 papers)
  4. Dae Hyun Kim (12 papers)
  5. Nam Wook Kim (14 papers)
  6. Juho Kim (56 papers)
  7. Jinwook Seo (30 papers)
Citations (14)

Summary

  • The paper introduces VL2NL, which leverages LLMs to generate rich natural language datasets for building intuitive visualization interfaces.
  • It outlines a three-stage process involving preprocessing, guided chart semantics analysis, and score-based paraphrasing for syntactic diversity.
  • Empirical results using 1,981 Vega-Lite specifications demonstrate high accuracy and enhanced utility for natural language interfaces in data visualization.

A Framework for LLM-Powered Natural Language Dataset Generation for Visualizations

The paper "Natural Language Dataset Generation Framework for Visualizations Powered by Large Language Models" introduces VL2NL, a framework that uses LLMs to automatically generate natural language (NL) datasets from Vega-Lite specifications, a widely used grammar for creating interactive visualizations. The aim is to streamline the development of Natural Language Interfaces (NLIs) for data visualization.

Core Contributions and Methodology

The primary contribution of the work is the development of VL2NL, which systematically generates rich and diverse NL datasets. These datasets are crucial for building NLIs that facilitate user interaction with data visualizations in more intuitive ways. The paper delineates a three-stage process within VL2NL:

  1. Pre-processing of Data and Specifications: The framework first minimizes each Vega-Lite specification by externalizing its dataset as a link. This step reduces the token count of the input passed to the LLM.
  2. Chart Semantics Analysis through Guided Discovery: VL2NL employs a guided discovery approach, leveraging the reasoning capabilities of LLMs. This involves scaffolding and key questioning techniques to analyze chart semantics accurately and integrate them into dataset generation. The method ensures the generation of faithful and relevant NL descriptions.
  3. Enhancing Syntactic Diversity with Score-Based Paraphrasing: The framework implements a novel, score-based paraphrasing method, which explores syntactic variations across four defined language axes: formality, clarity, expertise, and subjectivity. This is achieved by systematically varying the scoring across these axes to generate diverse paraphrased outputs while maintaining semantic integrity.
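
Stages 1 and 3 above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the 1-to-5 score scale, the choice of two score levels per axis, and the prompt wording are all hypothetical. Only the four axis names and the idea of replacing inline data with a link come from the description above. The sketch handles only top-level inline data; nested views with their own data would need extra handling.

```python
import copy
import itertools
import json

# The four language axes named in the summary above.
AXES = ("formality", "clarity", "expertise", "subjectivity")


def minimize_spec(spec, data_url):
    """Return a copy of a Vega-Lite spec with top-level inline data replaced by a URL.

    Vega-Lite accepts inline data as {"data": {"values": [...]}} and linked
    data as {"data": {"url": "..."}}; swapping the former for the latter
    shortens the text that has to be sent to the LLM.
    """
    slim = copy.deepcopy(spec)
    data = slim.get("data")
    if isinstance(data, dict) and "values" in data:
        slim["data"] = {"url": data_url}
    return slim


def score_grid(levels=(1, 5)):
    """Enumerate every combination of scores across the four axes.

    The default of two levels per axis (assumed scale: 1 = low, 5 = high)
    yields 2**4 = 16 score assignments.
    """
    return [
        dict(zip(AXES, combo))
        for combo in itertools.product(levels, repeat=len(AXES))
    ]


def paraphrase_prompt(sentence, scores):
    """Build one paraphrasing instruction (hypothetical wording) for a score assignment."""
    targets = ", ".join(f"{axis}={scores[axis]}" for axis in AXES)
    return (
        "Paraphrase the sentence without changing its meaning. "
        f"Target scores on a 1-5 scale: {targets}.\n"
        f"Sentence: {sentence}"
    )


if __name__ == "__main__":
    spec = {
        "mark": "bar",
        "data": {"values": [{"year": y, "sales": y % 7} for y in range(2000, 2020)]},
        "encoding": {
            "x": {"field": "year", "type": "ordinal"},
            "y": {"field": "sales", "type": "quantitative"},
        },
    }
    slim = minimize_spec(spec, "data/sales.json")  # hypothetical path
    # Character count is used here as a rough stand-in for token count.
    print(len(json.dumps(spec)), "->", len(json.dumps(slim)))
    for scores in score_grid()[:2]:
        print(paraphrase_prompt("Show sales by year as a bar chart.", scores))
```

Each prompt would then be sent to an LLM of choice. Varying the scores systematically, rather than asking for "a different paraphrase", is what makes the resulting variations controllable along the four axes.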

Results and Evaluation

The researchers collected a new dataset of 1,981 Vega-Lite specifications from GitHub, which they highlight as being diverse and exceeding the complexity of existing datasets. This dataset serves as the input for VL2NL. After processing with VL2NL, the researchers report high accuracy in the generation of L1 and L2 captions, achieving 89.4% and 76.0% accuracy, respectively, under strict criteria.

The generated NL datasets were further evaluated for diversity against existing benchmarks and human-generated datasets. The results indicated that VL2NL-generated datasets exhibited greater syntactic diversity, as evidenced by their performance across several within-distribution diversity metrics.
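
To make "within-distribution diversity" concrete, the sketch below computes two simple lexical measures over a set of sentences: distinct-n (unique n-grams divided by total n-grams) and the mean pairwise Jaccard distance between token sets. These are stand-ins chosen for illustration. The summary does not name the metrics the paper used, so they should not be read as the authors' evaluation.

```python
from itertools import combinations


def tokens(text):
    """Lowercase whitespace tokenization (a deliberately simple assumption)."""
    return text.lower().split()


def distinct_n(sentences, n=2):
    """Ratio of unique n-grams to total n-grams across all sentences."""
    grams = []
    for sentence in sentences:
        toks = tokens(sentence)
        grams.extend(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    return len(set(grams)) / len(grams) if grams else 0.0


def mean_pairwise_jaccard_distance(sentences):
    """Average of (1 - Jaccard similarity) over all sentence pairs."""
    token_sets = [set(tokens(s)) for s in sentences]
    pairs = list(combinations(token_sets, 2))
    if not pairs:
        return 0.0
    total = 0.0
    for a, b in pairs:
        union = a | b
        if union:
            total += 1.0 - len(a & b) / len(union)
    return total / len(pairs)


if __name__ == "__main__":
    utterances = [
        "Show sales by year as a bar chart.",
        "Could you plot yearly sales using bars?",
        "bar chart, sales per year",
    ]
    print(distinct_n(utterances, n=2))
    print(mean_pairwise_jaccard_distance(utterances))
```

Higher values on either measure indicate less lexical overlap within the set. Embedding-based distances would capture paraphrases that share meaning but not wording, which these token-level measures cannot.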

The framework’s utility was further demonstrated in a practical scenario where it was used to train models for visual data retrieval tasks. The inclusion of LLM-generated NL datasets improved model performance, demonstrating the practical applicability and benefits of the proposed framework.

Theoretical Implications and Future Directions

From a theoretical standpoint, VL2NL provides a structured approach to using LLMs to generate domain-specific textual datasets. Its two main techniques, guided discovery in prompting and score-based paraphrasing, could influence future AI research in data visualization and other domains.

The paper also highlights future exploration avenues, such as extending the framework to generate other types of NL datasets like conversational dialogues or domain-specific references. Moreover, the authors suggest integrating external resources to mitigate the information limitations inherent in Vega-Lite specifications.

Conclusion

Overall, this research presents a systematic approach to generating the NL datasets needed to develop effective NLIs for data visualization. By improving the diversity and accuracy of these datasets, VL2NL can support the development and refinement of natural language-driven data visualization tools. As AI is integrated further into user interface design, frameworks like VL2NL may help enable more intuitive human-computer interaction.
