Natural Language Dataset Generation Framework for Visualizations Powered by Large Language Models (2309.10245v4)
Abstract: We introduce VL2NL, an LLM framework that generates rich and diverse NL datasets using only Vega-Lite specifications as input, thereby streamlining the development of Natural Language Interfaces (NLIs) for data visualization. To synthesize relevant chart semantics accurately and enhance syntactic diversity in each NL dataset, we leverage 1) guided discovery incorporated into prompting, so that LLMs can steer themselves to create faithful NL datasets in a self-directed manner; and 2) score-based paraphrasing to augment NL syntax along four language axes. We also present a new collection of 1,981 real-world Vega-Lite specifications that have greater diversity and complexity than existing chart collections. When tested on our chart collection, VL2NL extracted chart semantics and generated L1/L2 captions with 89.4% and 76.0% accuracy, respectively. It also generated and paraphrased utterances and questions with greater diversity than the benchmarks. Lastly, we discuss how our NL datasets and framework can be utilized in real-world scenarios. The code and chart collection are available at https://github.com/hyungkwonko/chart-LLM.
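To make "using only Vega-Lite specifications as input" more concrete, here is a minimal Python sketch of how chart semantics might be read from a single-view spec and turned into prompt text. It is not the authors' implementation: the function names `extract_semantics` and `build_prompt`, the example spec, and the prompt wording are all assumptions made for illustration. The guided-discovery prompting and the LLM call are not reproduced here.

```python
import json

# Hypothetical single-view spec, written for this example only;
# it is not taken from the paper's chart collection.
SPEC = json.loads("""
{
  "mark": "bar",
  "encoding": {
    "x": {"field": "genre", "type": "nominal"},
    "y": {"field": "gross", "type": "quantitative", "aggregate": "mean"}
  }
}
""")


def extract_semantics(spec):
    """Collect the mark type and per-channel encodings from a single-view spec."""
    mark = spec.get("mark")
    # Vega-Lite lets "mark" be a plain string or an object with a "type" key.
    if isinstance(mark, dict):
        mark = mark.get("type")

    channels = {}
    for channel, enc in spec.get("encoding", {}).items():
        # Skip channels whose definition is not a single object (e.g. a list).
        if not isinstance(enc, dict):
            continue
        channels[channel] = {
            "field": enc.get("field"),
            "type": enc.get("type"),
            "aggregate": enc.get("aggregate"),
        }
    return {"mark": mark, "channels": channels}


def build_prompt(semantics):
    """Format the extracted semantics as plain text for a captioning prompt."""
    lines = [f"Chart type: {semantics['mark']}"]
    for channel, enc in semantics["channels"].items():
        # Fall back to a generic word when a channel has no field (e.g. a count).
        desc = enc["field"] or "records"
        if enc["aggregate"]:
            desc = f"{enc['aggregate']} of {desc}"
        lines.append(f"{channel}: {desc} ({enc['type']})")
    return "Describe this chart in one sentence.\n" + "\n".join(lines)


if __name__ == "__main__":
    print(build_prompt(extract_semantics(SPEC)))
```

The sketch handles only the top-level `mark` and `encoding` keys. Layered, faceted, or concatenated specs, which a real-world collection would likely contain, would need recursive traversal.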
Authors:
- Hyung-Kwon Ko
- Hyeon Jeon
- Gwanmo Park
- Dae Hyun Kim
- Nam Wook Kim
- Juho Kim
- Jinwook Seo