Natural Language Dataset Generation Framework for Visualizations Powered by Large Language Models

Published 19 Sep 2023 in cs.HC | (2309.10245v4)

Abstract: We introduce VL2NL, an LLM framework that generates rich and diverse NL datasets using only Vega-Lite specifications as input, thereby streamlining the development of Natural Language Interfaces (NLIs) for data visualization. To synthesize relevant chart semantics accurately and enhance syntactic diversity in each NL dataset, we leverage 1) a guided discovery incorporated into prompting so that LLMs can steer themselves to create faithful NL datasets in a self-directed manner; 2) a score-based paraphrasing to augment NL syntax along four language axes. We also present a new collection of 1,981 real-world Vega-Lite specifications that are more diverse and complex than existing chart collections. When tested on our chart collection, VL2NL extracted chart semantics and generated L1/L2 captions with 89.4% and 76.0% accuracy, respectively. It also generated and paraphrased utterances and questions with greater diversity than the benchmarks. Last, we discuss how our NL datasets and framework can be utilized in real-world scenarios. The code and chart collection are available at https://github.com/hyungkwonko/chart-LLM.

Summary

The paper introduces VL2NL, which leverages LLMs to generate rich natural language datasets for building intuitive visualization interfaces.
It outlines a three-stage process involving preprocessing, guided chart semantics analysis, and score-based paraphrasing for syntactic diversity.
Empirical results using 1,981 Vega-Lite specifications demonstrate high accuracy and enhanced utility for natural language interfaces in data visualization.

A Framework for LLM-Powered Natural Language Dataset Generation for Visualizations

The paper "Natural Language Dataset Generation Framework for Visualizations Powered by Large Language Models" introduces VL2NL, a framework that uses LLMs to generate natural language (NL) datasets automatically. Its aim is to support the development of Natural Language Interfaces (NLIs) for data visualization while taking only Vega-Lite specifications as input. Vega-Lite is a widely recognized grammar for creating interactive visualizations.

Core Contributions and Methodology

The primary contribution of the work is the development of VL2NL, which systematically generates rich and diverse NL datasets. These datasets are crucial for building NLIs that facilitate user interaction with data visualizations in more intuitive ways. The paper delineates a three-stage process within VL2NL:

Pre-processing of Data and Specifications: The framework first minimizes each Vega-Lite specification by externalizing dataset links. This reduces the token count of the input passed to the LLM.
Chart Semantics Analysis through Guided Discovery: VL2NL employs a guided discovery approach, leveraging the reasoning capabilities of LLMs. This involves scaffolding and key questioning techniques to analyze chart semantics accurately and integrate them into dataset generation. The method ensures the generation of faithful and relevant NL descriptions.
Enhancing Syntactic Diversity with Score-Based Paraphrasing: The framework implements a novel, score-based paraphrasing method, which explores syntactic variations across four defined language axes: formality, clarity, expertise, and subjectivity. This is achieved by systematically varying the scoring across these axes to generate diverse paraphrased outputs while maintaining semantic integrity.
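
To make the first and third stages more concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that "externalizing" means replacing inline data rows with a URL reference, and that each axis is scored on a 1-to-5 scale. The function names, the placeholder file name data.json, the score levels, and the prompt wording are all hypothetical.

```python
import copy
import itertools

# The four language axes named in the summary.
AXES = ("formality", "clarity", "expertise", "subjectivity")


def minimize_spec(spec, data_ref="data.json"):
    """Return a copy of a Vega-Lite spec with inline data rows swapped for a link.

    Assumption: the spec stores its rows under data.values, and data_ref is a
    hypothetical location where those rows would be saved separately.
    """
    slim = copy.deepcopy(spec)
    data = slim.get("data")
    if isinstance(data, dict) and "values" in data:
        slim["data"] = {"url": data_ref}
    return slim


def score_grid(levels=(1, 5)):
    """Enumerate every combination of the given score levels over the four axes."""
    return [
        dict(zip(AXES, combo))
        for combo in itertools.product(levels, repeat=len(AXES))
    ]


def paraphrase_prompt(sentence, scores):
    """Build an illustrative paraphrasing prompt for one score combination."""
    targets = ", ".join(f"{axis}={scores[axis]}" for axis in AXES)
    return (
        "Paraphrase the sentence so that it matches these scores "
        f"on a 1-5 scale ({targets}). Keep the meaning unchanged.\n"
        f"Sentence: {sentence}"
    )


if __name__ == "__main__":
    spec = {
        "mark": "bar",
        "data": {"values": [{"a": "A", "b": 28}, {"a": "B", "b": 55}]},
        "encoding": {
            "x": {"field": "a", "type": "nominal"},
            "y": {"field": "b", "type": "quantitative"},
        },
    }
    print(minimize_spec(spec))
    grid = score_grid()
    print(len(grid), "score combinations")
    print(paraphrase_prompt("Show sales by region.", grid[0]))
```

Using only the two extreme levels per axis yields 16 combinations per sentence; a finer grid would give more variants at the cost of more LLM calls.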

Results and Evaluation

The researchers collected a new dataset of 1,981 Vega-Lite specifications from GitHub, which they highlight as being diverse and exceeding the complexity of existing datasets. This dataset serves as the input for VL2NL. After processing with VL2NL, the researchers report high accuracy in the generation of L1 and L2 captions, achieving 89.4% and 76.0% accuracy, respectively, under strict criteria.

The generated NL datasets were further evaluated for diversity against existing benchmarks and human-generated datasets. The results indicated that VL2NL-generated datasets exhibited greater syntactic diversity, as evidenced by their performance across several within-distribution diversity metrics.
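
This summary does not name the diversity metrics used. As an illustration only, distinct-n (the share of unique n-grams among all n-grams in a set of sentences) is one common way to measure diversity within a dataset. The sketch below assumes simple lowercase whitespace tokenization and should not be read as the paper's evaluation code.

```python
def distinct_n(sentences, n=2):
    """Ratio of unique n-grams to total n-grams across all sentences.

    Values near 1.0 mean little repetition; values near 0.0 mean the
    sentences reuse the same word sequences heavily.
    """
    total = 0
    unique = set()
    for sentence in sentences:
        tokens = sentence.lower().split()
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(grams)
        unique.update(grams)
    return len(unique) / total if total else 0.0


if __name__ == "__main__":
    repetitive = ["show sales by region", "show sales by year"]
    varied = ["show sales by region", "how did revenue change over time"]
    print(distinct_n(repetitive))
    print(distinct_n(varied))
```

A higher score for one dataset than another on the same n suggests greater surface-level variety, though it says nothing about whether the sentences remain faithful to the chart.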

The framework’s utility was further demonstrated in a practical scenario where it was used to train models for visual data retrieval tasks. The inclusion of LLM-generated NL datasets improved model performance, demonstrating the practical applicability and benefits of the proposed framework.

Theoretical Implications and Future Directions

From a theoretical standpoint, VL2NL provides a structured approach to harnessing the power of LLMs for generating domain-specific textual datasets. The introduction of guided discovery in prompting and score-based paraphrasing are notable advancements that could influence future AI research for data visualization and other domains.

The paper also highlights future exploration avenues, such as extending the framework to generate other types of NL datasets like conversational dialogues or domain-specific references. Moreover, the authors suggest integrating external resources to mitigate the information limitations inherent in Vega-Lite specifications.

Conclusion

Overall, this research presents a systematic approach to generating the NL datasets needed to develop effective NLIs for data visualization. By improving the diversity and accuracy of these datasets, VL2NL can support the development of natural language-driven visualization tools that are easier to use.
