AVA: Towards Autonomous Visualization Agents through Visual Perception-Driven Decision-Making (2312.04494v1)
Abstract: With recent advances in multi-modal foundation models, previously text-only large language models (LLMs) have evolved to incorporate visual input, opening up unprecedented opportunities for applications in visualization. Our work explores how the visual perception ability of multi-modal LLMs can be used to develop Autonomous Visualization Agents (AVAs) that interpret and accomplish user-defined visualization objectives expressed in natural language. We propose the first framework for the design of AVAs and present several usage scenarios that demonstrate the general applicability of the proposed paradigm. The addition of visual perception allows AVAs to act as virtual visualization assistants for domain experts who may lack the knowledge or expertise to fine-tune visualization outputs. Our preliminary exploration and proof-of-concept agents suggest that this approach is applicable whenever choosing appropriate visualization parameters requires interpreting previous visual output. We also incorporate feedback from unstructured interviews with experts in AI research, medical visualization, and radiology, which highlights the practicality and potential of AVAs. Our study indicates that AVAs represent a general paradigm for designing intelligent visualization systems that can achieve high-level visualization goals, paving the way for expert-level visualization agents in the future.
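The perception-driven loop described in the abstract (render, interpret the visual output, adjust parameters, repeat) can be illustrated with a minimal Python sketch. This is not the paper's implementation: the names `render`, `perceive`, and `run_agent` are hypothetical, the "image" is reduced to a single coverage number, and `perceive` is a rule-based stand-in for what would be a multi-modal LLM call in an actual AVA.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    """Stand-in for a rendered image: fraction of the canvas covered, in [0, 1]."""
    coverage: float


def render(opacity: float, n_points: int = 5000) -> Observation:
    # Toy renderer (assumption): coverage grows with opacity and point count, capped at 1.
    return Observation(coverage=min(1.0, opacity * n_points / 1000.0))


def perceive(obs: Observation, goal: str) -> str:
    # Placeholder for the multi-modal LLM: a real agent would send the rendered
    # image and the natural-language goal to the model. Here `goal` is unused
    # and fixed thresholds produce the verdict.
    if obs.coverage > 0.6:
        return "too_dense"
    if obs.coverage < 0.3:
        return "too_sparse"
    return "ok"


def run_agent(goal: str, opacity: float = 1.0, max_steps: int = 20):
    """Iterate render -> perceive -> adjust until the verdict is 'ok' or steps run out."""
    history = []
    for _ in range(max_steps):
        obs = render(opacity)
        verdict = perceive(obs, goal)
        history.append((opacity, verdict))
        if verdict == "ok":
            break
        # Action step: adjust the visualization parameter based on the verdict.
        opacity = opacity * 0.5 if verdict == "too_dense" else opacity * 1.5
    return opacity, history


if __name__ == "__main__":
    final_opacity, steps = run_agent("reduce overplotting in the scatterplot")
    for value, verdict in steps:
        print(f"opacity={value:.4f} -> {verdict}")
```

The point of the sketch is the control flow: each parameter choice depends on an interpretation of the previous visual output, which is the condition under which the abstract suggests AVAs are applicable.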