LLM Comparator: Visual Analytics for Side-by-Side Evaluation of Large Language Models (arXiv:2402.10524v1)
Abstract: Automatic side-by-side evaluation has emerged as a promising approach to evaluating the quality of responses from LLMs. However, analyzing the results from this evaluation approach raises scalability and interpretability challenges. In this paper, we present LLM Comparator, a novel visual analytics tool for interactively analyzing results from automatic side-by-side evaluation. The tool supports interactive workflows for users to understand when and why a model performs better or worse than a baseline model, and how the responses from two models are qualitatively different. We iteratively designed and developed the tool by closely working with researchers and engineers at a large technology company. This paper details the user challenges we identified, the design and development of the tool, and an observational study with participants who regularly evaluate their models.
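As a rough illustration of the kind of data such a tool consumes, the sketch below shows how per-prompt verdicts from an automatic rater could be aggregated into an overall win rate for a model against a baseline. This is a minimal, hypothetical example: the record fields, the score convention, and the tie-handling rule are assumptions for illustration, not the data schema or aggregation actually used by LLM Comparator.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SideBySideRecord:
    # One prompt, the two model responses, and the automatic rater's verdict.
    # Score convention (assumed for this sketch): > 0 favors model A,
    # < 0 favors the baseline B, and 0 is a tie.
    prompt: str
    response_a: str
    response_b: str
    judge_score: float


def win_rate(records: List[SideBySideRecord]) -> float:
    """Fraction of prompts where model A is preferred over the baseline,
    counting ties as half a win (one common convention, assumed here)."""
    if not records:
        return 0.0
    wins = sum(
        1.0 if r.judge_score > 0 else 0.5 if r.judge_score == 0 else 0.0
        for r in records
    )
    return wins / len(records)


# Usage: aggregate a handful of hypothetical rater verdicts.
records = [
    SideBySideRecord("Summarize this email.", "...", "...", judge_score=1.0),
    SideBySideRecord("Write a haiku about rain.", "...", "...", judge_score=-0.5),
    SideBySideRecord("Explain a TCP handshake.", "...", "...", judge_score=0.0),
]
print(f"Model A win rate vs. baseline: {win_rate(records):.2f}")
```

A visual analytics tool like the one described would then let users slice such aggregate metrics by prompt category and drill down into individual responses to see why one model is preferred.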