iScore: Visual Analytics for Interpreting How Language Models Automatically Score Summaries (2403.04760v1)
Abstract: The recent explosion in popularity of large language models (LLMs) has inspired learning engineers to incorporate them into adaptive educational tools that automatically score summary writing. Understanding and evaluating LLMs is vital before deploying them in critical learning environments, yet their unprecedented size and expanding number of parameters inhibit transparency and impede trust when they underperform. Through a collaborative user-centered design process with several learning engineers building and deploying summary scoring LLMs, we characterized fundamental design challenges and goals around interpreting their models, including aggregating large text inputs, tracking score provenance, and scaling LLM interpretability methods. To address their concerns, we developed iScore, an interactive visual analytics tool for learning engineers to upload, score, and compare multiple summaries simultaneously. Tightly integrated views allow users to iteratively revise the language in summaries, track changes in the resulting LLM scores, and visualize model weights at multiple levels of abstraction. To validate our approach, we deployed iScore with three learning engineers over the course of a month. We present a case study where interacting with iScore led a learning engineer to improve their LLM's score accuracy by three percentage points. Finally, we conducted qualitative interviews with the learning engineers that revealed how iScore enabled them to understand, evaluate, and build trust in their LLMs during deployment.
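The abstract mentions visualizing model weights "at multiple levels of abstraction." A minimal sketch of that idea, not the iScore implementation itself, is rolling token-level attribution scores up into word- and sentence-level scores; all names and numbers below are hypothetical.

```python
# Illustrative sketch (not the iScore implementation): aggregating
# token-level model weights into coarser levels of abstraction.

def aggregate_scores(token_scores, spans):
    """Average per-token scores over spans (e.g. words or sentences).

    token_scores: list of floats, one attribution score per token.
    spans: list of (start, end) index pairs into token_scores.
    """
    return [
        sum(token_scores[start:end]) / (end - start) if end > start else 0.0
        for start, end in spans
    ]

# Hypothetical token-level attributions for a six-token summary.
tokens = [0.10, 0.30, 0.20, 0.05, 0.90, 0.45]

# Word level: tokens 0-1 form one word, 2-3 another, 4-5 a third.
word_level = aggregate_scores(tokens, [(0, 2), (2, 4), (4, 6)])

# Sentence level: all six tokens form one sentence.
sentence_level = aggregate_scores(tokens, [(0, 6)])

print([round(v, 3) for v in word_level])     # roughly [0.2, 0.125, 0.675]
print(round(sentence_level[0], 3))           # roughly 0.333
```

Averaging is only one possible roll-up; a real tool might instead take the maximum per span, or renormalize attention weights, depending on the interpretability method being visualized.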