
iScore: Visual Analytics for Interpreting How Language Models Automatically Score Summaries (2403.04760v1)

Published 7 Mar 2024 in cs.HC, cs.AI, cs.CY, and cs.LG

Abstract: The recent explosion in popularity of LLMs has inspired learning engineers to incorporate them into adaptive educational tools that automatically score summary writing. Understanding and evaluating LLMs is vital before deploying them in critical learning environments, yet their unprecedented size and expanding number of parameters inhibit transparency and impede trust when they underperform. Through a collaborative user-centered design process with several learning engineers building and deploying summary scoring LLMs, we characterized fundamental design challenges and goals around interpreting their models, including aggregating large text inputs, tracking score provenance, and scaling LLM interpretability methods. To address their concerns, we developed iScore, an interactive visual analytics tool for learning engineers to upload, score, and compare multiple summaries simultaneously. Tightly integrated views allow users to iteratively revise the language in summaries, track changes in the resulting LLM scores, and visualize model weights at multiple levels of abstraction. To validate our approach, we deployed iScore with three learning engineers over the course of a month. We present a case study where interacting with iScore led a learning engineer to improve their LLM's score accuracy by three percentage points. Finally, we conducted qualitative interviews with the learning engineers that revealed how iScore enabled them to understand, evaluate, and build trust in their LLMs during deployment.


Summary

  • The paper introduces iScore, a novel tool that helps interpret how LLMs automatically score summaries.
  • Developed through a user-centered design process, it lets engineers compare LLM scores against expert benchmarks and track how scores change as summaries are revised.
  • A month-long deployment showed practical gains, including a three-percentage-point improvement in one LLM's scoring accuracy for educational summary evaluation.

Visual Analytics for Interpreting LLM Summary Scoring

Introduction to iScore

iScore is an interactive visual analytics tool that helps learning engineers interpret and evaluate the LLMs they use for automatic summary scoring. It addresses the difficulty of understanding and trusting these complex models in educational contexts, where they assess student-written summaries. Developed in close collaboration with learning engineers, iScore lets users upload, score, and compare multiple source-summary pairs simultaneously against LLM predictions. Tightly integrated visual components support iterative revision of the language in summaries, tracking of the resulting changes in LLM scores, and inspection of model weights at multiple levels of abstraction.
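As a rough illustration of the scoring step iScore wraps (a sketch, not the authors' code), a fine-tuned transformer regression model scores each source-summary pair. The checkpoint path and the single-regression-head setup below are assumptions:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical fine-tuned checkpoint; the collaborators trained
# Longformer-style summary-scoring models, but this path is illustrative.
MODEL_NAME = "path/to/finetuned-summary-scorer"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

def score_summary(summary: str, source: str) -> float:
    # The summary and its source text are encoded as one paired input,
    # the kind of large aggregated input the design process flagged.
    enc = tokenizer(summary, source, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits  # shape (1, 1) for a regression head
    return logits.squeeze().item()
```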

Design and Development Process

iScore was built through a user-centered design process that surfaced distinct operational challenges: aggregating large text inputs, tracking score provenance, and scaling LLM interpretability methods to these large models. These challenges, and the tasks derived from them, formed the backbone of iScore's design, which must support both a broad overview and a detailed inspection of how LLMs score written summaries.

Core Features of iScore

iScore organizes its functionality into three primary views:

  • Assignments Panel: Lets users upload and score multiple source-summary pairs, and supports manual revision and re-scoring so they can observe how edits affect the LLM's output.
  • Scores Dashboard: Compares LLM scores over time and against expert-scored "ground truth" data, making visible how changes to a summary shift its score.
  • Model Analysis View: Offers two LLM interpretability methods for probing model behavior, including how individual words and sentences contribute to a summary's overall score (see the sketch after this section).

These features collectively enable learning engineers to probe, understand, and trust the automated scoring processes of their LLMs, enhancing the transparency and reliability of using such models in educational applications.
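To make the Model Analysis View concrete, here is a minimal sketch of token-level attribution using integrated gradients via Captum. This is one plausible wiring, not the authors' implementation; it assumes the `model` and `tokenizer` from the scoring sketch above and uses pad tokens as the attribution baseline:

```python
import torch
from captum.attr import LayerIntegratedGradients

def forward_score(input_ids, attention_mask):
    # Return a scalar score per example so gradients flow from the score.
    out = model(input_ids=input_ids, attention_mask=attention_mask)
    return out.logits.squeeze(-1)

def token_attributions(summary: str, source: str):
    enc = tokenizer(summary, source, truncation=True, return_tensors="pt")
    input_ids, mask = enc["input_ids"], enc["attention_mask"]
    # Baseline input: all pad tokens, a common neutral reference point.
    baseline = torch.full_like(input_ids, tokenizer.pad_token_id)
    lig = LayerIntegratedGradients(forward_score, model.get_input_embeddings())
    attrs = lig.attribute(inputs=input_ids, baselines=baseline,
                          additional_forward_args=(mask,))
    per_token = attrs.sum(dim=-1).squeeze(0)       # collapse embedding dim
    per_token = per_token / per_token.abs().max()  # normalize for display
    tokens = tokenizer.convert_ids_to_tokens(input_ids.squeeze(0))
    return list(zip(tokens, per_token.tolist()))
```

Aggregating the per-token values by sentence would yield the sentence-level contributions described above.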

Case Study and Evaluation

A month-long deployment of iScore, involving learning engineers from the collaborative design team, demonstrated the tool's practical benefits in refining LLM accuracy. Through its use, one engineer improved an LLM's scoring accuracy by three percentage points, underscoring iScore's value in real-world applications. In-depth interviews with the participating engineers revealed that iScore significantly contributed to a deeper understanding of LLM behavior, facilitated rigorous model evaluation, and fostered a greater sense of trust in using LLMs for educational purposes.
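As a toy illustration of the before/after comparison behind that result (the threshold and all values below are invented, not from the paper), accuracy here is simply agreement between thresholded LLM scores and the expert "ground truth" labels:

```python
def scoring_accuracy(llm_scores, expert_labels, threshold=0.5):
    """Fraction of summaries where the thresholded LLM score
    matches the expert pass/fail label."""
    preds = [s >= threshold for s in llm_scores]
    hits = sum(p == bool(y) for p, y in zip(preds, expert_labels))
    return hits / len(expert_labels)

# Invented example values:
scores = [0.81, 0.12, 0.66, 0.34]       # LLM scores for four summaries
labels = [1, 0, 1, 1]                   # expert pass/fail judgments
print(scoring_accuracy(scores, labels))  # 0.75
```

On a real evaluation set, a revision workflow that lifts this figure by 0.03 would correspond to the three-percentage-point gain reported in the case study.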

Implications for Future Research and Tool Development

The insights gathered from iScore's development and evaluation point to several directions for future work, including built-in statistical analysis capabilities within visual analytics tools and LLM interpretability techniques that are both more comprehensive and more computationally efficient. The engineers' feedback also suggests that tools like iScore can advance responsible and ethical AI in education, paving the way for more transparent, trustworthy, and effective use of AI in learning environments.

Conclusion

iScore sits at the intersection of visual analytics, machine learning interpretability, and educational technology, giving learning engineers a robust platform to interrogate, understand, and validate the processes underlying LLM-driven summary scoring.
