
iScore: Visual Analytics for Interpreting How Language Models Automatically Score Summaries (2403.04760v1)

Published 7 Mar 2024 in cs.HC, cs.AI, cs.CY, and cs.LG

Abstract: The recent explosion in popularity of LLMs has inspired learning engineers to incorporate them into adaptive educational tools that automatically score summary writing. Understanding and evaluating LLMs is vital before deploying them in critical learning environments, yet their unprecedented size and expanding number of parameters inhibit transparency and impede trust when they underperform. Through a collaborative user-centered design process with several learning engineers building and deploying summary scoring LLMs, we characterized fundamental design challenges and goals around interpreting their models, including aggregating large text inputs, tracking score provenance, and scaling LLM interpretability methods. To address their concerns, we developed iScore, an interactive visual analytics tool for learning engineers to upload, score, and compare multiple summaries simultaneously. Tightly integrated views allow users to iteratively revise the language in summaries, track changes in the resulting LLM scores, and visualize model weights at multiple levels of abstraction. To validate our approach, we deployed iScore with three learning engineers over the course of a month. We present a case study where interacting with iScore led a learning engineer to improve their LLM's score accuracy by three percentage points. Finally, we conducted qualitative interviews with the learning engineers that revealed how iScore enabled them to understand, evaluate, and build trust in their LLMs during deployment.


Summary

  • The paper introduces iScore, a novel tool that helps interpret how LLMs automatically score summaries.
  • Developed through a user-centered design process, it lets engineers compare LLM scores against expert benchmarks and track how scores change as summaries are revised.
  • A month-long deployment showed practical gains, including a three-percentage-point improvement in one LLM's scoring accuracy for educational summary evaluation.

Visual Analytics for Interpreting LLM Summary Scoring

Introduction to iScore

iScore is an interactive visual analytics tool that helps learning engineers interpret and evaluate the LLMs they use for automatic summary scoring. It addresses the difficulty of understanding and trusting these complex models in educational contexts, where they assess student-written summaries. Developed in close collaboration with learning engineers, iScore lets users upload, score, and compare multiple source-summary pairs simultaneously against LLM predictions. Tightly integrated visual components support iterative revision of the language in summaries, tracking of the resulting changes in LLM scores, and inspection of model weights at multiple levels of abstraction.
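As a rough illustration of the scoring step iScore wraps (a sketch, not the authors' code), a fine-tuned transformer regression model scores each source-summary pair. The checkpoint path and the single-regression-head setup below are assumptions:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical fine-tuned checkpoint; the collaborators trained
# Longformer-style summary-scoring models, but this path is illustrative.
MODEL_NAME = "path/to/finetuned-summary-scorer"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

def score_summary(summary: str, source: str) -> float:
    # The summary and its source text are encoded as one paired input,
    # the kind of large aggregated input the design process flagged.
    enc = tokenizer(summary, source, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits  # shape (1, 1) for a regression head
    return logits.squeeze().item()
```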

Design and Development Process

iScore was built through a user-centered design process that surfaced distinct operational challenges: aggregating large text inputs, tracking score provenance, and scaling LLM interpretability methods to these large models. These challenges, and the tasks derived from them, formed the backbone of iScore's design, which must support both a broad overview and a detailed inspection of how LLMs score written summaries.

Core Features of iScore

iScore organizes its functionality into three primary views:

  • Assignments Panel: Lets users upload and score multiple source-summary pairs, and supports manual revision and re-scoring so they can observe how edits affect the LLM's output.
  • Scores Dashboard: Compares LLM scores over time and against expert-scored "ground truth" data, making visible how changes to a summary shift its score.
  • Model Analysis View: Offers two LLM interpretability methods for probing model behavior, including how individual words and sentences contribute to a summary's overall score (see the sketch after this section).

These features collectively enable learning engineers to probe, understand, and trust the automated scoring processes of their LLMs, enhancing the transparency and reliability of using such models in educational applications.
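To make the Model Analysis View concrete, here is a minimal sketch of token-level attribution using integrated gradients via Captum. This is one plausible wiring, not the authors' implementation; it assumes the `model` and `tokenizer` from the scoring sketch above and uses pad tokens as the attribution baseline:

```python
import torch
from captum.attr import LayerIntegratedGradients

def forward_score(input_ids, attention_mask):
    # Return a scalar score per example so gradients flow from the score.
    out = model(input_ids=input_ids, attention_mask=attention_mask)
    return out.logits.squeeze(-1)

def token_attributions(summary: str, source: str):
    enc = tokenizer(summary, source, truncation=True, return_tensors="pt")
    input_ids, mask = enc["input_ids"], enc["attention_mask"]
    # Baseline input: all pad tokens, a common neutral reference point.
    baseline = torch.full_like(input_ids, tokenizer.pad_token_id)
    lig = LayerIntegratedGradients(forward_score, model.get_input_embeddings())
    attrs = lig.attribute(inputs=input_ids, baselines=baseline,
                          additional_forward_args=(mask,))
    per_token = attrs.sum(dim=-1).squeeze(0)       # collapse embedding dim
    per_token = per_token / per_token.abs().max()  # normalize for display
    tokens = tokenizer.convert_ids_to_tokens(input_ids.squeeze(0))
    return list(zip(tokens, per_token.tolist()))
```

Aggregating the per-token values by sentence would yield the sentence-level contributions described above.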

Case Study and Evaluation

A month-long deployment of iScore, involving learning engineers from the collaborative design team, demonstrated the tool's practical benefits in refining LLM accuracy. Through its use, one engineer improved an LLM's scoring accuracy by three percentage points, underscoring iScore's value in real-world applications. In-depth interviews with the participating engineers revealed that iScore significantly contributed to a deeper understanding of LLM behavior, facilitated rigorous model evaluation, and fostered a greater sense of trust in using LLMs for educational purposes.
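As a toy illustration of the before/after comparison behind that result (the threshold and all values below are invented, not from the paper), accuracy here is simply agreement between thresholded LLM scores and the expert "ground truth" labels:

```python
def scoring_accuracy(llm_scores, expert_labels, threshold=0.5):
    """Fraction of summaries where the thresholded LLM score
    matches the expert pass/fail label."""
    preds = [s >= threshold for s in llm_scores]
    hits = sum(p == bool(y) for p, y in zip(preds, expert_labels))
    return hits / len(expert_labels)

# Invented example values:
scores = [0.81, 0.12, 0.66, 0.34]       # LLM scores for four summaries
labels = [1, 0, 1, 1]                   # expert pass/fail judgments
print(scoring_accuracy(scores, labels))  # 0.75
```

On a real evaluation set, a revision workflow that lifts this figure by 0.03 would correspond to the three-percentage-point gain reported in the case study.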

Implications for Future Research and Tool Development

The insights gathered from iScore's development and evaluation point to several directions for future work, including built-in statistical analysis capabilities within visual analytics tools and LLM interpretability techniques that are both more comprehensive and more computationally efficient. The engineers' feedback also suggests that tools like iScore can advance responsible and ethical AI in education, paving the way for more transparent, trustworthy, and effective use of AI in learning environments.

Conclusion

iScore sits at the intersection of visual analytics, machine learning interpretability, and educational technology, giving learning engineers a robust platform to interrogate, understand, and validate the processes underlying LLM-driven summary scoring.
