
Moderating effects of pre-trained and zero-shot ML on automatic scoring performance

Determine how pre-trained language models (e.g., BERT, fine-tuned GPT-3.5) and zero-shot learning approaches (e.g., Matching Exemplar as Next Sentence Prediction, MeNSP) moderate the performance of machine learning-based automatic scoring systems for science assessments, specifically in terms of machine-human score agreements and scoring accuracy across tasks and contexts.
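For context, machine-human score agreement in this setting is typically quantified with chance-corrected statistics such as Cohen's kappa alongside raw accuracy. The minimal Python sketch below (using scikit-learn; the score arrays are illustrative placeholders, not data from the chapter) shows how such agreement metrics are computed between human- and machine-assigned scores.

```python
# Minimal sketch: quantifying machine-human score agreement.
# The score arrays below are illustrative placeholders, not data from the chapter.
from sklearn.metrics import accuracy_score, cohen_kappa_score

human_scores   = [0, 1, 2, 2, 1, 0, 2, 1]  # hypothetical expert-assigned scores
machine_scores = [0, 1, 2, 1, 1, 0, 2, 2]  # hypothetical model-assigned scores

accuracy = accuracy_score(human_scores, machine_scores)
kappa = cohen_kappa_score(human_scores, machine_scores)                     # chance-corrected agreement
qwk = cohen_kappa_score(human_scores, machine_scores, weights="quadratic")  # penalizes larger disagreements more

print(f"accuracy={accuracy:.2f}, kappa={kappa:.2f}, quadratic-weighted kappa={qwk:.2f}")
```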


Background

Within the proposed framework for factors moderating machine-human agreements in ML-based automatic scoring, technical features such as algorithm type and attribute abstraction are identified as crucial contributors. Earlier studies have examined traditional supervised algorithms and feature engineering, while more recent work introduces pre-trained models (e.g., BERT, fine-tuned ChatGPT) and zero-shot methods (e.g., MeNSP).

Despite these advances, the chapter notes that the latest ML approaches have not been thoroughly investigated regarding their influence on assessment performance, leaving it unclear to what extent, and under what conditions, these algorithms improve or otherwise alter machine-human agreement and accuracy when scoring complex, open-ended science responses.
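As a rough illustration of the zero-shot idea, MeNSP reframes scoring as BERT's next-sentence-prediction task: a student response is paired with graded exemplars and assigned the score of the exemplar it most plausibly "follows." The sketch below, using the Hugging Face transformers library, conveys this general pairing logic under assumed exemplar and response texts; it is not the authors' exact implementation.

```python
# Rough sketch of MeNSP-style zero-shot scoring via next-sentence prediction.
# Exemplar and response texts are invented for illustration only.
import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")
model.eval()

# One hypothetical exemplar response per score level.
exemplars = {
    2: "The ice melts because thermal energy transfers from the warmer air to the ice.",
    1: "The ice melts because the air is hot.",
    0: "The ice melts because it wants to become water.",
}
student_response = "Heat moves from the surrounding air into the ice, so it melts."

def match_probability(exemplar: str, response: str) -> float:
    """Probability (per BERT's NSP head) that the response 'follows' the exemplar."""
    inputs = tokenizer(exemplar, response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits  # shape (1, 2); index 0 = "is next sentence"
    return torch.softmax(logits, dim=-1)[0, 0].item()

# Assign the score whose exemplar the response matches best.
scores = {level: match_probability(text, student_response) for level, text in exemplars.items()}
predicted_score = max(scores, key=scores.get)
print(scores, "->", predicted_score)
```

Because no task-specific fine-tuning is involved, this zero-shot route differs from the fine-tuned BERT or GPT-3.5 pipelines mentioned above, which is part of why its moderating effect on scoring performance remains an open question.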

References

Although several technical features have been examined by Zhai, Shi et al. (2021), the most updated ML, such as pre-trained or zero-shot approaches, have not been thoroughly investigated. As such, little is currently known about how the most updated ML algorithms moderate machine-based assessment performance.

AI and Machine Learning for Next Generation Science Assessments (2405.06660 - Zhai, 23 Apr 2024) in Section: A Framework Accounting for Automatic Scoring Accuracy