Automated Text Scoring in the Age of Generative AI for the GPU-poor (2407.01873v1)
Abstract: Current research on generative language models (GLMs) for automated text scoring (ATS) has focused almost exclusively on querying proprietary models via Application Programming Interfaces (APIs). Yet such practices raise issues around transparency and security, and they offer little in the way of efficiency or customizability. With the recent proliferation of smaller, open-source models, there is now the option to explore GLMs on computers equipped with modest, consumer-grade hardware, that is, for the "GPU poor." In this study, we analyze the performance and efficiency of open-source, small-scale GLMs for ATS. Results show that GLMs can be fine-tuned to achieve adequate, though not state-of-the-art, performance. In addition to ATS, we take small steps towards analyzing models' capacity for generating feedback by prompting GLMs to explain their scores. Model-generated feedback shows promise, but requires more rigorous evaluation focused on targeted use cases.
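To make the "GPU poor" setting concrete, below is a minimal sketch of the kind of quantized, parameter-efficient (QLoRA-style) fine-tuning that allows a small open-source GLM to be trained for scoring on a single consumer GPU. The base model, prompt format, dataset, and hyperparameters are illustrative assumptions for this sketch, not the paper's exact configuration.

```python
# Minimal sketch: QLoRA-style fine-tuning of a small open-source GLM for
# automated text scoring on a single consumer GPU. The base model, prompt
# format, and hyperparameters below are illustrative assumptions, not the
# paper's exact configuration.
import torch
from datasets import Dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

model_id = "mistralai/Mistral-7B-v0.1"  # any small open-weights model

# Load the frozen base weights in 4-bit so the model fits on a consumer GPU.
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb, device_map="auto")
model = prepare_model_for_kbit_training(model)

# Train only low-rank adapters (LoRA); the quantized base stays frozen.
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM"))

def to_features(ex):
    # Cast scoring as generation: the model learns to emit the score token(s).
    text = f"Essay:\n{ex['essay']}\n\nScore: {ex['score']}{tokenizer.eos_token}"
    return tokenizer(text, truncation=True, max_length=1024)

# Hypothetical training pair; replace with a real (response, score) dataset.
train = Dataset.from_list(
    [{"essay": "Placeholder student response.", "score": 2}]
).map(to_features, remove_columns=["essay", "score"])

Trainer(
    model=model,
    args=TrainingArguments(output_dir="ats-qlora",
                           per_device_train_batch_size=1,
                           gradient_accumulation_steps=8,
                           learning_rate=2e-4, num_train_epochs=2,
                           bf16=True, logging_steps=10),
    train_dataset=train,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```

Framing scoring as text generation also makes the feedback step natural: after fine-tuning, the same model can be prompted with a response, its assigned score, and an instruction to explain that score, which is the kind of feedback generation the abstract describes exploring.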