Investigating Automatic Scoring and Feedback using Large Language Models (2405.00602v1)

Published 1 May 2024 in cs.CL and cs.LG

Abstract: Automatic grading and feedback have long been studied using traditional machine learning and deep learning techniques with language models. With the recent accessibility of high-performing LLMs such as LLaMA-2, there is an opportunity to investigate the use of these LLMs for automatic grading and feedback generation. Despite their performance gains, LLMs require significant computational resources for fine-tuning, along with additional task-specific adjustments to enhance their performance on such tasks. To address these issues, Parameter-Efficient Fine-Tuning (PEFT) methods, such as LoRA and QLoRA, have been adopted to decrease memory and computational requirements during model fine-tuning. This paper explores the efficacy of PEFT-based quantized models, employing a classification or regression head, to fine-tune LLMs for automatically assigning continuous numerical grades to short answers and essays, as well as generating corresponding feedback. We conducted experiments on both proprietary and open-source datasets for these tasks. The results show that grade predictions from fine-tuned LLMs are highly accurate, with less than 3% error in grade percentage on average. For feedback generation, fine-tuned 4-bit quantized LLaMA-2 13B models outperform competitive base models, achieving high similarity with subject-matter-expert feedback in terms of BLEU and ROUGE scores as well as qualitative assessment. These findings provide insights into using quantization approaches to fine-tune LLMs for downstream tasks, such as automatic short-answer scoring and feedback generation, at comparatively lower cost and latency.
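As a concrete illustration of the setup the abstract describes (a 4-bit quantized LLaMA-2 base model, LoRA/QLoRA adapters, and a regression head for continuous grade prediction), the sketch below uses the Hugging Face transformers, peft, and bitsandbytes libraries. The checkpoint id, target modules, and LoRA hyperparameters are assumptions for illustration only; the paper's exact configuration is not given in the abstract.

```python
import torch
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    BitsAndBytesConfig,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization with double quantization, as in QLoRA
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_name = "meta-llama/Llama-2-13b-hf"  # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # LLaMA-2 has no pad token by default

# num_labels=1 with problem_type="regression" attaches a single-output
# head, matching continuous numerical grade prediction
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=1,
    problem_type="regression",
    quantization_config=bnb_config,
    device_map="auto",
)
model.config.pad_token_id = tokenizer.pad_token_id

# Prepare the quantized base model and add LoRA adapters on the attention
# projections; the r/alpha/dropout values here are illustrative
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="SEQ_CLS",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only adapter and head weights are trainable
```

A standard Trainer with a mean-squared-error objective (the transformers default for single-label regression) can then fine-tune only the adapter and head parameters while the 4-bit base weights stay frozen, which is what keeps memory and compute costs low.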

Authors (3)
  1. Gloria Ashiya Katuka (1 paper)
  2. Alexander Gain (1 paper)
  3. Yen-Yun Yu (7 papers)
Citations (1)
