Automated Assessment of Students' Code Comprehension using LLMs (2401.05399v1)

Published 19 Dec 2023 in cs.CY, cs.AI, and cs.CL

Abstract: Assessing students' answers, and in particular natural language answers, is a crucial challenge in the field of education. Advances in machine learning, including transformer-based models such as large language models (LLMs), have led to significant progress on various natural language tasks. Nevertheless, amidst the growing trend of evaluating LLMs across diverse tasks, evaluating LLMs in the realm of automated answer assessment has not received much attention. To address this gap, we explore the potential of using LLMs for automated assessment of students' short and open-ended answers. In particular, we use LLMs to compare students' explanations with expert explanations in the context of line-by-line explanations of computer programs. For comparison purposes, we assess both LLMs and encoder-based Semantic Textual Similarity (STS) models on the task of judging the correctness of students' explanations of computer code. Our findings indicate that LLMs, when prompted in few-shot and chain-of-thought settings, perform comparably to fine-tuned encoder-based models in evaluating students' short answers in the programming domain.
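The abstract contrasts two assessment strategies: an encoder-based STS baseline that embeds student and expert explanations and compares them, and an LLM prompted with few-shot examples and chain-of-thought reasoning. The sketch below illustrates both under stated assumptions; the checkpoint name, example explanations, similarity threshold, and prompt wording are illustrative choices, not details taken from the paper.

```python
# Minimal sketch of the two approaches the abstract compares.
# Assumptions (not from the paper): the Sentence-BERT checkpoint,
# the example texts, the 0.7 threshold, and the prompt template.

from sentence_transformers import SentenceTransformer, util

# --- Encoder-based STS baseline: embed both explanations and score
# --- the student's answer by cosine similarity to the expert's.
sts_model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed checkpoint

expert = "The loop iterates over the array and accumulates the sum of its elements."
student = "It goes through every element and adds them up into a total."

emb = sts_model.encode([expert, student], convert_to_tensor=True)
similarity = util.cos_sim(emb[0], emb[1]).item()
print(f"cosine similarity: {similarity:.3f}")
# A simple grading rule thresholds the score (0.7 is a hypothetical cutoff):
print("correct" if similarity >= 0.7 else "incorrect")

# --- LLM with few-shot + chain-of-thought prompting: the model is shown a
# --- worked example, reasons step by step, then emits a correctness label.
FEW_SHOT_COT_PROMPT = """\
You grade student explanations of lines of code against expert explanations.

Expert: "The condition checks whether the index is still within the array bounds."
Student: "It makes sure we don't go past the end of the array."
Reasoning: Both describe a bounds check on the index, so the meanings match.
Label: correct

Expert: "{expert}"
Student: "{student}"
Reasoning:"""

prompt = FEW_SHOT_COT_PROMPT.format(expert=expert, student=student)
# `prompt` would be sent to an LLM completion/chat API; the returned
# reasoning plus the final "Label:" line is parsed for the judgment.
print(prompt)
```

The contrast in the sketch mirrors the paper's comparison: the STS route needs a fine-tuned encoder and a calibrated threshold, while the LLM route trades that for prompt design, with the few-shot example and explicit reasoning step standing in for task-specific training.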

Authors (4)
  1. Priti Oli (6 papers)
  2. Rabin Banjade (6 papers)
  3. Jeevan Chapagain (4 papers)
  4. Vasile Rus (6 papers)
Citations (2)