SimGrade: Using Code Similarity Measures for More Accurate Human Grading (2403.14637v1)

Published 19 Feb 2024 in cs.CY

Abstract: While the use of programming problems on exams is a common form of summative assessment in CS courses, grading such exam problems can be a difficult and inconsistent process. Through an analysis of historical grading patterns, we show that inaccurate and inconsistent grading of free-response programming problems is widespread in CS1 courses. These inconsistencies necessitate the development of methods to ensure fairer and more accurate grading. In subsequent analysis of this historical exam data, we demonstrate that graders assign scores to a student submission more accurately when they have previously seen another submission similar to it. As a result, we hypothesize that we can improve exam grading accuracy by ensuring that each submission a grader sees is similar to at least one submission they have previously seen. We propose several algorithms for (1) assigning student submissions to graders, and (2) ordering submissions to maximize the probability that a grader has previously seen a similar solution, leveraging distributed representations of student code to measure similarity between submissions. Finally, we demonstrate in simulation that these algorithms achieve higher grading accuracy than the current standard random assignment process used for grading.
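The abstract's core mechanism lends itself to a short illustration. Below is a minimal sketch, in Python, of the ordering idea in (2): greedily sequence submissions so that each one is as similar as possible to something the grader has already seen. It assumes submissions are already embedded as fixed-length vectors (e.g., code2vec-style representations) and uses cosine similarity; the function name, the greedy rule, and the choice of starting submission are illustrative assumptions, not the authors' exact algorithm.

    import numpy as np

    def greedy_similarity_ordering(embeddings: np.ndarray) -> list[int]:
        """Order submissions so each one is maximally similar to some
        submission appearing earlier in the ordering.

        embeddings: (n_submissions, dim) array of code embeddings
        (assumed precomputed, e.g., code2vec-style vectors).
        Returns a permutation of submission indices.
        """
        # Normalize rows so that dot products are cosine similarities.
        norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
        unit = embeddings / np.clip(norms, 1e-12, None)

        n = unit.shape[0]
        order = [0]                     # arbitrary starting submission
        remaining = set(range(1, n))
        # best_sim[i] = highest similarity of i to any already-seen submission
        best_sim = unit @ unit[0]

        while remaining:
            # Greedily pick the unseen submission closest to something seen.
            nxt = max(remaining, key=lambda i: best_sim[i])
            remaining.remove(nxt)
            order.append(nxt)
            best_sim = np.maximum(best_sim, unit @ unit[nxt])
        return order

Step (1), assigning submissions to graders, could then partition such an ordering so that similar submissions land with the same grader; the paper's actual assignment algorithms and their simulated accuracy gains are detailed in the full text.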

Authors (4)
  1. Sonja Johnson-Yu (2 papers)
  2. Nicholas Bowman (1 paper)
  3. Mehran Sahami (5 papers)
  4. Chris Piech (33 papers)