Unveiling the Tapestry of Automated Essay Scoring: A Comprehensive Investigation of Accuracy, Fairness, and Generalizability (2401.05655v1)

Published 11 Jan 2024 in cs.CL

Abstract: Automatic Essay Scoring (AES) is a well-established educational pursuit that employs machine learning to evaluate student-authored essays. While much effort has been made in this area, current research primarily focuses on either (i) boosting the predictive accuracy of an AES model for a specific prompt (i.e., developing prompt-specific models), which often relies heavily on labeled data from the same target prompt; or (ii) assessing the applicability of AES models developed on non-target prompts to the intended target prompt (i.e., developing AES models in a cross-prompt setting). Given the inherent bias in machine learning and its potential impact on marginalized groups, it is imperative to investigate whether such bias exists in current AES methods and, if identified, how it interacts with an AES model's accuracy and generalizability. Thus, our study aimed to uncover the intricate relationship between an AES model's accuracy, fairness, and generalizability, contributing practical insights for developing effective AES models in real-world education. To this end, we meticulously selected nine prominent AES methods and evaluated their performance using seven metrics on an open-source dataset, which contains over 25,000 essays and demographic information about students, such as gender, English language learner status, and economic status. Through extensive evaluations, we demonstrated that: (1) prompt-specific models tend to outperform their cross-prompt counterparts in terms of predictive accuracy; (2) prompt-specific models frequently exhibit greater bias with respect to students' economic status than cross-prompt models; (3) in the pursuit of generalizability, traditional machine learning models coupled with carefully engineered features hold greater potential for achieving both high accuracy and fairness than complex neural network models.
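The abstract does not enumerate the seven evaluation metrics, but quadratic weighted kappa (QWK) is the de facto accuracy measure in AES, and a group-wise QWK gap is one common way to quantify scoring bias across demographic groups. The sketch below illustrates, under those assumptions only, how a cross-prompt evaluation split and an economic-status fairness gap might be computed; all column names and data are hypothetical, not taken from the paper.

```python
# Minimal sketch: QWK accuracy on a held-out prompt plus a group-wise
# fairness gap. QWK is assumed here as the accuracy metric; the paper's
# actual seven metrics are not listed in the abstract.
import pandas as pd
from sklearn.metrics import cohen_kappa_score

def qwk(y_true, y_pred):
    """Quadratic weighted kappa between human and model scores."""
    return cohen_kappa_score(y_true, y_pred, weights="quadratic")

def group_fairness_gap(df, group_col):
    """Largest difference in QWK across demographic groups (hypothetical measure)."""
    scores = [
        qwk(g["human_score"], g["model_score"])
        for _, g in df.groupby(group_col)
    ]
    return max(scores) - min(scores)

# Toy data. Prompt-specific setting: train and test essays share a prompt.
# Cross-prompt setting: hold out one prompt entirely for testing.
essays = pd.DataFrame({
    "prompt_id":   [1, 1, 1, 2, 2, 2],
    "human_score": [3, 4, 2, 5, 3, 4],
    "model_score": [3, 3, 2, 4, 3, 4],
    "econ_status": ["low", "high", "low", "high", "low", "high"],
})

cross_prompt_test = essays[essays["prompt_id"] == 2]
print("QWK on held-out prompt:",
      qwk(cross_prompt_test["human_score"], cross_prompt_test["model_score"]))
print("Fairness gap by economic status:",
      group_fairness_gap(essays, "econ_status"))
```

Holding out an entire prompt, rather than a random subset of essays, is what distinguishes the cross-prompt setting from the prompt-specific one, and is why finding (1) above, the accuracy advantage of prompt-specific models, is expected.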

Authors (6)
  1. Kaixun Yang (7 papers)
  2. Mladen Raković (5 papers)
  3. Yuyang Li (22 papers)
  4. Quanlong Guan (8 papers)
  5. Dragan Gašević (32 papers)
  6. Guanliang Chen (11 papers)
Citations (3)