
When Automated Assessment Meets Automated Content Generation: Examining Text Quality in the Era of GPTs (2309.14488v1)

Published 25 Sep 2023 in cs.CL and cs.AI

Abstract: The use of machine learning (ML) models to assess and score textual data has become increasingly pervasive in an array of contexts including natural language processing, information retrieval, search and recommendation, and credibility assessment of online content. A significant disruption at the intersection of ML and text is the emergence of text-generating large language models (LLMs) such as generative pre-trained transformers (GPTs). We empirically assess the differences in how ML-based scoring models trained on human content assess the quality of content generated by humans versus GPTs. To do so, we propose an analysis framework that encompasses essay-scoring ML models, human- and ML-generated essays, and a statistical model that parsimoniously considers the impact of the type of respondent, the prompt genre, and the ML model used for assessment. A rich testbed is utilized that encompasses 18,460 human-generated and GPT-based essays. Results of our benchmark analysis reveal that transformer pretrained language models (PLMs) score human essay quality more accurately than CNN/RNN and feature-based ML methods. Interestingly, we find that the transformer PLMs tend to score GPT-generated text 10-15% higher on average, relative to human-authored documents. Conversely, traditional deep learning and feature-based ML models score human text considerably higher. Further analysis reveals that although the transformer PLMs are exclusively fine-tuned on human text, they more prominently attend to certain tokens appearing only in GPT-generated text, possibly due to familiarity/overlap in pre-training. Our framework and results have implications for text classification settings where automated scoring of text is likely to be disrupted by generative AI.
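
The relative score gap described in the abstract (GPT-generated versus human-authored essays, per scoring model) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the record layout, the labels `transformer_plm` and `feature_based`, the helper `relative_gap`, and the toy scores are hypothetical and are not data or code from the paper, whose own statistical model also accounts for prompt genre.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy records: (scoring_model, respondent, score).
# These values are invented for illustration and are NOT results from the paper.
records = [
    ("transformer_plm", "human", 3.0),
    ("transformer_plm", "human", 3.4),
    ("transformer_plm", "gpt", 3.6),
    ("transformer_plm", "gpt", 3.8),
    ("feature_based", "human", 3.2),
    ("feature_based", "human", 3.0),
    ("feature_based", "gpt", 2.4),
    ("feature_based", "gpt", 2.6),
]


def relative_gap(records):
    """Return {scoring_model: (mean_gpt - mean_human) / mean_human}."""
    grouped = defaultdict(lambda: defaultdict(list))
    for model, respondent, score in records:
        grouped[model][respondent].append(score)
    return {
        model: (mean(by["gpt"]) - mean(by["human"])) / mean(by["human"])
        for model, by in grouped.items()
    }


gaps = relative_gap(records)
for model, gap in gaps.items():
    # A positive gap means the scorer rated GPT essays higher than human essays.
    print(f"{model}: GPT essays scored {gap:+.1%} relative to human essays")
```

A positive value means that the scoring model rated GPT-generated essays higher on average than human essays; a negative value means the reverse. This simple group comparison is only a stand-in for the fuller statistical model the abstract mentions.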

Authors (8)
  1. Marialena Bevilacqua
  2. Kezia Oketch
  3. Ruiyang Qin
  4. Will Stamey
  5. Xinyuan Zhang
  6. Yi Gan
  7. Kai Yang
  8. Ahmed Abbasi