Generative Students: Using LLM-Simulated Student Profiles to Support Question Item Evaluation (2405.11591v1)
Abstract: Evaluating the quality of automatically generated question items has been a long-standing challenge. In this paper, we leverage LLMs to simulate student profiles and generate responses to multiple-choice questions (MCQs); these simulated responses can, in turn, support question item evaluation. We propose Generative Students, a prompt architecture designed based on the KLI framework. A generative student profile is a function of the list of knowledge components the student has mastered, is confused about, or has shown no evidence of knowing. We instantiate the Generative Students concept in the subject domain of heuristic evaluation. We created 45 generative students using GPT-4 and had them respond to 20 MCQs. We found that the generative students produced logical and believable responses that were aligned with their profiles. We then compared the generative students' responses to real students' responses on the same set of MCQs and found a high correlation. Moreover, there was considerable overlap in the difficult questions identified by generative students and real students. A subsequent case study demonstrated that an instructor could improve question quality based on the signals provided by Generative Students.
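One way to picture "a generative student profile is a function of knowledge components" is a minimal sketch of how such a profile might be rendered into a system prompt. This is an illustrative assumption on our part, not the paper's actual prompt architecture; the class name, field names, and prompt wording are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class GenerativeStudentProfile:
    """KLI-style knowledge state: three lists of knowledge components (KCs)."""
    mastered: List[str] = field(default_factory=list)      # KCs the student has mastered
    confused: List[str] = field(default_factory=list)      # KCs the student is confused about
    no_evidence: List[str] = field(default_factory=list)   # KCs with no evidence of knowledge

    def to_prompt(self) -> str:
        """Render the profile as a prompt fragment for an LLM role-play request."""
        lines = ["You are simulating a student with the following knowledge state:"]
        if self.mastered:
            lines.append("Mastered knowledge components: " + ", ".join(self.mastered))
        if self.confused:
            lines.append("Confused about: " + ", ".join(self.confused))
        if self.no_evidence:
            lines.append("No evidence of knowledge of: " + ", ".join(self.no_evidence))
        lines.append("Answer each multiple-choice question as this student would.")
        return "\n".join(lines)

# Example profile in the paper's subject domain (heuristic evaluation);
# the specific KC strings here are illustrative, not taken from the paper.
profile = GenerativeStudentProfile(
    mastered=["visibility of system status"],
    confused=["error prevention vs. error recovery"],
    no_evidence=["aesthetic and minimalist design"],
)
print(profile.to_prompt())
```

Varying the three lists across 45 such profiles would yield a population of simulated students whose aggregate answer distribution per MCQ can then be compared against real-student data.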