Evaluating and Optimizing Educational Content with Large Language Model Judgments (2403.02795v2)

Published 5 Mar 2024 in cs.AI and cs.CL

Abstract: Creating effective educational materials generally requires expensive and time-consuming studies of student learning outcomes. To overcome this barrier, one idea is to build computational models of student learning and use them to optimize instructional materials. However, it is difficult to model the cognitive processes of learning dynamics. We propose an alternative approach that uses large language models (LMs) as educational experts to assess the impact of various instructions on learning outcomes. Specifically, we use GPT-3.5 to evaluate the overall effect of instructional materials on different student groups and find that it can replicate well-established educational findings such as the Expertise Reversal Effect and the Variability Effect. This demonstrates the potential of LMs as reliable evaluators of educational content. Building on this insight, we introduce an instruction optimization approach in which one LM generates instructional materials using the judgments of another LM as a reward function. We apply this approach to create math word problem worksheets aimed at maximizing student learning gains. Human teachers' evaluations of these LM-generated worksheets show a significant alignment between the LM judgments and human teacher preferences. We conclude by discussing potential divergences between human and LM opinions and the resulting pitfalls of automating instructional design.
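The generate-then-judge loop in the abstract can be sketched in miniature. The following is an illustrative Python sketch, not the paper's implementation: `generate_worksheet` and `judge_learning_gain` are hypothetical stand-ins for the two LMs (in practice each would be an LM API call), and the toy reward numbers merely echo the Expertise Reversal Effect's direction (worked examples help novices more than experts).

```python
import random

# Hypothetical stand-in for the generator LM: proposes a worksheet
# described by two design knobs (worked examples, context variability).
def generate_worksheet(rng):
    return {
        "worked_examples": rng.randint(0, 4),  # number of worked examples
        "variability": rng.randint(0, 4),      # variability of problem contexts
    }

# Hypothetical stand-in for the judge LM: scores predicted learning gain.
# Illustrative numbers only, chosen to mimic the Expertise Reversal Effect:
# worked examples help novices but can hurt experts.
def judge_learning_gain(worksheet, student_level="novice"):
    w, v = worksheet["worked_examples"], worksheet["variability"]
    if student_level == "novice":
        return 0.2 * w + 0.1 * v
    return -0.1 * w + 0.15 * v

# Best-of-N optimization: sample candidates from the generator and keep
# the one the judge scores highest for the target student group.
def optimize(student_level, n_candidates=50, seed=0):
    rng = random.Random(seed)
    candidates = [generate_worksheet(rng) for _ in range(n_candidates)]
    return max(candidates, key=lambda ws: judge_learning_gain(ws, student_level))

best = optimize("novice")
print(best)
```

The paper's actual pipeline optimizes full math word problem worksheets with LM-generated text and LM judgments as the reward; this sketch only shows the search structure (generator proposals ranked by a judge).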

Authors (3)
  1. Joy He-Yueya
  2. Noah D. Goodman
  3. Emma Brunskill