Beyond Flesch-Kincaid: Prompt-based Metrics Improve Difficulty Classification of Educational Texts (2405.09482v2)
Abstract: Using LLMs for educational applications such as dialogue-based teaching is a rapidly growing area of interest. Effective teaching, however, requires teachers to adapt the difficulty of content and explanations to the education level of their students, and even the best LLMs today struggle to do this well. If we want to improve LLMs on this adaptation task, we need to be able to measure adaptation success reliably. However, current Static metrics for text difficulty, such as the Flesch-Kincaid Reading Ease score, are known to be crude and brittle. We therefore introduce and evaluate a new set of Prompt-based metrics for text difficulty. Based on a user study, we design Prompt-based metrics that are posed as prompts to LLMs; they leverage LLMs' general language understanding capabilities to capture more abstract and complex features than Static metrics. Regression experiments show that adding our Prompt-based metrics significantly improves text difficulty classification over Static metrics alone. Our results demonstrate the promise of using LLMs to evaluate how well text is adapted to different education levels.
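To make the contrast between the two metric families concrete, the sketch below computes the Flesch Reading Ease score (Flesch, 1948), the kind of Static metric the abstract calls crude and brittle, and shows the shape of a Prompt-based metric as a natural-language query whose answer becomes a feature. The naive syllable counter, the prompt wording, and the rating scale are illustrative assumptions, not the paper's implementation.

```python
import re

def flesch_reading_ease(text: str) -> float:
    """Flesch (1948) Reading Ease:
    206.835 - 1.015*(words/sentences) - 84.6*(syllables/words).
    Sentence splitting and syllable counting are naive, so scores are approximate."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0

    def count_syllables(word: str) -> int:
        # Count runs of vowels as a rough syllable estimate (assumption, not the paper's method).
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (len(words) / len(sentences)) - 84.6 * (syllables / len(words))


# A Prompt-based metric instead asks an LLM to rate an abstract property of the text
# (e.g. conceptual complexity) and uses the returned rating as a regression feature.
# This template is a hypothetical example, not one of the paper's prompts.
PROMPT_TEMPLATE = (
    "On a scale from 1 (elementary school) to 5 (university level), how conceptually "
    "complex is the following explanation?\n\n{text}\n\nAnswer with a single number."
)

if __name__ == "__main__":
    sample = "Photosynthesis is how plants make food. They use sunlight, water, and air."
    print(f"Flesch Reading Ease: {flesch_reading_ease(sample):.1f}")
    print(PROMPT_TEMPLATE.format(text=sample))
```

In this framing, Static metrics reduce to surface counts (sentence length, syllables per word), while Prompt-based metrics can encode judgments such as background knowledge required or abstractness of the explanation, which is what the regression experiments then combine.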