Automatic Generation and Evaluation of Reading Comprehension Test Items with Large Language Models (2404.07720v2)

Published 11 Apr 2024 in cs.CL

Abstract: Reading comprehension tests are used in a variety of applications, ranging from education to assessing the comprehensibility of simplified texts. However, creating such tests manually and ensuring their quality is difficult and time-consuming. In this paper, we explore how LLMs can be used to generate and evaluate multiple-choice reading comprehension items. To this end, we compiled a dataset of German reading comprehension items and developed a new protocol for human and automatic evaluation, including a metric we call text informativity, which is based on guessability and answerability. We then used this protocol and the dataset to evaluate the quality of items generated by Llama 2 and GPT-4. Our results suggest that both models are capable of generating items of acceptable quality in a zero-shot setting, but GPT-4 clearly outperforms Llama 2. We also show that LLMs can be used for automatic evaluation by eliciting item responses from them. In this scenario, evaluation results with GPT-4 were the most similar to those of the human annotators. Overall, zero-shot generation with LLMs is a promising approach for generating and evaluating reading comprehension test items, in particular for languages without large amounts of available data.
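
The abstract describes text informativity as a metric built on guessability and answerability but does not spell out how it is computed. As a minimal illustrative sketch only, and not the paper's exact definition, one plausible reading is: guessability is the accuracy of item responses elicited without the source text, answerability is the accuracy with the text, and informativity is the gap between the two. The function and variable names below are assumptions for illustration.

# Hypothetical sketch of an informativity score based on guessability and
# answerability; the combination rule is an assumption, not the paper's definition.

def accuracy(responses, answer_keys):
    """Fraction of item responses that match the answer key."""
    return sum(r == k for r, k in zip(responses, answer_keys)) / len(answer_keys)

def text_informativity(responses_with_text, responses_without_text, answer_keys):
    """Assumed form: answerability (accuracy with the text) minus
    guessability (accuracy without the text)."""
    answerability = accuracy(responses_with_text, answer_keys)
    guessability = accuracy(responses_without_text, answer_keys)
    return answerability - guessability

# Toy example: four multiple-choice items with options A-D.
keys = ["B", "D", "A", "C"]
with_text = ["B", "D", "A", "B"]      # responses elicited after reading the text
without_text = ["B", "A", "C", "B"]   # responses elicited without the text (guessing)
print(text_informativity(with_text, without_text, keys))  # 0.75 - 0.25 = 0.5

Under these assumptions, an item that can only be answered correctly after reading the text scores high, while an item that is easily guessed from world knowledge alone contributes little, which is why such a gap can serve as a proxy for how informative the text is for the test taker.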
