Question Difficulty Ranking for Multiple-Choice Reading Comprehension (2404.10704v1)
Abstract: Multiple-choice (MC) tests are an efficient way to assess English learners, and it is useful for test creators to rank candidate MC questions by difficulty during exam curation. Typically, difficulty is determined by having human test takers trial the questions in a pretesting stage, which is expensive and does not scale. We therefore explore automated approaches to rank MC questions by difficulty. Since limited data is available for explicitly training a system to predict difficulty scores, we compare task transfer and zero-shot approaches: task transfer adapts level classification and reading comprehension systems for difficulty ranking, while zero-shot prompting of instruction-finetuned LLMs contrasts absolute assessment against comparative assessment. We find that level classification transfers better than reading comprehension. Furthermore, zero-shot comparative assessment is more effective at difficulty ranking than both absolute assessment and the task transfer approaches, achieving a Spearman's correlation of 40.4%. Combining the systems further boosts the correlation.
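To make the comparative-assessment idea concrete, below is a minimal sketch of how pairwise LLM judgements could be aggregated into a difficulty ranking and scored with Spearman's rank correlation. The win-count aggregation, the toy data, and the `rank_by_win_count` helper are illustrative assumptions, not the paper's exact pipeline; in practice each judgement would come from prompting an instruction-finetuned LLM with two questions and asking which is harder.

```python
# Minimal sketch: rank questions by difficulty from pairwise comparisons
# and evaluate the ranking with Spearman's rank correlation.
# Assumption: pairwise_judgements[(i, j)] is True when an LLM judged
# question i harder than question j; win counts serve as a simple
# aggregation (the paper may aggregate comparisons differently).
from itertools import permutations

from scipy.stats import spearmanr


def rank_by_win_count(pairwise_judgements: dict[tuple[int, int], bool],
                      num_questions: int) -> list[int]:
    """Return a predicted difficulty score per question (higher = harder)."""
    wins = [0] * num_questions
    for (i, j), i_harder in pairwise_judgements.items():
        wins[i if i_harder else j] += 1
    return wins


if __name__ == "__main__":
    # Toy example with 4 questions and their true (pretested) difficulties.
    true_difficulty = [0.2, 0.8, 0.5, 0.9]

    # Hypothetical LLM judgements for every ordered pair (i, j), i != j.
    judgements = {
        (i, j): true_difficulty[i] > true_difficulty[j]
        for i, j in permutations(range(4), 2)
    }

    predicted = rank_by_win_count(judgements, num_questions=4)
    corr, _ = spearmanr(predicted, true_difficulty)
    print(f"Spearman's correlation: {corr:.3f}")
```

With real LLM judgements the comparisons are typically noisy and sometimes inconsistent, so the aggregation step (and how many of the possible pairs to query) becomes a meaningful design choice.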