
TriviaHG: A Dataset for Automatic Hint Generation from Factoid Questions (2403.18426v2)

Published 27 Mar 2024 in cs.CL

Abstract: Individuals increasingly turn to dialogues with LLMs to find answers to their questions. When answers are so readily accessible, stimulating and preserving human cognitive abilities and ensuring that humans maintain good reasoning skills become crucial. This study addresses these needs by proposing hints (instead of, or before, final answers) as a viable solution. We introduce a framework for automatic hint generation for factoid questions and employ it to construct TriviaHG, a novel large-scale dataset of 160,230 hints corresponding to 16,645 questions from the TriviaQA dataset. We also present an automatic evaluation method that measures the Convergence and Familiarity quality attributes of hints. To evaluate the TriviaHG dataset and the proposed evaluation method, we enlisted 10 annotators to rate 2,791 hints and tasked 6 humans with answering questions using the provided hints. The effectiveness of hints varied with answer difficulty: success rates were 96%, 78%, and 36% for questions with easy, medium, and hard answers, respectively. Moreover, the proposed automatic evaluation methods correlated strongly with the annotators' judgments. The findings highlight three key insights: the facilitative role of hints in resolving unknown questions, the dependence of hint quality on answer difficulty, and the feasibility of automatic evaluation methods for hint assessment.
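The per-difficulty success rates reported in the abstract amount to a simple aggregation over hint-assisted answering attempts. Below is a minimal illustrative sketch of that aggregation; the function name and the record format (difficulty label, correctness flag) are hypothetical, and the toy data is constructed only to illustrate how the 96%/78%/36% figures would arise from such records, not to reproduce the study's actual annotations.

```python
from collections import defaultdict

def success_rates(attempts):
    """Aggregate hint-assisted answering attempts into per-difficulty
    success rates. `attempts` is an iterable of (difficulty, correct)
    pairs, e.g. ("easy", True)."""
    totals = defaultdict(int)    # attempts per difficulty bucket
    correct = defaultdict(int)   # correct answers per difficulty bucket
    for difficulty, is_correct in attempts:
        totals[difficulty] += 1
        correct[difficulty] += int(is_correct)
    return {d: correct[d] / totals[d] for d in totals}

# Hypothetical toy data: 25 easy, 50 medium, 25 hard attempts whose
# correct/incorrect split yields rates matching the reported figures.
attempts = (
    [("easy", True)] * 24 + [("easy", False)] * 1
    + [("medium", True)] * 39 + [("medium", False)] * 11
    + [("hard", True)] * 9 + [("hard", False)] * 16
)
rates = success_rates(attempts)
```

The aggregation is deliberately label-agnostic: any difficulty scheme (e.g. finer-grained buckets) works unchanged, since buckets are created lazily by `defaultdict`.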
