Evaluating LLMs on Document-Based QA: Exact Answer Selection and Numerical Extraction using Cogtale dataset (2311.07878v4)

Published 14 Nov 2023 in cs.IR

Abstract: Document-based Question-Answering (QA) tasks are crucial for precise information retrieval. While some existing work evaluates LLM performance on retrieving and answering questions from documents, LLM performance on QA types that require exact answer selection from predefined options and numerical extraction has yet to be fully assessed. In this paper, we focus on this underexplored context and conduct an empirical analysis of LLMs (GPT-4 and GPT-3.5) on question types including single-choice, yes-no, multiple-choice, and number-extraction questions over documents in a zero-shot setting. We use the CogTale dataset for evaluation, which provides human expert-tagged responses, offering a robust benchmark for precision and factual grounding. We found that LLMs, particularly GPT-4, can precisely answer many single-choice and yes-no questions given relevant context, demonstrating their efficacy in information retrieval tasks. However, their performance diminishes on multiple-choice and number-extraction formats, lowering the models' overall performance and indicating that they may not yet be sufficiently reliable for such tasks. This limits the use of LLMs in applications demanding precise information extraction from documents, such as meta-analysis. These findings hinge on the assumption that the retriever furnishes the pertinent context necessary for accurate responses, emphasizing the need for further research. Our work offers a framework for ongoing dataset evaluation, ensuring that LLM applications for information retrieval and document analysis continue to meet evolving standards.
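The evaluation protocol described in the abstract — mapping a model's free-text response onto a predefined option, pulling a number out of a response, and scoring against expert-tagged answers — can be sketched roughly as below. This is an illustrative reconstruction, not the authors' actual code; all function names and the matching heuristics are assumptions.

```python
import re

def match_option(response: str, options: list) -> str:
    """Return the first predefined option mentioned in the model's
    free-text response (case-insensitive), or None if no option matches."""
    text = response.strip().lower()
    for option in options:
        if option.lower() in text:
            return option
    return None

def extract_number(response: str) -> float:
    """Pull the first numeric value (integer or decimal) out of a response,
    as needed for number-extraction questions; None if absent."""
    m = re.search(r"-?\d+(?:\.\d+)?", response)
    return float(m.group()) if m else None

def exact_match_accuracy(predictions: list, gold: list) -> float:
    """Fraction of predictions that exactly equal the expert-tagged answer."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)
```

A usage sketch: `match_option("The answer is Yes.", ["Yes", "No"])` resolves a yes-no question, `extract_number("The study enrolled 42 participants.")` handles number extraction, and `exact_match_accuracy` aggregates per-question results into the headline score.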

Authors (10)
  1. Zafaryab Rasool
  2. Stefanus Kurniawan
  3. Sherwin Balugo
  4. Scott Barnett
  5. Rajesh Vasa
  6. Courtney Chesser
  7. Benjamin M. Hampstead
  8. Sylvie Belleville
  9. Kon Mouzakis
  10. Alex Bahar-Fuchs
Citations (13)