
Piecing Together Clues: A Benchmark for Evaluating the Detective Skills of Large Language Models (2307.05113v3)

Published 11 Jul 2023 in cs.CL

Abstract: Detectives frequently engage in information detection and reasoning simultaneously when making decisions across various cases, especially when confronted with a vast amount of information. With the rapid development of large language models (LLMs), evaluating how these models identify key information and reason to answer questions has become increasingly relevant. We introduce DetectBench, a reading comprehension dataset designed to assess a model's ability to jointly perform key information detection and multi-hop reasoning over complex and implicit information. DetectBench comprises 3,928 questions, each paired with a paragraph averaging 190 tokens in length. To enhance models' detective skills, we propose the Detective Thinking Framework, which encourages models to identify all possible clues within the context before reasoning. Our experiments reveal that existing models perform poorly in both information detection and multi-hop reasoning; however, the Detective Thinking Framework alleviates this issue.
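
The abstract does not spell out the framework's mechanics, but the idea of enumerating clues before reasoning lends itself to a simple two-stage prompting pipeline. The following is a minimal sketch of that idea in Python; the `complete` helper and all prompt wording are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of a "detect clues first, then reason" prompt pipeline,
# in the spirit of the Detective Thinking Framework described above.
# `complete` is a hypothetical stand-in for any chat-completion call;
# the prompt text is illustrative, not taken from the paper.

def complete(prompt: str) -> str:
    """Placeholder for a call to an LLM API of your choice."""
    raise NotImplementedError("Wire this to a chat-completion endpoint.")

def detective_answer(context: str, question: str) -> str:
    # Stage 1: ask the model to enumerate every potentially relevant
    # clue in the passage before attempting the question.
    clues = complete(
        "Read the passage and list every clue that could bear on the "
        "question, one per line.\n\n"
        f"Passage:\n{context}\n\nQuestion: {question}"
    )
    # Stage 2: reason over the detected clues step by step, chaining
    # them together (multi-hop reasoning) to reach a final answer.
    return complete(
        f"Passage:\n{context}\n\nQuestion: {question}\n\n"
        f"Detected clues:\n{clues}\n\n"
        "Using only these clues, reason step by step and state a final answer."
    )
```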

