Researchy Questions: A Dataset of Multi-Perspective, Decompositional Questions for LLM Web Agents (2402.17896v1)

Published 27 Feb 2024 in cs.CL and cs.AI

Abstract: Existing question answering (QA) datasets are no longer challenging to the most powerful LLMs. Traditional QA benchmarks like TriviaQA, NaturalQuestions, ELI5 and HotpotQA mainly study "known unknowns", with clear indications of both what information is missing and how to find it to answer the question. Hence, good performance on these benchmarks provides a false sense of security. A yet-unmet need of the NLP community is a bank of non-factoid, multi-perspective questions involving a great deal of unclear information needs, i.e. "unknown unknowns". We claim we can find such questions in search engine logs, which is surprising because most question-intent queries are indeed factoid. We present Researchy Questions, a dataset of search engine queries tediously filtered to be non-factoid, "decompositional" and multi-perspective. We show that users spend a lot of "effort" on these questions in terms of signals like clicks and session length, and that they are also challenging for GPT-4. We also show that "slow thinking" answering techniques, like decomposition into sub-questions, show benefit over answering directly. We release ~100k Researchy Questions, along with the ClueWeb22 URLs that were clicked.
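
The "slow thinking" comparison above amounts to decomposing a question into simpler sub-questions before answering. As a rough illustration only, and not the authors' actual pipeline, a minimal Python sketch of the two strategies might look like the following; `llm` is a hypothetical placeholder for any text-in, text-out model call (e.g., a GPT-4 API wrapper):

```python
# Hypothetical sketch: "direct" vs. "slow thinking" (decompositional)
# answering. `llm` is an assumed placeholder callable, not an API
# from the Researchy Questions release.
from typing import Callable, List


def answer_directly(llm: Callable[[str], str], question: str) -> str:
    # Fast path: a single one-shot answer with no intermediate structure.
    return llm(f"Answer the following question:\n{question}")


def answer_by_decomposition(llm: Callable[[str], str], question: str) -> str:
    # Slow path: break the multi-perspective question into sub-questions,
    # answer each one, then synthesize a final answer from the parts.
    plan = llm(
        "Decompose this question into 3-5 simpler sub-questions, "
        f"one per line:\n{question}"
    )
    sub_questions: List[str] = [q.strip() for q in plan.splitlines() if q.strip()]
    sub_answers = [llm(f"Answer concisely:\n{sq}") for sq in sub_questions]
    evidence = "\n".join(
        f"Q: {sq}\nA: {sa}" for sq, sa in zip(sub_questions, sub_answers)
    )
    return llm(
        f"Using these sub-question answers:\n{evidence}\n\n"
        f"Write a complete, multi-perspective answer to: {question}"
    )
```

The intuition is that each sub-question is closer to a factoid query than the original multi-perspective question, which is where the abstract's reported benefit over answering directly comes from.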

References (65)
  1. Query refinement prompts for closed-book long-form question answering.
  2. Anne Aula and Daniel Russell. 2008. Complex and exploratory web search.
  3. MS MARCO: A human generated machine reading comprehension dataset.
  4. Semantic parsing on Freebase from question-answer pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1533–1544, Seattle, Washington, USA. Association for Computational Linguistics.
  5. A non-factoid question-answering taxonomy. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’22, page 1196–1207, New York, NY, USA. Association for Computing Machinery.
  6. WikiHowQA: A comprehensive benchmark for multi-document non-factoid question answering. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5291–5314, Toronto, Canada. Association for Computational Linguistics.
  7. Improving language models by retrieving from trillions of tokens.
  8. Function-based question classification for general QA. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 1119–1128, Cambridge, MA. Association for Computational Linguistics.
  9. MS MARCO: A human generated machine reading comprehension dataset. ArXiv, abs/1611.09268.
  10. HybridQA: A dataset of multi-hop question answering over tabular and textual data. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1026–1036, Online. Association for Computational Linguistics.
  11. A tale of tails: Model collapse as a change of scaling laws.
  12. SearchQA: A new Q&A dataset augmented with context from a search engine.
  13. Brian Everitt. 1974. Cluster analysis. Heinemann Educational [for] the Social Science Research Council.
  14. ELI5: long form question answering. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers, pages 3558–3567. Association for Computational Linguistics.
  15. The Pile: An 800GB dataset of diverse text for language modeling.
  16. A real-world WebAgent with planning, long context understanding, and program synthesis.
  17. REALM: Retrieval-augmented language model pre-training.
  18. Struggling or exploring? disambiguating long search sessions. In Proceedings of the 7th ACM International Conference on Web Search and Data Mining, WSDM ’14, page 53–62, New York, NY, USA. Association for Computing Machinery.
  19. Supporting complex search tasks. In Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, CIKM ’14, page 829–838, New York, NY, USA. Association for Computing Machinery.
  20. Gautier Izacard and Edouard Grave. 2021. Leveraging passage retrieval with generative models for open domain question answering. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 874–880, Online. Association for Computational Linguistics.
  21. Mixtral of experts.
  22. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data, 7(3):535–547.
  23. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. ArXiv, abs/1705.03551.
  24. Daniel Kahneman. 2011. Thinking, Fast and Slow. Farrar, Straus and Giroux, New York.
  25. Dense passage retrieval for open-domain question answering.
  26. Development and evaluation of search tasks for iir experiments using a cognitive complexity framework. In Proceedings of the 2015 International Conference on The Theory of Information Retrieval, ICTIR ’15, page 101–110, New York, NY, USA. Association for Computing Machinery.
  27. Demonstrate-search-predict: Composing retrieval and language models for knowledge-intensive NLP.
  28. Decomposed prompting: A modular approach for solving complex tasks.
  29. Hurdles to progress in long-form question answering.
  30. AQuaMuSe: Automatically generating datasets for query-based multi-document summarization.
  31. Natural Questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics.
  32. Retrieval-augmented generation for knowledge-intensive NLP tasks.
  33. AlpacaEval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval.
  34. Lost in the middle: How language models use long contexts.
  35. Evaluating verifiability in generative search engines.
  36. AgentBench: Evaluating LLMs as agents.
  37. GAIA: A benchmark for general AI assistants.
  38. WebGPT: Browser-assisted question-answering with human feedback.
  39. GPT-4 technical report.
  40. ClueWeb22: 10 billion web documents with visual and semantic information.
  41. Instruction tuning with GPT-4. ArXiv, abs/2304.03277.
  42. Measuring and narrowing the compositionality gap in language models.
  43. Question decomposition improves the faithfulness of model-generated reasoning.
  44. Exploring the limits of transfer learning with a unified text-to-text transformer.
  45. RocketQAv2: A joint training method for dense passage retrieval and passage re-ranking.
  46. Enhancing retrieval-augmented large language models with iterative retrieval-generation synergy.
  47. C. T. Stayton. 2015. What does convergent evolution mean? the interpretation of convergence and its implications in the search for limits to evolution. Interface Focus, 5(6):20150039.
  48. Nassim Nicholas Taleb. 2008. The Black Swan. Penguin Books, Harlow, England.
  49. MuSiQue: Multihop questions via single-hop question composition. Transactions of the Association for Computational Linguistics.
  50. Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions.
  51. NASA Program Management and Procurement Procedures and Practices: Hearings Before the Subcommittee on Space Science and Applications of the Committee on Science and Technology, U.S. House of Representatives, Ninety-seventh Congress, First Session. U.S. Government Printing Office, Washington, D.C.
  52. Chain-of-thought prompting elicits reasoning in large language models.
  53. AI-generated content (AIGC): A survey.
  54. AutoGen: Enabling next-gen LLM applications via multi-agent conversation.
  55. Approximate nearest neighbor negative contrastive learning for dense text retrieval.
  56. HotpotQA: A dataset for diverse, explainable multi-hop question answering.
  57. Tree of thoughts: Deliberate problem solving with large language models.
  58. ReAct: Synergizing reasoning and acting in language models.
  59. Self-rewarding language models.
  60. Beam retrieval: General end-to-end retrieval for multi-hop question answering.
  61. Character-level convolutional networks for text classification.
  62. Learning to decompose and organize complex tasks. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2726–2735, Online. Association for Computational Linguistics.
  63. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena.
  64. Why does ChatGPT fall short in providing truthful answers?
  65. Don't make your LLM an evaluation benchmark cheater.
Authors (8)
  1. Corby Rosset (21 papers)
  2. Ho-Lam Chung (13 papers)
  3. Guanghui Qin (16 papers)
  4. Ethan C. Chau (5 papers)
  5. Zhuo Feng (24 papers)
  6. Ahmed Awadallah (27 papers)
  7. Jennifer Neville (57 papers)
  8. Nikhil Rao (34 papers)
Citations (4)