Chain of Evidences and Evidence to Generate: Prompting for Context Grounded and Retrieval Augmented Reasoning (2401.05787v2)

Published 11 Jan 2024 in cs.CL

Abstract: While chain-of-thoughts (CoT) prompting has revolutionized how LLMs perform reasoning tasks, its current methods and variations (e.g., Self-consistency, ReAct, Reflexion, Tree-of-Thoughts (ToT), Cumulative Reasoning (CR)) suffer from limitations such as limited context grounding, hallucinated or inconsistent output generation, and iterative sluggishness. To overcome these challenges, we introduce a novel mono/dual-step zero-shot prompting framework built upon two unique strategies: Chain of Evidences (CoE) and Evidence to Generate (E2G). Instead of relying on unverified reasoning claims, our approach leverages the power of "evidence for decision making" by first focusing exclusively on the thought sequences explicitly mentioned in the context; these then serve as extracted evidence, guiding the LLM's output generation with greater precision and efficiency. This simple yet potent approach unlocks the full potential of chain-of-thoughts prompting, facilitating faster, more reliable, and contextually aware reasoning in LLMs. Our framework consistently achieves remarkable results across various knowledge-intensive reasoning and generation tasks, surpassing baseline approaches with state-of-the-art LLMs. For instance, (i) on the LogiQA benchmark using GPT-4, CoE achieves a new state-of-the-art accuracy of 53.8%, surpassing CoT by 18%, ToT by 11%, and CR by 9%; (ii) CoE with PaLM-2 outperforms the variable-shot performance of Gemini Ultra by 0.9 F1 points, achieving an F1 score of 83.3 on DROP. We release our prompts and outputs on these benchmarks as a new instruction tuning dataset for future research at https://huggingface.co/datasets/kagnlp/Chain-of-Evidences/.
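The dual-step flow the abstract describes (extract evidence explicitly present in the context, then generate an answer grounded only in that evidence) can be sketched as two chained prompts. This is a minimal illustration, not the authors' released prompts: the prompt wording and the injected `generate` callable are assumptions.

```python
# Hedged sketch of a dual-step evidence-then-generate prompting flow.
# The prompt phrasing below is illustrative; the paper's actual prompts
# are in the linked Hugging Face dataset.

def evidence_prompt(context: str, question: str) -> str:
    # Step 1 (evidence phase): ask only for sentences that are
    # explicitly stated in the context, rather than free-form reasoning.
    return (
        "List only the sentences from the context that are explicitly "
        "stated in it and relevant to the question.\n"
        f"Context: {context}\nQuestion: {question}\nEvidence:"
    )

def answer_prompt(evidence: str, question: str) -> str:
    # Step 2 (generate phase): condition the final answer on the
    # extracted evidence alone.
    return (
        "Answer the question using only the evidence below.\n"
        f"Evidence: {evidence}\nQuestion: {question}\nAnswer:"
    )

def e2g(context: str, question: str, generate) -> str:
    # `generate` is any text-in/text-out LLM call (e.g. an API client);
    # it is injected here so the sketch stays model-agnostic.
    evidence = generate(evidence_prompt(context, question))
    return generate(answer_prompt(evidence, question))
```

The mono-step variant mentioned in the abstract would merge both instructions into a single prompt instead of making two model calls.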

References (53)
  1. Evidentiality-guided generation for knowledge-intensive NLP tasks. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2226–2243, Seattle, United States. Association for Computational Linguistics.
  2. Selection-inference: Exploiting large language models for interpretable logical reasoning. arXiv preprint arXiv:2205.09712.
  3. Wizard of Wikipedia: Knowledge-powered conversational agents. In International Conference on Learning Representations.
  4. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2368–2378, Minneapolis, Minnesota. Association for Computational Linguistics.
  5. ELI5: Long form question answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3558–3567, Florence, Italy. Association for Computational Linguistics.
  6. PAL: Program-aided language models. In International Conference on Machine Learning, pages 10764–10799. PMLR.
  7. REALM: Retrieval-augmented language model pre-training. In International Conference on Machine Learning. JMLR.org.
  8. Exploring human-like translation strategy with large language models. arXiv preprint arXiv:2305.04118.
  9. Large language models can self-improve. arXiv preprint arXiv:2210.11610.
  10. Gautier Izacard and Edouard Grave. 2021. Leveraging passage retrieval with generative models for open domain question answering. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 874–880, Online. Association for Computational Linguistics.
  11. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611, Vancouver, Canada. Association for Computational Linguistics.
  12. Maieutic prompting: Logically consistent reasoning with recursive explanations. arXiv preprint arXiv:2205.11822.
  13. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6769–6781, Online. Association for Computational Linguistics.
  14. Demonstrate-Search-Predict: Composing retrieval and language models for knowledge-intensive NLP. arXiv preprint arXiv:2212.14024.
  15. Language models can solve computer tasks. arXiv preprint arXiv:2303.17491.
  16. Self-generated in-context learning: Leveraging auto-regressive language models as a demonstration generator. arXiv preprint arXiv:2206.08082.
  17. Large language models are zero-shot reasoners. Advances in neural information processing systems, 35:22199–22213.
  18. Natural questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:452–466.
  19. The power of scale for parameter-efficient prompt tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3045–3059, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
  20. Retrieval-Augmented Generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems, volume 33, pages 9459–9474.
  21. Self-prompting large language models for open-domain QA. arXiv preprint arXiv:2212.08635.
  22. LLM+P: Empowering large language models with optimal planning proficiency. arXiv preprint arXiv:2304.11477.
  23. LogiQA: A challenge dataset for machine reading comprehension with logical reasoning. arXiv preprint arXiv:2007.08124.
  24. Lost in the middle: How language models use long contexts. arXiv preprint arXiv:2307.03172.
  25. NeuroLogic A*esque decoding: Constrained text generation with lookahead heuristics. arXiv preprint arXiv:2112.08726.
  26. Self-refine: Iterative refinement with self-feedback. arXiv preprint arXiv:2303.17651.
  27. BioASQ at CLEF2023: The eleventh edition of the large-scale biomedical semantic indexing and question answering challenge. In Advances in Information Retrieval.
  28. Show your work: Scratchpads for intermediate computation with language models. arXiv preprint arXiv:2112.00114.
  29. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.
  30. Fact-driven logical reasoning. CoRR, abs/2105.10334.
  31. Md Rizwan Parvez and Kai-Wei Chang. 2021. Evaluating the values of sources in transfer learning. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5084–5116.
  32. Retrieval enhanced data augmentation for question answering on privacy policies. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 201–210, Dubrovnik, Croatia. Association for Computational Linguistics.
  33. REFINER: Reasoning feedback on intermediate representations. arXiv preprint arXiv:2304.01904.
  34. KILT: a benchmark for knowledge intensive language tasks. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2523–2544, Online. Association for Computational Linguistics.
  35. Automatic prompt optimization with "gradient descent" and beam search. arXiv preprint arXiv:2305.03495.
  36. Multitask prompted training enables zero-shot task generalization. arXiv preprint arXiv:2110.08207.
  37. Reflexion: an autonomous agent with dynamic memory and self-reflection. arXiv preprint arXiv:2303.11366.
  38. Retrieval augmentation reduces hallucination in conversation. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 3784–3803, Punta Cana, Dominican Republic. Association for Computational Linguistics.
  39. Recitation-augmented language models. arXiv preprint arXiv:2210.01296.
  40. FEVER: a large-scale dataset for fact extraction and VERification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 809–819, New Orleans, Louisiana. Association for Computational Linguistics.
  41. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171.
  42. Learning to filter context for retrieval-augmented generation. arXiv preprint arXiv:2311.08377.
  43. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652.
  44. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837.
  45. Decomposition enhances reasoning via self-evaluation guided decoding. arXiv preprint arXiv:2305.00633.
  46. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380, Brussels, Belgium. Association for Computational Linguistics.
  47. Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601.
  48. ReAct: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629.
  49. Large language models as analogical reasoners. arXiv preprint arXiv:2310.01714.
  50. STaR: Bootstrapping reasoning with reasoning. Advances in Neural Information Processing Systems, 35:15476–15488.
  51. Beam retrieval: General end-to-end retrieval for multi-hop question answering. arXiv preprint arXiv:2308.08973.
  52. Cumulative reasoning with large language models. arXiv preprint arXiv:2308.04371.
  53. Solving math word problem via cooperative reasoning induced language models. arXiv preprint arXiv:2210.16257.