
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Horizon Generation (2403.05313v1)

Published 8 Mar 2024 in cs.CL and cs.AI

Abstract: We explore how iterative revising a chain of thoughts with the help of information retrieval significantly improves LLMs' reasoning and generation ability in long-horizon generation tasks, while hugely mitigating hallucination. In particular, the proposed method -- retrieval-augmented thoughts (RAT) -- revises each thought step one by one with retrieved information relevant to the task query, the current and the past thought steps, after the initial zero-shot CoT is generated. Applying RAT to GPT-3.5, GPT-4, and CodeLLaMA-7b substantially improves their performances on various long-horizon generation tasks; on average of relatively increasing rating scores by 13.63% on code generation, 16.96% on mathematical reasoning, 19.2% on creative writing, and 42.78% on embodied task planning. The demo page can be found at https://craftjarvis.github.io/RAT


Summary

  • The paper introduces RAT, which combines chain-of-thought reasoning with retrieval mechanisms to address long-horizon generation challenges.
  • It employs iterative thought revision using dynamic external information retrieval to improve coherence and mitigate hallucinations.
  • Experimental results show significant performance gains, with improvements up to 20.94% in code generation and 16.96% in mathematical reasoning.

Synergizing Chain of Thoughts and Retrieval-Augmented Generation for Long-Horizon Tasks

Introduction

Rapid advances in the capabilities of LLMs have opened new frontiers in AI's ability to process and generate natural language. However, their performance on long-horizon generation tasks tends to degrade, primarily due to flawed intermediate reasoning steps and factual errors (hallucination). A promising approach to address these challenges combines chain-of-thought (CoT) prompting with retrieval-augmented generation (RAG). This paper presents a novel method, Retrieval-Augmented Thoughts (RAT), that iteratively revises each thought step with relevant retrieved information to enhance reasoning and mitigate hallucination.

Methodology

The RAT approach introduces two key components to improve LLMs' performance on long-horizon tasks:

  1. Iterative Thought Revision with RAG: An initial chain of thoughts generated by the LLM is revised iteratively. Each thought step is evaluated for potential flaws that could benefit from external information. The method retrieves relevant information using the task prompt, the current thought step under revision, and all previous revised thoughts as queries to an external knowledge source. This ensures each revision step is informed by the most relevant and up-to-date external information.
  2. Progressive Generation: Unlike conventional methods that revise the entire thought chain in one go, RAT adopts a step-by-step revision approach. It tailors the retrieval query based on the evolving understanding of the task and past thoughts, ensuring a coherent and factually grounded thought process. This methodology is analogous to how humans refine their reasoning by seeking and incorporating new information as they progress through problem-solving steps.
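The two steps above can be sketched as a simple loop. This is an illustrative sketch only, not the paper's implementation: the helper functions `llm` (any text-completion model) and `retrieve` (any retriever over an external knowledge source, e.g. BM25 or dense embeddings) are hypothetical stand-ins supplied by the caller.

```python
def rat(task: str, llm, retrieve, num_steps: int = 4) -> list[str]:
    """Revise a zero-shot chain of thoughts one step at a time with retrieval."""
    # 1. Draft an initial zero-shot CoT and split it into thought steps.
    draft = llm(f"Task: {task}\nLet's think step by step.")
    thoughts = [t.strip() for t in draft.split("\n") if t.strip()][:num_steps]

    revised: list[str] = []
    for thought in thoughts:
        # 2. Build the retrieval query from the task prompt, the current
        #    thought step, and all previously revised steps.
        query = " ".join([task, *revised, thought])
        evidence = retrieve(query)

        # 3. Ask the model to revise the current step against the evidence,
        #    then carry the revision forward into later queries.
        revision = llm(
            f"Task: {task}\n"
            f"Revised steps so far: {revised}\n"
            f"Current step: {thought}\n"
            f"Retrieved evidence: {evidence}\n"
            "Rewrite the current step so it is consistent with the evidence."
        )
        revised.append(revision.strip())
    return revised
```

Because each query folds in the already-revised prefix, later retrievals reflect the evolving reasoning rather than only the original draft, which is the progressive aspect the paper emphasizes.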

Experimental Results

The RAT method's efficacy was evaluated on diverse benchmarks, including code generation, mathematical reasoning, embodied task planning, and creative writing, using various LLMs such as GPT-3.5, GPT-4, and CodeLLaMA-7b. Results demonstrate notable improvements across all tasks, particularly in both reasoning quality and factual accuracy of the generated outputs. For instance, RAT achieved substantial gains over baseline models, including up to 20.94% on code generation and 16.96% on mathematical reasoning tasks, showcasing its capability to handle complex long-horizon generation tasks effectively.

Discussion and Future Directions

The success of RAT underlines the importance of integrating retrieval-augmented generation with CoT prompting to mitigate hallucination and improve reasoning in LLMs. This approach not only supports more accurate and contextually relevant generation but also highlights the potential for expanding LLMs' application in scenarios requiring deep reasoning and factual consistency.

Looking forward, the adaptability of RAT to various LLMs and tasks hints at its potential as a generalized strategy for enhancing LLM performance on long-horizon tasks. Future research could explore optimizing the retrieval process for efficiency, extending the method to more complex reasoning structures beyond linear thought chains, and refining the prompting strategy to further reduce reliance on external information retrieval while maintaining performance enhancements.

In conclusion, Retrieval-Augmented Thoughts represent a significant step forward in the pursuit of more intelligent, accurate, and reliable language generation and reasoning from LLMs, paving the way for advancements in AI's application in complex problem-solving scenarios.
