
Self-Contrast: Better Reflection Through Inconsistent Solving Perspectives (2401.02009v3)

Published 4 Jan 2024 in cs.CL and cs.AI

Abstract: The reflection capacity of LLMs has garnered extensive attention. Post-hoc prompting strategies, e.g., Reflexion and Self-Refine, refine an LLM's response based on self-evaluated or external feedback. However, recent research indicates that, without external feedback, an LLM's intrinsic reflection is unstable. Our investigation unveils that the key bottleneck is the quality of the self-evaluated feedback. We find that LLMs often exhibit overconfidence or high randomness when self-evaluating, offering stubborn or inconsistent feedback, which causes poor reflection. To remedy this, we advocate Self-Contrast: it adaptively explores diverse solving perspectives tailored to the request, contrasts their differences, and summarizes these discrepancies into a checklist that can be used to re-examine and eliminate them. Our method endows the LLM with diverse perspectives to alleviate stubborn biases. Moreover, the discrepancies among perspectives indicate potential errors or inherent uncertainties that the LLM often overlooks; reflecting upon these can catalyze more accurate and stable reflection. Experiments conducted on a series of reasoning and translation tasks with different LLMs underscore the effectiveness and generality of our strategy.

Overview of Self-Contrast Strategy

LLMs have shown remarkable prowess in a range of tasks, particularly when supplemented with post-hoc prompting techniques that encourage self-reflection to refine responses. However, without external guidance, the self-reflection process has proven to be unreliable due to the inconsistent and overconfident nature of LLM-generated feedback. In light of these limitations, researchers have proposed a new approach, termed "Self-Contrast," aimed at improving the self-reflection mechanism in LLMs.

Enhancing LLM Self-Reflection

The proposed Self-Contrast method seeks to improve LLM response quality by having the model generate diverse solving perspectives for a given problem. These perspectives are then contrasted with one another to identify discrepancies. By summarizing the discrepancies into a checklist, the LLM gains a more refined instrument for revisiting and revising its previous responses, enabling it to overcome biases and errors that might otherwise go unnoticed.
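The perspective-contrast-checklist loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `llm` callable, the prompt wordings, and the line-based parsing of perspectives are all assumptions standing in for whatever model API and prompt templates are actually used.

```python
from typing import Callable, List

def self_contrast(problem: str, llm: Callable[[str], str], n_perspectives: int = 3) -> str:
    """Sketch of the Self-Contrast loop with a generic LLM completion function."""
    # 1. Adaptively create diverse solving perspectives tailored to this problem.
    raw = llm(
        f"Propose {n_perspectives} distinct solving perspectives "
        f"(e.g., different methods or representations) for this problem, "
        f"one per line:\n{problem}"
    )
    perspectives: List[str] = [p for p in raw.splitlines() if p.strip()][:n_perspectives]

    # 2. Solve the problem independently from each perspective.
    solutions = [
        llm(f"Solve the problem from this perspective: {p}\nProblem: {problem}")
        for p in perspectives
    ]

    # 3. Contrast the solutions and summarize their discrepancies into a checklist.
    joined = "\n---\n".join(solutions)
    checklist = llm(
        "Compare these candidate solutions, list their discrepancies, and turn "
        f"them into a checklist of points to re-examine:\n{joined}"
    )

    # 4. Re-examine the candidates against the checklist and produce a final answer.
    return llm(
        f"Problem: {problem}\nCandidate solutions:\n{joined}\n"
        f"Checklist of discrepancies:\n{checklist}\n"
        "Re-examine the candidates against the checklist and give one final, corrected answer."
    )
```

The key design point is step 3: rather than asking the model to judge a single answer (which invites overconfident or random feedback), the checklist is grounded in concrete disagreements between independently produced solutions.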

Methodology and Findings

The research includes systematic experiments testing the effectiveness of the Self-Contrast method, comparing its performance to traditional self-reflection strategies across reasoning and translation tasks. The findings indicate that Self-Contrast delivers significant improvements in performance and stability by directing the LLMs to produce varied responses and then using discrepancies between these responses as a catalyst for more accurate reflection.

Conclusions and Future Directions

Overall, the Self-Contrast approach significantly reduces the occurrence of invalid or toxic reflections where LLMs fail to correct their mistakes or inaccurately modify correct answers. Despite its promise, it is noted that the method's efficacy diminishes with smaller-scale models that lack strong instruction-following capabilities. Future work may explore external tools for comparing perspectives, offering a potentially more precise and flexible solution for LLM reflection improvement.

Authors (7)
  1. Wenqi Zhang
  2. Yongliang Shen
  3. Linjuan Wu
  4. Qiuying Peng
  5. Jun Wang
  6. Yueting Zhuang
  7. Weiming Lu