
$\texttt{LM}^\texttt{2}$: A Simple Society of Language Models Solves Complex Reasoning (2404.02255v1)

Published 2 Apr 2024 in cs.CL and cs.AI

Abstract: Despite demonstrating emergent reasoning abilities, LLMs often lose track of complex, multi-step reasoning. Existing studies show that providing guidance via decomposing the original question into multiple subproblems elicits more robustness in LLM reasoning -- a decomposer generates the subproblems, and a solver solves each of these subproblems. However, these techniques fail to accommodate coordination between the decomposer and the solver modules (either in a single model or different specialized ones) -- the decomposer does not keep track of the ability of the solver to follow the decomposed reasoning. In this paper, we propose LM2 to address these challenges. LM2 modularizes the decomposition, solution, and verification into three different LLMs. The decomposer module identifies the key concepts necessary to solve the problem and generates step-by-step subquestions according to the reasoning requirement. The solver model generates the solution to the subproblems that are then checked by the verifier module; depending upon the feedback from the verifier, the reasoning context is constructed using the subproblems and the solutions. These models are trained to coordinate using policy learning. Exhaustive experimentation suggests the superiority of LM2 over existing methods on in- and out-domain reasoning problems, outperforming the best baselines by $8.1\%$ on MATH, $7.71\%$ on JEEBench, and $9.7\%$ on MedQA problems (code available at https://github.com/LCS2-IIITD/Language_Model_Multiplex).

Improving Coordination Between Decomposer, Solver, and Verifier Models in LLMs for Complex Reasoning

Introduction to the Proposed LM2 Framework

In LLM research, executing complex, multi-step reasoning tasks remains a significant hurdle. Recent work has offered novel solutions, among which the integration of decomposer and solver models has shown considerable promise. However, these solutions are often limited by the lack of an effective feedback loop between the decomposing and solving phases of problem-solving. Addressing this gap, the paper introduces a multi-LLM framework, the Language Model Multiplex (LM2), that modularly distributes the decomposition, solution, and verification tasks across three distinct LLMs.

Key Contributions of LM2

  1. Introduction of a Verifier Model: A verifier model checks the solutions provided by the solver model, introducing an additional layer of scrutiny that enriches the reasoning feedback loop.
  2. Dynamic Coordination Through Policy Learning: The LM2 framework fine-tunes the interaction among its components (decomposer, solver, verifier) via policy learning, improving model coordination based on direct feedback.
  3. Enhanced Performance on Complex Reasoning Tasks: Experiments show LM2's superiority over existing methods on in-domain and out-of-domain reasoning tasks, with accuracy improvements of 8.1% on MATH, 7.71% on JEEBench, and 9.7% on MedQA.

Methodological Innovations

The LM2 approach distinguishes itself through several methodological innovations:

  • The decomposer identifies necessary concepts and generates step-by-step sub-questions, guided by the capabilities and feedback of the solver and verifier.
  • The solver, powered by an unmodified GPT-3.5 model, focuses on generating responses to these sub-questions.
  • The verifier, fine-tuned on distinct classifications of potential answer errors, provides nuanced feedback essential for refining the overall reasoning process.
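The decompose-solve-verify loop described above can be sketched as follows. The `decompose`, `solve`, and `verify` callables are hypothetical stand-ins for calls to the three models, and the toy stand-ins and error label below are illustrative assumptions, not the paper's exact interfaces or taxonomy:

```python
from typing import Callable, Optional

def lm2_reason(question: str,
               decompose: Callable[[str, list], Optional[str]],
               solve: Callable[[str, list], str],
               verify: Callable[[str, str], str],
               max_steps: int = 10) -> list:
    """Run the decompose -> solve -> verify loop, threading verifier
    feedback back into the reasoning context."""
    context = []  # accumulated (sub-question, answer) pairs
    for _ in range(max_steps):
        subq = decompose(question, context)
        if subq is None:  # decomposer signals the problem is solved
            break
        answer = solve(subq, context)
        verdict = verify(subq, answer)
        if verdict == "correct":
            context.append((subq, answer))
        else:
            # Keep the failed step plus the verifier's error label in the
            # context, so the next sub-question can be re-planned around
            # the solver's demonstrated weakness.
            context.append((subq, f"[{verdict}] {answer}"))
    return context

# Toy deterministic stand-ins (LM2 itself uses three LLMs here):
subqs = ["What is 2 + 3?", "What is 5 * 4?"]
toy_decompose = lambda q, ctx: subqs[len(ctx)] if len(ctx) < len(subqs) else None
toy_solve = lambda sq, ctx: {"What is 2 + 3?": "5", "What is 5 * 4?": "20"}[sq]
toy_verify = lambda sq, ans: "correct"

trace = lm2_reason("What is (2 + 3) * 4?", toy_decompose, toy_solve, toy_verify)
print(trace)  # [('What is 2 + 3?', '5'), ('What is 5 * 4?', '20')]
```

The key design point is that the verifier's verdict is written back into the shared context rather than silently discarded, which is what lets the decomposer adapt subsequent sub-questions to the solver's observed ability.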

Training and Results

The paper details an exhaustive training regimen in which a LLaMA-2 model is fine-tuned for both the decomposer and the verifier components. A notable aspect of the training process is the use of Proximal Policy Optimization (PPO) to integrate verifier feedback, a choice that contributes significantly to the model's robustness on complex reasoning sequences.
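The policy-learning step can be illustrated with a minimal reward-shaping sketch: per-step verifier verdicts for one reasoning episode are collapsed into the scalar reward that a PPO trainer would optimize. The reward values and the step penalty below are illustrative assumptions, not the paper's exact scheme:

```python
def episode_reward(verdicts: list,
                   reached_final_answer: bool,
                   step_penalty: float = 0.05) -> float:
    """Map per-step verifier verdicts to one scalar PPO reward.

    Illustrative scheme (not the paper's exact reward design):
    each verified-correct step earns credit, each flagged error
    costs credit, and a small per-step penalty discourages
    needlessly long decompositions.
    """
    reward = 0.0
    for verdict in verdicts:
        reward += 1.0 if verdict == "correct" else -1.0
    reward -= step_penalty * len(verdicts)
    if reached_final_answer:
        reward += 2.0  # bonus for completing the reasoning chain
    return reward

# A three-step episode with one verifier-flagged error:
r = episode_reward(["correct", "calculation-error", "correct"], True)
print(round(r, 2))  # 2.85
```

In PPO terms, this scalar would score each sampled decomposition trajectory, so sub-question sequences the solver can actually follow are reinforced over ones it cannot.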

Empirical results reveal LM2's effectiveness, with marked improvements over baseline models on several benchmark datasets. Most notably, despite being fine-tuned on mathematical reasoning problems, LM2 generalizes well across diverse reasoning tasks, including chemistry and medical question answering.

A Discussion on Implications and Future Directions

The introduction of LM2 sets a new benchmark in the pursuit of more effective multi-step reasoning within LLMs. The model's ability to dynamically adjust the decomposition of questions based on ongoing solver performance and verifier feedback opens new avenues for research into LLMs' capabilities in handling intricate reasoning tasks.

Looking forward, the LM2 framework's modularity hints at the potential for incorporating more specialized models for each of its components, possibly enabling even finer-grained reasoning. Moreover, the structured feedback mechanism introduced by the verifier model suggests promising research directions for error detection and correction within LLMs.

In conclusion, the Language Model Multiplex framework represents a significant step forward in the development of LLMs capable of nuanced and complex reasoning across varied domains. With its innovative structure and promising initial results, LM2 offers a blueprint for enhancing the interplay between the decomposing, solving, and verifying components of LLM-based reasoning systems.

Authors (3)
  1. Gurusha Juneja
  2. Subhabrata Dutta
  3. Tanmoy Chakraborty