Cohort-based Consistency Learning for LLMs
The paper introduces CC-Learn, a reinforcement learning (RL) framework designed to enhance reasoning consistency and reliability in LLMs. While LLMs have demonstrated proficiency in various tasks, their performance remains inconsistent when faced with paraphrased or logically equivalent questions. This inconsistency undermines their use in scenarios that require robust and uniform reasoning. CC-Learn addresses the issue with a cohort-based approach: the model is trained to stay consistent across groups of similar questions that share a common reasoning path.
The core innovation of CC-Learn lies in redefining the training objective around a composite reward that combines cohort-level accuracy, bonuses for effective problem decomposition, and penalties for invalid retrieval attempts. Unlike traditional supervised fine-tuning, the framework optimizes this objective with RL, pushing the model toward a consistent reasoning strategy across all cohort members.
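To make the reward structure concrete, here is a minimal sketch of how a cohort-level composite reward of this kind could be computed. The weights, the `MemberTrace` fields, and the specific bonus and penalty terms are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of a cohort-level composite reward.
# Weights and terms are illustrative assumptions, not the paper's exact formulation.
from dataclasses import dataclass

@dataclass
class MemberTrace:
    """Outcome of running the policy's program on one cohort member."""
    correct: bool             # final answer matched the gold answer
    num_subqueries: int       # retrieval calls issued during decomposition
    num_invalid_queries: int  # queries rejected as trivial or malformed

def cohort_reward(traces, w_acc=1.0, w_decomp=0.2, w_penalty=0.5):
    """Combine cohort accuracy, a decomposition bonus, and retrieval penalties."""
    if not traces:
        return 0.0
    accuracy = sum(t.correct for t in traces) / len(traces)
    # Bonus when the program actually decomposes the question into sub-queries.
    decomposition_bonus = sum(t.num_subqueries > 0 for t in traces) / len(traces)
    # Penalty for trivial or invalid retrieval attempts, averaged over the cohort.
    invalid_rate = sum(t.num_invalid_queries for t in traces) / len(traces)
    return w_acc * accuracy + w_decomp * decomposition_bonus - w_penalty * invalid_rate

# Example: a 3-member cohort where two variants are answered correctly.
traces = [MemberTrace(True, 2, 0), MemberTrace(True, 3, 1), MemberTrace(False, 1, 0)]
print(cohort_reward(traces))  # single scalar reward for the RL update
```

The key design choice this illustrates is that the reward is computed over the whole cohort rather than per question, so the policy cannot score well by getting only the easiest variant right.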
Methodology
The CC-Learn framework consists of four components:
- Cohort Construction: Questions are transformed into masked templates that expose their logical structure. From these templates, factual variants are generated, so that each cohort's questions demand the same reasoning but vary in surface details (a toy construction sketch follows this list).
- Composite Reward System: The policy is trained to maximize a reward that balances accuracy across the cohort with bonuses for efficient retrieval calls and penalties for trivial or invalid queries (as sketched above). This discourages shortcuts and fosters uniform reasoning patterns.
- Rejection Mechanism: A modified retrieval model rejects overly complex queries, forcing the policy to rely on simple, verifiable lookups and preventing it from bypassing the intended reasoning path.
- Execution and Evaluation: Programs generated by the model are executed on entire cohorts, with performance scored under both lenient and strict consistency criteria (a toy evaluation sketch also follows below). This evaluation setup makes visible how well the model generalizes over variants.
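As a rough illustration of cohort construction (first bullet above), the following sketch fills a masked template with different factual slot values to produce surface variants that share one reasoning structure. The template, slot names, and fillers are hypothetical examples chosen for illustration, not taken from the paper.

```python
# Illustrative cohort construction: one masked template, several factual variants.
# Template and slot fillers are hypothetical examples.
from itertools import product

template = "Was {PERSON} alive when {EVENT} happened?"

slot_values = {
    "PERSON": ["Isaac Newton", "Marie Curie"],
    "EVENT": ["the first Moon landing", "the invention of the telegraph"],
}

def build_cohort(template, slot_values):
    """Instantiate every combination of slot values into a cohort of questions."""
    names = list(slot_values)
    cohort = []
    for combo in product(*(slot_values[n] for n in names)):
        filled = template
        for name, value in zip(names, combo):
            filled = filled.replace("{" + name + "}", value)
        cohort.append(filled)
    return cohort

for question in build_cohort(template, slot_values):
    print(question)  # every variant requires the same two-step reasoning path
```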
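The lenient versus strict distinction in the last bullet can be read as follows: lenient accuracy scores each question independently, while strict consistency credits a cohort only when every variant in it is answered correctly. Below is a minimal sketch under that reading; the paper's exact criteria may differ.

```python
# Minimal sketch of lenient vs. strict cohort evaluation (a simplifying assumption
# about the paper's criteria): lenient averages per-question accuracy, strict
# requires every member of a cohort to be answered correctly.
def evaluate(cohorts):
    """cohorts: list of lists of booleans, one correctness flag per cohort member."""
    total_questions = sum(len(c) for c in cohorts)
    lenient = sum(sum(c) for c in cohorts) / total_questions
    strict = sum(all(c) for c in cohorts) / len(cohorts)
    return lenient, strict

# Example: two cohorts of three variants each; one cohort has a single failure.
cohorts = [[True, True, True], [True, False, True]]
lenient, strict = evaluate(cohorts)
print(f"lenient accuracy = {lenient:.2f}, strict consistency = {strict:.2f}")
```

The gap between the two numbers is informative: a model can look strong under lenient scoring while failing the strict criterion on cohorts where it answers some variants inconsistently.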
Experimental Findings
The framework's efficacy is demonstrated on five diverse reasoning benchmarks: ARC-Easy, ARC-Challenge, StrategyQA, HotpotQA, and CommonsenseQA. The results show significant improvements in reasoning consistency and accuracy, with absolute gains ranging from 5% to 10%. The structured approach of CC-Learn also improves reasoning stability, as evidenced by a 47% preference rate for its reasoning paths in a human evaluation.
Implications and Future Directions
CC-Learn showcases the potential of reinforcement learning in addressing logical inconsistencies in LLMs. By structuring training around cohorts of similar questions, the framework encourages the adoption of sound reasoning strategies that are both generalizable and adaptable to variations. This methodology holds promise for advancing applications in education, decision-making, and any domain demanding reliable model predictions.
Future work could explore alternative model configurations or more sophisticated retrieval architectures to further improve reasoning fidelity. Expanding the framework to larger cohorts or more complex reasoning tasks could also broaden its impact. As consistency and reliability remain critical for the practical deployment of LLMs, approaches like CC-Learn are an important step toward dependable, trustworthy model behavior.