
CC-LEARN: Cohort-based Consistency Learning (2506.15662v1)

Published 18 Jun 2025 in cs.CL

Abstract: LLMs excel at many tasks but still struggle with consistent, robust reasoning. We introduce Cohort-based Consistency Learning (CC-Learn), a reinforcement learning framework that improves the reliability of LLM reasoning by training on cohorts of similar questions derived from shared programmatic abstractions. To enforce cohort-level consistency, we define a composite objective combining cohort accuracy, a retrieval bonus for effective problem decomposition, and a rejection penalty for trivial or invalid lookups that reinforcement learning can directly optimize, unlike supervised fine-tuning. Optimizing this reward guides the model to adopt uniform reasoning patterns across all cohort members. Experiments on challenging reasoning benchmarks (including ARC-Challenge and StrategyQA) show that CC-Learn boosts both accuracy and reasoning stability over pretrained and SFT baselines. These results demonstrate that cohort-level RL effectively enhances reasoning consistency in LLMs.

Summary

Cohort-based Consistency Learning for LLMs

The paper introduces CC-Learn, a reinforcement learning (RL) framework designed to enhance reasoning consistency and reliability in LLMs. While LLMs have demonstrated proficiency in various tasks, their performance remains inconsistent when facing paraphrased or logically equivalent questions. This inconsistency undermines their application in scenarios requiring robust and uniform reasoning. CC-Learn addresses this issue by employing a cohort-based approach, where LLMs are trained to maintain consistency across similar questions sharing a common reasoning path.

The core innovation of CC-Learn lies in redefining the training objective as a composite reward that combines cohort-level accuracy, a bonus for effective problem decomposition via retrieval, and a penalty for trivial or invalid retrieval attempts. Unlike traditional supervised fine-tuning, the framework uses RL to optimize this objective directly, pushing the model to adopt a consistent reasoning strategy across all cohort members.
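As a rough illustration of how such a composite objective could be collapsed into a single scalar reward, here is a minimal sketch; the weights `alpha` and `beta`, the cap on the retrieval bonus, and the function signature are assumptions for the example, not details taken from the paper.

```python
def composite_reward(cohort_correct, num_retrievals, num_rejected,
                     alpha=0.2, beta=0.5):
    """Illustrative combination of the three reward terms described above.

    cohort_correct : list of booleans, per-variant correctness across the cohort
    num_retrievals : retrieval lookups issued by the generated program
    num_rejected   : lookups rejected as trivial or invalid
    alpha, beta    : illustrative weights (not specified in the paper)
    """
    cohort_accuracy = sum(cohort_correct) / len(cohort_correct)
    retrieval_bonus = alpha * min(num_retrievals, 3)  # cap to avoid gaming the bonus
    rejection_penalty = beta * num_rejected
    return cohort_accuracy + retrieval_bonus - rejection_penalty


# Example: 3 of 4 variants correct, 2 useful lookups, 1 rejected lookup.
print(composite_reward([True, True, True, False], 2, 1))  # 0.75 + 0.4 - 0.5 = 0.65
```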

Methodology

CC-Learn employs a structured framework that involves:

  1. Cohort Construction: Questions are transformed into masked templates that highlight their logical structure. From these templates, factual variants are generated, ensuring that each cohort's questions demand the same reasoning but vary in surface details.
  2. Composite Reward System: The RL model maximizes a reward that balances accuracy across the cohort with bonuses for efficient retrieval calls and penalties for trivial or invalid queries. This discourages models from adopting shortcuts and fosters uniform reasoning patterns.
  3. Rejection Mechanism: A modified retrieval model filters out complex queries, ensuring the policy model relies on simple, verifiable lookups. This prevents models from bypassing the intended reasoning path.
  4. Execution and Evaluation: Programs generated by the model are executed on cohorts, with performance evaluated under both lenient and strict consistency criteria (a minimal sketch of cohort construction and the two scoring criteria follows this list). This evaluation framework provides insight into the model's ability to generalize over variants.
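The sketch below ties steps 1 and 4 together: it instantiates a masked template into a small cohort of surface variants and then scores a toy predictor under lenient (per-question) and strict (all-variants-correct) criteria. The template, fillers, helper names, and `toy_model` are hypothetical stand-ins for the paper's pipeline, not its actual implementation.

```python
from typing import Callable, Dict, List

TEMPLATE = "Is [ENTITY] located in [REGION]?"  # hypothetical masked template

def build_cohort(fillers: List[Dict[str, str]]) -> List[str]:
    """Instantiate the masked template with different factual fillers, so each
    variant demands the same reasoning path but differs in surface details."""
    cohort = []
    for filler in fillers:
        question = TEMPLATE
        for slot, value in filler.items():
            question = question.replace(f"[{slot}]", value)
        cohort.append(question)
    return cohort

def evaluate_cohort(predict: Callable[[str], str],
                    cohort: List[str],
                    answers: List[str]):
    """Lenient = mean accuracy over the variants; strict = 1.0 only when every
    variant is answered correctly (the cohort-level consistency criterion)."""
    correct = [predict(q) == a for q, a in zip(cohort, answers)]
    return sum(correct) / len(correct), float(all(correct))

fillers = [{"ENTITY": "Kyoto", "REGION": "Japan"},
           {"ENTITY": "Lyon", "REGION": "France"},
           {"ENTITY": "Perth", "REGION": "Canada"}]
answers = ["yes", "yes", "no"]

toy_model = lambda q: "yes"  # stand-in for an LLM-generated reasoning program
print(evaluate_cohort(toy_model, build_cohort(fillers), answers))
# -> (0.666..., 0.0): acceptable under lenient scoring, fails strict consistency
```

The strict criterion is what makes the evaluation cohort-level: a model that shortcuts the shared reasoning path may still score well leniently, but it is penalized whenever any single variant breaks.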

Experimental Findings

The framework's efficacy is demonstrated on five diverse reasoning benchmarks: ARC-Easy, ARC-Challenge, StrategyQA, HotpotQA, and CommonsenseQA. The results show significant improvements in reasoning consistency and accuracy, with absolute performance gains ranging from 5% to 10%. The structured approach of CC-Learn notably enhances reasoning stability, as evidenced by a 47% preference rate for its reasoning paths in a human evaluation study.

Implications and Future Directions

CC-Learn showcases the potential of reinforcement learning in addressing logical inconsistencies in LLMs. By structuring training around cohorts of similar questions, the framework encourages the adoption of sound reasoning strategies that are both generalizable and adaptable to variations. This methodology holds promise for advancing applications in education, decision-making, and any domain demanding reliable model predictions.

Future developments could explore alternative model configurations or more sophisticated retrieval architectures to further enhance reasoning fidelity. Additionally, extending the framework to larger cohorts or more complex reasoning tasks could broaden its impact. As consistency and reliability remain critical for the practical deployment of LLMs, approaches like CC-Learn are central to advancing AI's integration into real-world applications.
