CogER-Agent: Elastic Reasoning for LLMs
- CogER-Agent is a cognitive-inspired reinforcement learning agent that models hierarchical human reasoning as a finite-horizon MDP and adapts its strategy based on query complexity.
- It employs Group Relative Policy Optimization and LoRA adaptation to fine-tune its policy, achieving significant accuracy and efficiency gains over traditional scaling methods.
- The agent integrates tool-assisted reasoning by invoking external resources as needed, demonstrating superior performance across diverse benchmarks in both in-domain and out-of-domain tasks.
CogER-Agent is a cognitive-inspired reinforcement learning agent designed for elastic reasoning in LLMs. Developed within the CogER (Cognitive-Inspired Elastic Reasoning) framework, the agent autonomously selects among multiple reasoning strategies per query, adapting computational effort according to task complexity. CogER-Agent models hierarchical human reasoning as a Markov Decision Process (MDP) and is trained with a Group Relative Policy Optimization (GRPO) algorithm. It supports tool-assisted reasoning, allowing the LLM to invoke external resources as necessary, and demonstrates significant efficiency and accuracy gains over prior scaling and routing methods (Hu et al., 17 Dec 2025).
1. Markov Decision Process Formulation
CogER-Agent's reasoning strategy selection is formalized as a finite-horizon MDP $\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R)$. The state at time $t$ is $s_t = (q, c_t, \ell_t)$, where $q$ is the user query, $c_t$ the partial chain-of-thought (CoT), and $\ell_t$ the inferred complexity level ($\ell_t \in \{1, 2, 3, 4\}$). Each state either contains an explicit complexity tag or maintains this information within the model's hidden state.
The action space comprises four discrete high-level modes:
- $a^{(1)}$: Immediate answer (level 1, rapid inference)
- $a^{(2)}$: Intermediate reasoning (e.g., invoking a larger LLM)
- $a^{(3)}$: Escalated chain-of-thought
- $a^{(4)}$: External tool invocation (Delegate)
together with the token vocabulary $\mathcal{V}$ for standard token emission, so $\mathcal{A} = \{a^{(1)}, a^{(2)}, a^{(3)}, a^{(4)}\} \cup \mathcal{V}$.
The transition function deterministically updates the state, either by appending a mode-selection tag or emitting a token.
The reward function is a terminal scalar combining format validation, output correctness, and a penalty for unnecessary cognitive escalation:

$$R = R_{\text{format}} + R_{\text{acc}} + R_{\text{hier}},$$

where $R_{\text{format}}$ enforces well-formed tags, $R_{\text{acc}}$ rewards correct answers, and $R_{\text{hier}}$ penalizes selecting a higher complexity level than necessary, encouraging minimal sufficient reasoning.
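A minimal sketch of such a composite terminal reward is shown below; the component weights, tag format, and the linear form of the escalation penalty are illustrative assumptions rather than the paper's exact definitions.

```python
from dataclasses import dataclass

# Illustrative mode names following the level-1..4 descriptions above.
MODES = ("answer", "intermediate", "escalate", "delegate")

@dataclass
class Trajectory:
    mode_tag: str    # e.g. "<answer>", "<delegate>", ... (assumed tag format)
    prediction: str  # final answer extracted from the rollout
    level: int       # complexity level chosen by the agent (1-4)

def terminal_reward(traj: Trajectory, gold: str,
                    w_format: float = 0.1,  # assumed weights
                    w_acc: float = 1.0,
                    w_hier: float = 0.1) -> float:
    """Composite terminal reward: format validity + correctness - escalation penalty."""
    # R_format: well-formed mode-selection tag.
    r_format = 1.0 if traj.mode_tag.strip("<>") in MODES else 0.0
    # R_acc: exact-match correctness of the final answer.
    r_acc = 1.0 if traj.prediction.strip() == gold.strip() else 0.0
    # R_hier: penalize higher levels so the agent prefers the cheapest
    # strategy that still answers correctly (assumed linear penalty).
    r_hier = -(traj.level - 1) / 3.0
    return w_format * r_format + w_acc * r_acc + w_hier * r_hier
```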
2. Policy Parameterization and Group Relative Policy Optimization
The policy is a single-headed softmax over joint actions:

$$\pi_\theta(a \mid s_t) = \operatorname{softmax}\big(f_\theta(s_t)\big)_a, \quad a \in \mathcal{A},$$

where $f_\theta$ outputs unnormalized logits over $\mathcal{A}$.
CogER-Agent is optimized using Group Relative Policy Optimization (GRPO), a PPO-style algorithm with KL regularization. For each group of $G$ rollouts $\{o_i\}_{i=1}^{G}$ sampled under the old policy $\pi_{\theta_{\text{old}}}$, rewards $R_i$ are computed and normalized as

$$\hat{A}_i = \frac{R_i - \operatorname{mean}(\{R_j\}_{j=1}^{G})}{\operatorname{std}(\{R_j\}_{j=1}^{G})},$$

giving the same relative advantage $\hat{A}_i$ to every token in $o_i$. The GRPO objective is:

$$\mathcal{J}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\min\!\Big(r_{i,t}(\theta)\,\hat{A}_i,\ \operatorname{clip}\big(r_{i,t}(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_i\Big) - \beta\, D_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\text{ref}}\big)\right],$$

where $r_{i,t}(\theta) = \pi_\theta(o_{i,t} \mid q, o_{i,<t}) / \pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,<t})$ is the token-level importance ratio, $\epsilon$ is the PPO clipping threshold, $\beta$ the KL penalty coefficient, and $\pi_{\text{ref}}$ a fixed reference policy.
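The group-normalized advantage and the clipped, KL-regularized surrogate can be sketched as follows; this is a PyTorch illustration in which the tensor layout and the k3 KL estimator are assumptions, not the paper's implementation.

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, rewards, mask,
              clip_eps: float = 0.2, kl_beta: float = 0.04):
    """GRPO surrogate for one group of G rollouts.

    logp_new/logp_old/logp_ref: (G, T) per-token log-probs under the current,
    behaviour, and reference policies; rewards: (G,) terminal rewards;
    mask: (G, T) with 1 for generated tokens and 0 for padding.
    """
    # Group-normalized advantage, broadcast to every token of each rollout.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)   # (G,)
    adv = adv.unsqueeze(1)                                       # (G, 1)

    # Clipped importance-weighted surrogate (PPO-style).
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    surrogate = torch.min(unclipped, clipped)

    # Per-token KL penalty toward the reference policy (k3 estimator).
    log_ratio_ref = logp_ref - logp_new
    kl = torch.exp(log_ratio_ref) - log_ratio_ref - 1.0

    per_token = surrogate - kl_beta * kl
    # Average over valid tokens, then over the group; negate for gradient descent.
    return -((per_token * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)).mean()
```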
3. Training Procedure
Query difficulty is estimated in situ by the 7B-parameter CogER-Agent using a prompt that requires a level assignment ($\ell \in \{1, \dots, 4\}$) prior to answer generation. For $\ell = 1$, a direct answer is also output. For each query, $G$ complete trajectories (rollouts) are sampled, each labeled with the terminal reward defined above and normalized within the group to yield the advantage.
Fine-tuning employs GRPO combined with LoRA adaptation, using the AdamW optimizer and a maximum generation length of 8192 tokens. Training is performed over a single epoch on an 8,000-query mixture (2,000 each from GSM8K, MATH, CommonsenseQA, MedQA), with three random seeds and mean ± standard deviation reported over the three runs.
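As a concrete illustration of the in-situ difficulty estimation, a minimal sketch is shown below; the prompt wording and the `<level>`/`<answer>` tag syntax are assumptions, since the source does not reproduce the actual prompt.

```python
import re
from typing import Optional, Tuple

LEVEL_PROMPT = (
    "Assess the complexity of the question on a scale of 1-4 and reply with "
    "<level>k</level> before answering. If the level is 1, also give the answer "
    "directly inside <answer>...</answer>.\n\nQuestion: {query}"
)  # assumed prompt wording

def parse_level(completion: str) -> Tuple[Optional[int], Optional[str]]:
    """Extract the chosen complexity level and, for level 1, the direct answer."""
    level_match = re.search(r"<level>\s*([1-4])\s*</level>", completion)
    level = int(level_match.group(1)) if level_match else None
    answer_match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    answer = answer_match.group(1).strip() if (answer_match and level == 1) else None
    return level, answer
```

During training, each query would be rolled out $G$ times with such a prompt, and all $G$ completions scored with the terminal reward before within-group normalization.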
4. Cognitive Tool-Assisted Reasoning (CoTool)
For level-4 (Delegate) queries, CogER-Agent invokes a tool-augmented reasoning mechanism ("CoTool"). The agent alternates between internal generation and external tool invocation:
- Generation proceeds until the LLM emits EOS or a tool query boundary token.
- On emission of `<|end_tool_query|>`, the sub-query is extracted.
- A dedicated tool selector, prompted with task-specific instructions, chooses the optimal RSTKit tool and JSON arguments, executes the tool, and returns the output.
- The tool's result is appended to the CoT (wrapped in `<|begin_tool_result|>...<|end_tool_result|>`), and internal reasoning resumes.
- The process is iterated up to a maximum number of tool calls/turns.
Formally, the probability of the $k$-th tool query $q_k$ is given by:

$$p_\theta\big(q_k \mid I,\, c_{<k},\, r_{k-1}\big) = \prod_{t=1}^{|q_k|} \pi_\theta\big(q_{k,t} \mid I,\, c_{<k},\, r_{k-1},\, q_{k,<t}\big),$$

where $I$ is the instruction, $c_{<k}$ the reasoning generated so far, and $r_{k-1}$ the most recent tool output.
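A minimal sketch of this alternating generate/invoke loop is given below; the `generate_until` and `select_and_run_tool` helpers are hypothetical, and only the boundary tokens quoted above are taken from the source.

```python
TOOL_QUERY_END = "<|end_tool_query|>"
TOOL_RESULT_OPEN = "<|begin_tool_result|>"
TOOL_RESULT_CLOSE = "<|end_tool_result|>"

def cotool_reason(llm, tool_selector, instruction: str, max_tool_calls: int = 4) -> str:
    """Alternate between internal generation and external tool invocation."""
    context = instruction
    for _ in range(max_tool_calls):
        # Generate until EOS or the tool-query boundary token (hypothetical helper
        # returning the generated text and whether it stopped on the boundary token).
        chunk, stopped_on_tool = llm.generate_until(context, stop=[TOOL_QUERY_END])
        context += chunk
        if not stopped_on_tool:  # EOS reached: reasoning is complete.
            return context
        # Heuristic extraction of the sub-query emitted before <|end_tool_query|>.
        sub_query = chunk.rsplit("\n", 1)[-1].strip()
        # Tool selector picks an RSTKit tool + JSON args and executes it (hypothetical).
        result = tool_selector.select_and_run_tool(sub_query)
        # Append the tool output to the chain-of-thought and resume reasoning.
        context += f"{TOOL_QUERY_END}{TOOL_RESULT_OPEN}{result}{TOOL_RESULT_CLOSE}"
    return context
```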
5. Empirical Evaluation and Ablation
CogER-Agent has been evaluated against contemporary test-time scaling and routing baselines across in-domain (ID; e.g., GSM8K, MATH, CommonsenseQA, MedQA) and out-of-domain (OOD; MAWPS, CollegeMath) tasks. Performance is measured by exact match (EM), parameter usage, latency, and tokens generated.
Main Results
| Method | Avg. ID EM | Avg. OOD EM |
|---|---|---|
| DeepSeek-R1 | 81.55 | 83.00 |
| S1-32B | 78.80 | 81.32 |
| ReasonFlux-32B | 68.51 | 86.25 |
| CogER-Agent | 89.28 | 93.56 |
CogER-Agent improves average ID EM by a relative 9.5% over DeepSeek-R1 and average OOD EM by a relative 12.8%.
Ablation by Reasoning Mode
| Mode | ID EM | OOD EM |
|---|---|---|
| Level 1 only (7B) | 76.28 | 86.23 |
| Level 2 only (32B) | 83.62 | 89.49 |
| Level 3 only (QWQ) | 86.75 | 93.13 |
| Level 4 only (CoTool) | 88.42 | 92.89 |
| CogER (all modes) | 89.28 | 93.56 |
Reward Component Ablation
| Version | ID EM | OOD EM |
|---|---|---|
| Training-free prompt | 86.35 | 92.78 |
| w/o $R_{\text{format}}$ | 87.37 | 93.42 |
| w/o $R_{\text{hier}}$ | 87.89 | 92.21 |
| CogER full | 89.28 | 93.56 |
Eliminating format or hierarchy terms reduces performance and promotes excessive delegation.
Impact of CoTool (Math Benchmarks)
| Version | MATH-500 EM | Tool-Invoc. Rate | CollegeMath EM | Tool-Invoc. Rate |
|---|---|---|---|---|
| w/o CoTool | 87.20 | – | 87.93 | – |
| + CoTool | 97.00 | 3.03% | 89.04 | 5.17% |
An absolute gain of roughly 10 points in MATH-500 EM is achieved with tool invocations on only about 3% of queries.
Efficiency Metrics
| Method | Parameters | Latency (s) | Tokens/query |
|---|---|---|---|
| DeepSeek-R1 | 671B | 506.19 | 654.63 |
| S1-32B | 32B | 273.47 | 946.70 |
| ReasonFlux-32B | 32B | 286.97 | 1050.63 |
| CogER-Agent | 29.6B | 118.53 | 489.71 |
CogER-Agent exhibits over a 4× speedup compared to DeepSeek-R1, with reduced token usage.
Routing Strategy Comparison
| Strategy | ID EM | OOD EM |
|---|---|---|
| Uniform random | 84.21 | 90.28 |
| Supervised router | 84.09 | 90.32 |
| CogER-RL | 89.28 | 93.56 |
RL-based MDP training confers clear advantages over naïve or purely supervised routing approaches.
6. Significance and Implications
CogER-Agent delivers hierarchical reasoning by integrating per-query dynamic strategy selection, explicit reward signals for correctness and efficiency, and automated tool invocation. The MDP-based GRPO policy robustly adapts to workload variability, realizing both computational savings and substantial accuracy improvements across diverse benchmarks. The inclusion of Cognitive Tool-Assisted Reasoning grants the agent the capacity to handle previously intractable queries with minimal external resource usage. This suggests a promising research trajectory for LLM frameworks capable of fine-grained task allocation, hierarchical control, and selective tool use, with direct evidence that an RL-based formulation with hybrid reward functions is superior to static or supervised-only alternatives for elastic reasoning in LLMs (Hu et al., 17 Dec 2025).