
CogER-Agent: Elastic Reasoning for LLMs

Updated 18 December 2025
  • CogER-Agent is a cognitive-inspired reinforcement learning agent that models hierarchical human reasoning as a finite-horizon MDP and adapts its strategy based on query complexity.
  • It employs Group Relative Policy Optimization and LoRA adaptation to fine-tune its policy, achieving significant accuracy and efficiency gains over traditional scaling methods.
  • The agent integrates tool-assisted reasoning by invoking external resources as needed, demonstrating superior performance across diverse benchmarks in both in-domain and out-of-domain tasks.

CogER-Agent is a cognitive-inspired reinforcement learning agent designed for elastic reasoning in LLMs. Developed within the CogER (Cognitive-Inspired Elastic Reasoning) framework, the agent autonomously selects among multiple reasoning strategies per query, adapting computational effort according to task complexity. CogER-Agent models hierarchical human reasoning as a Markov Decision Process (MDP) and is trained with a Group Relative Policy Optimization (GRPO) algorithm. It supports tool-assisted reasoning, allowing the LLM to invoke external resources as necessary, and demonstrates significant efficiency and accuracy gains over prior scaling and routing methods (Hu et al., 17 Dec 2025).

1. Markov Decision Process Formulation

CogER-Agent's reasoning strategy selection is formalized as a finite-horizon MDP $(\mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R}, \pi)$. The state at time $t$ is $s_t = [x, y_{1:t-1}, L_i]$, where $x$ is the user query, $y_{1:t-1}$ the partial chain-of-thought (CoT), and $L_i$ the inferred complexity level ($L_i \in \{L_1, L_2, L_3, L_4\}$). Each state either contains an explicit complexity tag or maintains the information within the model's hidden state.

The action space $\mathcal{A}$ comprises four discrete high-level modes:

  • $\mathrm{No\_Think}$: Immediate answer (level-1, rapid inference)
  • $\mathrm{Think}$: Intermediate reasoning (e.g., invoking a larger LLM)
  • $\mathrm{Extend}$: Escalated chain-of-thought
  • $\mathrm{Delegate}$: External tool invocation

together with the vocabulary $\mathcal{V}$ for standard token emission, so $\mathcal{A} = \{\mathrm{No\_Think}, \mathrm{Think}, \mathrm{Extend}, \mathrm{Delegate}\} \cup \mathcal{V}$.
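
This joint action space can be represented with a short sketch; the tag strings and helper names below are illustrative assumptions, not the authors' implementation:

```python
from enum import Enum

class Mode(str, Enum):
    """The four high-level cognitive modes (tag strings are placeholders)."""
    NO_THINK = "<no_think>"    # level 1: answer immediately
    THINK = "<think>"          # level 2: intermediate reasoning (larger LLM)
    EXTEND = "<extend>"        # level 3: escalated chain-of-thought
    DELEGATE = "<delegate>"    # level 4: external tool invocation (CoTool)

def build_action_space(vocab: list[str]) -> list[str]:
    """Joint action space A = {No_Think, Think, Extend, Delegate} ∪ V."""
    return [m.value for m in Mode] + list(vocab)
```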

The transition function $\mathcal{T}(s_t, a_t)$ deterministically updates the state, either by appending a mode-selection tag or by emitting a token.

The reward function is a terminal scalar combining format validation, output correctness, and penalization for unnecessary cognitive escalation:

$$\mathcal{R}(s, a) = \mathcal{R}_{\mathrm{format}}(s, a) + \mathcal{R}_{\mathrm{accuracy}}(s, a) + \mathcal{R}_{\mathrm{hierarchy}}(s, a).$$

$\mathcal{R}_{\mathrm{format}}$ enforces well-formed tags, $\mathcal{R}_{\mathrm{accuracy}}$ rewards correct answers, and $\mathcal{R}_{\mathrm{hierarchy}}$ is defined as $b(L_{\min}(s)) - \delta(L_{\min}(s), L(s))$ with $b(L) = 0.5\,(L-1)$ and $\delta(L_{\min}, L) = 0.2\,(L - L_{\min})_+$, encouraging minimal sufficient reasoning.
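
As an illustration of the hierarchy term: a query whose minimal sufficient level is $L_{\min} = 1$ but which is answered at $L = 3$ receives $b(1) - \delta(1, 3) = 0 - 0.4 = -0.4$. The sketch below follows the coefficients stated above; the unit scaling of the format and accuracy terms is an assumption, since only the hierarchy coefficients are specified here.

```python
def hierarchy_reward(l_min: int, l_used: int) -> float:
    """R_hierarchy = b(L_min) - delta(L_min, L), with b(L) = 0.5*(L-1)
    and delta(L_min, L) = 0.2 * max(L - L_min, 0)."""
    b = 0.5 * (l_min - 1)
    delta = 0.2 * max(l_used - l_min, 0)
    return b - delta

def terminal_reward(format_ok: bool, correct: bool, l_min: int, l_used: int) -> float:
    """Terminal scalar R = R_format + R_accuracy + R_hierarchy.
    The 0/1 values for the format and accuracy terms are assumptions
    for illustration only."""
    r_format = 1.0 if format_ok else 0.0
    r_accuracy = 1.0 if correct else 0.0
    return r_format + r_accuracy + hierarchy_reward(l_min, l_used)

# Example: minimal sufficient level 1 answered at level 3 loses 0.4 reward.
assert abs(hierarchy_reward(1, 3) - (-0.4)) < 1e-9
```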

2. Policy Parameterization and Group Relative Policy Optimization

The policy $\pi_\theta(a_t \mid s_t)$ is a single-headed softmax over joint actions:

$$\pi_\theta(a_t \mid s_t) = \frac{\exp\left(f_\theta(s_t)_{a_t}\right)}{\sum_{a'} \exp\left(f_\theta(s_t)_{a'}\right)}$$

where $f_\theta$ outputs unnormalized logits.
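
One way to realize this single head is to treat the four mode tags as extra entries alongside the vocabulary logits; the split into two logit tensors below is purely an illustrative assumption:

```python
import torch
import torch.nn.functional as F

def joint_policy(logits_modes: torch.Tensor, logits_vocab: torch.Tensor) -> torch.Tensor:
    """pi_theta(a_t | s_t) as one softmax over mode tags and vocabulary tokens.

    logits_modes: (4,) unnormalized logits f_theta(s_t) for the mode tags.
    logits_vocab: (|V|,) unnormalized logits for ordinary token emission.
    """
    logits = torch.cat([logits_modes, logits_vocab], dim=-1)  # joint f_theta(s_t)
    return F.softmax(logits, dim=-1)                          # one distribution over A
```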

CogER-Agent is optimized using Group Relative Policy Optimization (GRPO), a PPO-style algorithm with KL regularization. For each group of $G$ rollouts $\{o_i\}_{i=1}^G$ sampled under the old policy $\pi_{\theta_{\mathrm{old}}}$, rewards $r_i$ are computed and normalized as $\tilde r_i = (r_i - \mathrm{mean}(r)) / \mathrm{std}(r)$, and the relative advantage $\hat A_i = \tilde r_i$ is assigned to each token in $o_i$. The GRPO objective is:

$$\mathcal{J}_{\mathrm{GRPO}}(\theta) = \mathbb{E}_{x, \{o_i\}} \left[ \frac{1}{G} \sum_{i=1}^G \min\left\{ \frac{\pi_\theta(o_i \mid x)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid x)} \hat{A}_i,\ \mathrm{clip}\!\left(\frac{\pi_\theta(o_i \mid x)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid x)},\, 1-\varepsilon,\, 1+\varepsilon\right)\hat{A}_i \right\} - \beta\, D_{\mathrm{KL}}\left[\pi_\theta \,\Vert\, \pi_{\mathrm{ref}}\right] \right]$$

where $\varepsilon$ is the PPO clipping threshold, $\beta$ the KL penalty coefficient, and $\pi_{\mathrm{ref}}$ a small reference policy.
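
A minimal sketch of the group-relative update under these definitions, assuming sequence-level log-probabilities and a simple sample-based KL estimate (the per-token broadcasting and the exact KL estimator used by the authors may differ):

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, rewards, eps=0.2, beta=0.01):
    """GRPO loss for one group of G rollouts.

    logp_new, logp_old, logp_ref: (G,) sequence-level log-probabilities of each
    rollout o_i under the current, old, and reference policies (logp_old and
    logp_ref should be computed without gradients).
    rewards: (G,) terminal scalar rewards R(s, a).
    """
    # Group-relative advantage: normalize rewards within the group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Importance ratio between the current and old policy.
    ratio = torch.exp(logp_new - logp_old)

    # PPO-style clipped surrogate, averaged over the group.
    surrogate = torch.min(ratio * adv,
                          torch.clamp(ratio, 1 - eps, 1 + eps) * adv).mean()

    # KL penalty toward the reference policy (simple sample-based estimate).
    kl = (logp_new - logp_ref).mean()

    # J_GRPO is maximized; return its negative for gradient descent.
    return -(surrogate - beta * kl)
```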

3. Training Procedure

Query difficulty is estimated in situ by the 7B-parameter CogER-Agent using a prompt requiring level assignment ($L_1$–$L_4$) prior to answer generation. For $L_1$, a direct answer is also output. For each query, $G = 12$ complete trajectories (rollouts) are sampled, each scored with the terminal reward defined in Section 1 and normalized within the group to yield the advantage.

Fine-tuning employs GRPO combined with LoRA adaptation (rank $r = 16$), using AdamW (learning rate $5 \times 10^{-5}$), a batch size of $24 \times 3$, and a maximum generation length of 8192 tokens. Training is performed over a single epoch on an 8,000-query mixture (2,000 each from GSM8K, MATH, CommonsenseQA, MedQA), with three random seeds ($\{21, 26, 42\}$) and mean ± standard deviation reported over the three runs.
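
The reported hyperparameters can be collected into a small configuration sketch; the field names are illustrative, not the authors' configuration schema:

```python
from dataclasses import dataclass, field

@dataclass
class CogERTrainConfig:
    lora_rank: int = 16                      # LoRA adaptation rank r
    learning_rate: float = 5e-5              # AdamW learning rate
    batch_size: tuple = (24, 3)              # reported as 24 x 3
    max_new_tokens: int = 8192               # maximum generation length
    group_size: int = 12                     # G rollouts per query for GRPO
    epochs: int = 1                          # single training epoch
    seeds: tuple = (21, 26, 42)              # results averaged over three runs
    train_mix: dict = field(default_factory=lambda: {
        "GSM8K": 2000, "MATH": 2000, "CommonsenseQA": 2000, "MedQA": 2000,
    })
```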

4. Cognitive Tool-Assisted Reasoning (CoTool)

For level-4 (Delegate) queries, CogER-Agent invokes a tool-augmented reasoning mechanism ("CoTool"). The agent alternates between internal generation and external tool invocation:

  • Generation proceeds until the LLM emits EOS or a tool query boundary token.
  • On emission of <|end_tool_query|>, the sub-query qtoolq_{\mathrm{tool}} is extracted.
  • A dedicated tool selector, prompted with task-specific instructions, chooses the optimal RSTKit tool and JSON arguments, executes the tool, and returns the output.
  • The tool's result is appended to the CoT (as <|begin_tool_result|>...<|end_tool_result|>), and internal reasoning resumes.
  • The process is iterated up to a maximum number of tool calls/turns.

Formally, the probability of the $i$-th tool query is:

$$P\!\left(q_{\mathrm{tool}}^{(i)} \mid I, q, R^{(i-1)}\right) = \prod_{t=1}^{T_q^{(i)}} P\!\left(q_{\mathrm{tool},t}^{(i)} \mid q_{\mathrm{tool},<t}^{(i)}, I, q, R^{(i-1)}, T_{\mathrm{results}}\right)$$

where $I$ is the instruction, $R^{(i-1)}$ the previous reasoning, and $T_{\mathrm{results}}$ the last tool output.
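
A simplified sketch of this generate/act loop follows; the llm.generate() and tool_selector.select_and_run() interfaces and the <|begin_tool_query|> tag are assumptions for illustration, since only the end-of-query and tool-result tags are named above:

```python
# Illustrative CoTool loop: alternate internal generation with tool calls.
BEGIN_QUERY, END_QUERY = "<|begin_tool_query|>", "<|end_tool_query|>"
BEGIN_RESULT, END_RESULT = "<|begin_tool_result|>", "<|end_tool_result|>"

def cotool_reason(llm, tool_selector, instruction: str, query: str,
                  max_tool_calls: int = 4) -> str:
    """Generate until EOS or the tool-call budget is exhausted."""
    context = f"{instruction}\n{query}"
    for _ in range(max_tool_calls):
        # Generate until EOS or the tool-query boundary token.
        segment = llm.generate(context, stop=[END_QUERY])
        context += segment
        if not segment.endswith(END_QUERY):
            break  # EOS reached: the model produced its final answer
        # Extract the i-th sub-query q_tool and delegate it: the tool selector
        # picks an RSTKit tool, builds JSON arguments, and executes it.
        sub_query = segment.rsplit(BEGIN_QUERY, 1)[-1].removesuffix(END_QUERY)
        result = tool_selector.select_and_run(sub_query)
        # Append the tool output to the chain of thought and resume reasoning.
        context += f"{BEGIN_RESULT}{result}{END_RESULT}"
    return context
```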

5. Empirical Evaluation and Ablation

CogER-Agent has been evaluated against contemporary test-time scaling and routing baselines across in-domain (ID; e.g., GSM8K, MATH, CommonsenseQA, MedQA) and out-of-domain (OOD; MAWPS, CollegeMath) tasks. Performance is measured by exact match (EM), parameter usage, latency, and tokens generated.

Main Results

Baseline          AVG-ID EM   AVG-OOD EM
DeepSeek-R1         81.55       83.00
S1-32B              78.80       81.32
ReasonFlux-32B      68.51       86.25
CogER-Agent         89.28       93.56

CogER-Agent improves average ID EM by a relative +9.5% over DeepSeek-R1 (89.28 vs. 81.55) and average OOD EM by a relative +12.8% (93.56 vs. 83.00).

Ablation by Reasoning Mode

Mode             ID EM   OOD EM
$L_1$ (7B)       76.28   86.23
$L_2$ (32B)      83.62   89.49
$L_3$ (QWQ)      86.75   93.13
$L_4$ (CoTool)   88.42   92.89
CogER (all)      89.28   93.56

Reward Component Ablation

Version                                  ID EM   OOD EM
Training-free prompt                     86.35   92.78
w/o $\mathcal{R}_{\mathrm{format}}$      87.37   93.42
w/o $\mathcal{R}_{\mathrm{hierarchy}}$   87.89   92.21
CogER full                               89.28   93.56

Eliminating the format or hierarchy term reduces performance and promotes excessive $L_4$ delegation.

Impact of CoTool (Math Benchmarks)

Version      MATH-500 EM   Tool-Invoc. Rate   CollegeMath EM   Tool-Invoc. Rate
w/o CoTool      87.20            —               87.93              —
+ CoTool        97.00           3.03%            89.04             5.17%

An absolute gain of nearly 10 points in MATH-500 EM (87.20 → 97.00) is achieved with tool invocation on only about 3% of queries.

Efficiency Metrics

Method            Parameters   Latency (s)   Tokens/query
DeepSeek-R1          671B        506.19         654.63
S1-32B                32B        273.47         946.70
ReasonFlux-32B        32B        286.97        1050.63
CogER-Agent         29.6B        118.53         489.71

CogER-Agent exhibits over a 4× speedup compared to DeepSeek-R1, with reduced token usage.

Routing Strategy Comparison

Strategy            ID EM   OOD EM
Uniform random      84.21   90.28
Supervised router   84.09   90.32
CogER-RL            89.28   93.56

RL-based MDP training yields clear advantages over naïve random or purely supervised routing approaches.

6. Significance and Implications

CogER-Agent delivers hierarchical reasoning by integrating per-query dynamic strategy selection, explicit reward signals for correctness and efficiency, and automated tool invocation. The MDP-based GRPO policy robustly adapts to workload variability, realizing both computational savings and substantial accuracy improvements across diverse benchmarks. The inclusion of Cognitive Tool-Assisted Reasoning grants the agent the capacity to handle previously intractable queries with minimal external resource usage. This suggests a promising research trajectory for LLM frameworks capable of fine-grained task allocation, hierarchical control, and selective tool use, with direct evidence that the reinforcement-learning formulation and hybrid reward function are superior to static or supervised-only alternatives in the context of elastic reasoning for LLMs (Hu et al., 17 Dec 2025).

