
CogER-Agent: Elastic Reasoning for LLMs

Updated 18 December 2025
  • CogER-Agent is a cognitive-inspired reinforcement learning agent that models hierarchical human reasoning as a finite-horizon MDP and adapts its strategy based on query complexity.
  • It employs Group Relative Policy Optimization and LoRA adaptation to fine-tune its policy, achieving significant accuracy and efficiency gains over traditional scaling methods.
  • The agent integrates tool-assisted reasoning by invoking external resources as needed, demonstrating superior performance across diverse benchmarks in both in-domain and out-of-domain tasks.

CogER-Agent is a cognitive-inspired reinforcement learning agent designed for elastic reasoning in LLMs. Developed within the CogER (Cognitive-Inspired Elastic Reasoning) framework, the agent autonomously selects among multiple reasoning strategies per query, adapting computational effort according to task complexity. CogER-Agent models hierarchical human reasoning as a Markov Decision Process (MDP) and is trained with a Group Relative Policy Optimization (GRPO) algorithm. It supports tool-assisted reasoning, allowing the LLM to invoke external resources as necessary, and demonstrates significant efficiency and accuracy gains over prior scaling and routing methods (Hu et al., 17 Dec 2025).

1. Markov Decision Process Formulation

CogER-Agent's reasoning strategy selection is formalized as a finite-horizon MDP $(\mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R}, \pi)$. The state at time $t$ is $s_t = [x, y_{1:t-1}, L_i]$, where $x$ is the user query, $y_{1:t-1}$ the partial chain-of-thought (CoT), and $L_i$ the inferred complexity level ($L_i \in \{L_1, L_2, L_3, L_4\}$). Each state either contains an explicit complexity tag or maintains the information within the model's hidden state.

The action space $\mathcal{A}$ comprises four discrete high-level modes:

  • $\mathrm{No\_Think}$: Immediate answer (level-1, rapid inference)
  • $\mathrm{Think}$: Intermediate reasoning (e.g., invoking a larger LLM)
  • $\mathrm{Extend}$: Escalated chain-of-thought
  • $\mathrm{Delegate}$: External tool invocation

together with the vocabulary $\mathcal{V}$ for standard token emission, so $\mathcal{A} = \{\mathrm{No\_Think}, \mathrm{Think}, \mathrm{Extend}, \mathrm{Delegate}\} \cup \mathcal{V}$.
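
This joint action space can be represented with a short sketch; the tag strings and helper names below are illustrative assumptions, not the authors' implementation:

```python
from enum import Enum

class Mode(str, Enum):
    """The four high-level cognitive modes (tag strings are placeholders)."""
    NO_THINK = "<no_think>"    # level 1: answer immediately
    THINK = "<think>"          # level 2: intermediate reasoning (larger LLM)
    EXTEND = "<extend>"        # level 3: escalated chain-of-thought
    DELEGATE = "<delegate>"    # level 4: external tool invocation (CoTool)

def build_action_space(vocab: list[str]) -> list[str]:
    """Joint action space A = {No_Think, Think, Extend, Delegate} ∪ V."""
    return [m.value for m in Mode] + list(vocab)
```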

The transition function $\mathcal{T}(s_t, a_t)$ deterministically updates the state, either by appending a mode-selection tag or by emitting a token.

The reward function is a terminal scalar combining format validation, output correctness, and penalization for unnecessary cognitive escalation:

$$\mathcal{R}(s, a) = \mathcal{R}_{\mathrm{format}}(s, a) + \mathcal{R}_{\mathrm{accuracy}}(s, a) + \mathcal{R}_{\mathrm{hierarchy}}(s, a).$$

$\mathcal{R}_{\mathrm{format}}$ enforces well-formed tags, $\mathcal{R}_{\mathrm{accuracy}}$ rewards correct answers, and $\mathcal{R}_{\mathrm{hierarchy}}$ is defined as $b(L_{\min}(s)) - \delta(L_{\min}(s), L(s))$ with $b(L) = 0.5\,(L-1)$ and $\delta(L_{\min}, L) = 0.2\,(L - L_{\min})_+$, encouraging minimal sufficient reasoning.
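
As an illustration of the hierarchy term: a query whose minimal sufficient level is $L_{\min} = 1$ but which is answered at $L = 3$ receives $b(1) - \delta(1, 3) = 0 - 0.4 = -0.4$. The sketch below follows the coefficients stated above; the unit scaling of the format and accuracy terms is an assumption, since only the hierarchy coefficients are specified here.

```python
def hierarchy_reward(l_min: int, l_used: int) -> float:
    """R_hierarchy = b(L_min) - delta(L_min, L), with b(L) = 0.5*(L-1)
    and delta(L_min, L) = 0.2 * max(L - L_min, 0)."""
    b = 0.5 * (l_min - 1)
    delta = 0.2 * max(l_used - l_min, 0)
    return b - delta

def terminal_reward(format_ok: bool, correct: bool, l_min: int, l_used: int) -> float:
    """Terminal scalar R = R_format + R_accuracy + R_hierarchy.
    The 0/1 values for the format and accuracy terms are assumptions
    for illustration only."""
    r_format = 1.0 if format_ok else 0.0
    r_accuracy = 1.0 if correct else 0.0
    return r_format + r_accuracy + hierarchy_reward(l_min, l_used)

# Example: minimal sufficient level 1 answered at level 3 loses 0.4 reward.
assert abs(hierarchy_reward(1, 3) - (-0.4)) < 1e-9
```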

2. Policy Parameterization and Group Relative Policy Optimization

The policy $\pi_\theta(a_t \mid s_t)$ is a single-headed softmax over joint actions:

$$\pi_\theta(a_t \mid s_t) = \frac{\exp\left(f_\theta(s_t)_{a_t}\right)}{\sum_{a'} \exp\left(f_\theta(s_t)_{a'}\right)}$$

where $f_\theta$ outputs unnormalized logits.
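
One way to realize this single head is to treat the four mode tags as extra entries alongside the vocabulary logits; the split into two logit tensors below is purely an illustrative assumption:

```python
import torch
import torch.nn.functional as F

def joint_policy(logits_modes: torch.Tensor, logits_vocab: torch.Tensor) -> torch.Tensor:
    """pi_theta(a_t | s_t) as one softmax over mode tags and vocabulary tokens.

    logits_modes: (4,) unnormalized logits f_theta(s_t) for the mode tags.
    logits_vocab: (|V|,) unnormalized logits for ordinary token emission.
    """
    logits = torch.cat([logits_modes, logits_vocab], dim=-1)  # joint f_theta(s_t)
    return F.softmax(logits, dim=-1)                          # one distribution over A
```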

CogER-Agent is optimized using Group Relative Policy Optimization (GRPO), a PPO-style algorithm with KL regularization. For each group of $G$ rollouts $\{o_i\}_{i=1}^G$ sampled under the old policy $\pi_{\theta_{\mathrm{old}}}$, rewards $r_i$ are computed and normalized as $\tilde r_i = (r_i - \mathrm{mean}(r)) / \mathrm{std}(r)$, and the relative advantage $\hat A_i = \tilde r_i$ is assigned to each token in $o_i$. The GRPO objective is:

$$\mathcal{J}_{\mathrm{GRPO}}(\theta) = \mathbb{E}_{x, \{o_i\}} \left[ \frac{1}{G} \sum_{i=1}^G \min\left\{ \frac{\pi_\theta(o_i \mid x)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid x)} \hat{A}_i,\ \mathrm{clip}\!\left(\frac{\pi_\theta(o_i \mid x)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid x)},\, 1-\varepsilon,\, 1+\varepsilon\right)\hat{A}_i \right\} - \beta\, D_{\mathrm{KL}}\left[\pi_\theta \,\Vert\, \pi_{\mathrm{ref}}\right] \right]$$

where $\varepsilon$ is the PPO clipping threshold, $\beta$ the KL penalty coefficient, and $\pi_{\mathrm{ref}}$ a small reference policy.
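
A minimal sketch of the group-relative update under these definitions, assuming sequence-level log-probabilities and a simple sample-based KL estimate (the per-token broadcasting and the exact KL estimator used by the authors may differ):

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, rewards, eps=0.2, beta=0.01):
    """GRPO loss for one group of G rollouts.

    logp_new, logp_old, logp_ref: (G,) sequence-level log-probabilities of each
    rollout o_i under the current, old, and reference policies (logp_old and
    logp_ref should be computed without gradients).
    rewards: (G,) terminal scalar rewards R(s, a).
    """
    # Group-relative advantage: normalize rewards within the group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Importance ratio between the current and old policy.
    ratio = torch.exp(logp_new - logp_old)

    # PPO-style clipped surrogate, averaged over the group.
    surrogate = torch.min(ratio * adv,
                          torch.clamp(ratio, 1 - eps, 1 + eps) * adv).mean()

    # KL penalty toward the reference policy (simple sample-based estimate).
    kl = (logp_new - logp_ref).mean()

    # J_GRPO is maximized; return its negative for gradient descent.
    return -(surrogate - beta * kl)
```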

3. Training Procedure

Query difficulty is estimated in situ by the 7B-parameter CogER-Agent using a prompt requiring level assignment ($L_1$–$L_4$) prior to answer generation. For $L_1$, a direct answer is also output. For each query, $G = 12$ complete trajectories (rollouts) are sampled, each scored with the terminal reward defined in Section 1 and normalized within the group to yield the advantage.

Fine-tuning employs GRPO combined with LoRA adaptation (rank $r = 16$), using AdamW (learning rate $5 \times 10^{-5}$), a batch size of $24 \times 3$, and a maximum generation length of 8192 tokens. Training is performed over a single epoch on an 8,000-query mixture (2,000 each from GSM8K, MATH, CommonsenseQA, MedQA), with three random seeds ($\{21, 26, 42\}$) and mean ± standard deviation reported over the three runs.
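
The reported hyperparameters can be collected into a small configuration sketch; the field names are illustrative, not the authors' configuration schema:

```python
from dataclasses import dataclass, field

@dataclass
class CogERTrainConfig:
    lora_rank: int = 16                      # LoRA adaptation rank r
    learning_rate: float = 5e-5              # AdamW learning rate
    batch_size: tuple = (24, 3)              # reported as 24 x 3
    max_new_tokens: int = 8192               # maximum generation length
    group_size: int = 12                     # G rollouts per query for GRPO
    epochs: int = 1                          # single training epoch
    seeds: tuple = (21, 26, 42)              # results averaged over three runs
    train_mix: dict = field(default_factory=lambda: {
        "GSM8K": 2000, "MATH": 2000, "CommonsenseQA": 2000, "MedQA": 2000,
    })
```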

4. Cognitive Tool-Assisted Reasoning (CoTool)

For level-4 (Delegate) queries, CogER-Agent invokes a tool-augmented reasoning mechanism ("CoTool"). The agent alternates between internal generation and external tool invocation:

  • Generation proceeds until the LLM emits EOS or a tool query boundary token.
  • On emission of <|end_tool_query|>, the sub-query qtoolq_{\mathrm{tool}} is extracted.
  • A dedicated tool selector, prompted with task-specific instructions, chooses the optimal RSTKit tool and JSON arguments, executes the tool, and returns the output.
  • The tool's result is appended to the CoT (as <|begin_tool_result|>...<|end_tool_result|>), and internal reasoning resumes.
  • The process is iterated up to a maximum number of tool calls/turns.

Formally, the probability of the $i$-th tool query is:

$$P\!\left(q_{\mathrm{tool}}^{(i)} \mid I, q, R^{(i-1)}\right) = \prod_{t=1}^{T_q^{(i)}} P\!\left(q_{\mathrm{tool},t}^{(i)} \mid q_{\mathrm{tool},<t}^{(i)}, I, q, R^{(i-1)}, T_{\mathrm{results}}\right)$$

where $I$ is the instruction, $R^{(i-1)}$ the previous reasoning, and $T_{\mathrm{results}}$ the last tool output.
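
A simplified sketch of this generate/act loop follows; the llm.generate() and tool_selector.select_and_run() interfaces and the <|begin_tool_query|> tag are assumptions for illustration, since only the end-of-query and tool-result tags are named above:

```python
# Illustrative CoTool loop: alternate internal generation with tool calls.
BEGIN_QUERY, END_QUERY = "<|begin_tool_query|>", "<|end_tool_query|>"
BEGIN_RESULT, END_RESULT = "<|begin_tool_result|>", "<|end_tool_result|>"

def cotool_reason(llm, tool_selector, instruction: str, query: str,
                  max_tool_calls: int = 4) -> str:
    """Generate until EOS or the tool-call budget is exhausted."""
    context = f"{instruction}\n{query}"
    for _ in range(max_tool_calls):
        # Generate until EOS or the tool-query boundary token.
        segment = llm.generate(context, stop=[END_QUERY])
        context += segment
        if not segment.endswith(END_QUERY):
            break  # EOS reached: the model produced its final answer
        # Extract the i-th sub-query q_tool and delegate it: the tool selector
        # picks an RSTKit tool, builds JSON arguments, and executes it.
        sub_query = segment.rsplit(BEGIN_QUERY, 1)[-1].removesuffix(END_QUERY)
        result = tool_selector.select_and_run(sub_query)
        # Append the tool output to the chain of thought and resume reasoning.
        context += f"{BEGIN_RESULT}{result}{END_RESULT}"
    return context
```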

5. Empirical Evaluation and Ablation

CogER-Agent has been evaluated against contemporary test-time scaling and routing baselines across in-domain (ID; e.g., GSM8K, MATH, CommonsenseQA, MedQA) and out-of-domain (OOD; MAWPS, CollegeMath) tasks. Performance is measured by exact match (EM), parameter usage, latency, and tokens generated.

Main Results

Baseline          AVG-ID EM   AVG-OOD EM
DeepSeek-R1         81.55       83.00
S1-32B              78.80       81.32
ReasonFlux-32B      68.51       86.25
CogER-Agent         89.28       93.56

CogER-Agent improves average ID EM by a relative +9.5% over DeepSeek-R1 (89.28 vs. 81.55) and average OOD EM by a relative +12.8% (93.56 vs. 83.00).

Ablation by Reasoning Mode

Mode             ID EM   OOD EM
$L_1$ (7B)       76.28   86.23
$L_2$ (32B)      83.62   89.49
$L_3$ (QWQ)      86.75   93.13
$L_4$ (CoTool)   88.42   92.89
CogER (all)      89.28   93.56

Reward Component Ablation

Version                                  ID EM   OOD EM
Training-free prompt                     86.35   92.78
w/o $\mathcal{R}_{\mathrm{format}}$      87.37   93.42
w/o $\mathcal{R}_{\mathrm{hierarchy}}$   87.89   92.21
CogER full                               89.28   93.56

Eliminating the format or hierarchy term reduces performance and promotes excessive $L_4$ delegation.

Impact of CoTool (Math Benchmarks)

Version      MATH-500 EM   Tool-Invoc. Rate   CollegeMath EM   Tool-Invoc. Rate
w/o CoTool      87.20            —               87.93              —
+ CoTool        97.00           3.03%            89.04             5.17%

An absolute gain of nearly 10 points in MATH-500 EM (87.20 → 97.00) is achieved with tool invocation on only about 3% of queries.

Efficiency Metrics

Method            Parameters   Latency (s)   Tokens/query
DeepSeek-R1          671B        506.19         654.63
S1-32B                32B        273.47         946.70
ReasonFlux-32B        32B        286.97        1050.63
CogER-Agent         29.6B        118.53         489.71

CogER-Agent exhibits over a 4× speedup compared to DeepSeek-R1, with reduced token usage.

Routing Strategy Comparison

Strategy            ID EM   OOD EM
Uniform random      84.21   90.28
Supervised router   84.09   90.32
CogER-RL            89.28   93.56

RL-based MDP training yields clear advantages over naïve random or purely supervised routing approaches.

6. Significance and Implications

CogER-Agent delivers hierarchical reasoning by integrating per-query dynamic strategy selection, explicit reward signals for correctness and efficiency, and automated tool invocation. The MDP-based GRPO policy robustly adapts to workload variability, realizing both computational savings and substantial accuracy improvements across diverse benchmarks. The inclusion of Cognitive Tool-Assisted Reasoning grants the agent the capacity to handle previously intractable queries with minimal external resource usage. This suggests a promising research trajectory for LLM frameworks capable of fine-grained task allocation, hierarchical control, and selective tool use, with direct evidence that the reinforcement-learning formulation and hybrid reward function are superior to static or supervised-only alternatives in the context of elastic reasoning for LLMs (Hu et al., 17 Dec 2025).

