Joint Optimization of Reasoning and Dual-Memory for Self-Learning Diagnostic Agent

Published 8 Apr 2026 in cs.CL | (2604.07269v1)

Abstract: Clinical expertise improves not only by acquiring medical knowledge, but by accumulating experience that yields reusable diagnostic patterns. Recent LLMs-based diagnostic agents have shown promising progress in clinical reasoning for decision support. However, most approaches treat cases independently, limiting experience reuse and continual adaptation. We propose SEA, a self-learning diagnostic agent with cognitively inspired dual-memory module. We design a reinforcement training framework tailored to our designed agent for joint optimization of reasoning and memory management. We evaluate SEA in two complementary settings. On standard evaluation with MedCaseReasoning dataset, SEA achieves 92.46% accuracy, outperforming the strongest baseline by +19.6%, demonstrating the benefit of jointly optimizing reasoning and memory. On the long-horizon with ER-Reason dataset, SEA attains the best final accuracy (0.7214) and the largest improvement (+0.35 Acc@100), while baseline methods show limited or unstable gains. Expert evaluation further indicates that rules consolidated from SEA show strong clinical correctness, usefulness and trust, suggesting that the induced rules in dual-memory module are reliable and practically meaningful. Overall, SEA improves both diagnostic reasoning ability and continual learning by effectively transforming experience into reusable knowledge.

Abstract PDF Upgrade to Chat

Authors (3)

Summary

The paper demonstrates that a dual-memory architecture jointly optimized with diagnostic reasoning via reinforcement learning leads to remarkable accuracy gains.
It introduces Group Relative Policy Optimization to balance diagnostic rewards and memory efficiency, ensuring effective adaptation to evolving clinical cases.
Empirical evaluations on MedCaseReasoning and ER-Reason datasets confirm the agent's superior performance and its ability to generalize in real-world clinical environments.

Joint Optimization of Reasoning and Dual-Memory for Self-Learning Diagnostic Agent

Motivation and Problem Setting

LLM-based clinical diagnostic agents have demonstrated substantial progress in automated reasoning and decision support. However, prevailing approaches consistently exhibit limitations in experience-driven adaptation; cases are processed independently, precluding mechanisms for experience accumulation and reuse. These shortcomings are pronounced in real clinical practice, where expert clinicians leverage both (i) formal medical knowledge and (ii) experiential diagnostic heuristics distilled from prior cases, especially for rare diseases and presentations with subtle cues. To address this gap, the paper introduces a self-learning agent trained via reinforcement learning to jointly optimize diagnostic reasoning and the management of a cognitively inspired dual-memory module.

Dual-Memory Architecture and Policy Framework

The proposed agent architecture is centered on two memory clusters: a bounded-capacity short-term cluster retaining recent cases annotated with outcomes, and a long-term cluster storing abstracted diagnostic rules synthesized from evicted cases. The dual-memory design reflects clinical reasoning patterns, enforcing explicit decisions about what information to retain, consolidate, or abstract.

At each decision round $t$ , the policy model $\pi_\theta$ observes a new patient case $x_t$ , possibly invokes memory operations (list, append, pop, consolidate), and emits the final output $o_t$ (diagnosis and structured reasoning). Policy updates are performed via Group Relative Policy Optimization (GRPO), leveraging round-wise rewards that combine diagnostic correctness and memory efficiency.

Figure 1: Agent structure: at each round, the model processes a patient case, operates on dual-memory, and updates policy via sampled rollouts and rewards.

Reward Modeling and Optimization

The reward function is decomposed into diagnostic and memory components:

Diagnostic Reward: $r_t^\text{diag}$ provides strong supervision for prediction correctness (+5 for correct, -5 otherwise).
Memory Reward: $r_t^\text{mem}$ penalizes large occupancy of short-term memory, enforcing timely consolidation; $-\alpha \cdot |\mathcal{M}^{S}_t|/K$ with explicit schedule interpolation across rounds.

A linear interpolation is employed to modulate the weighting of rewards across the temporal stream, accentuating memory formation in early rounds and diagnostic accuracy in later stages. This round-wise reward shaping mitigates cold-start variance and incentivizes staged learning.

Experimental Evaluation

Standard Evaluation: MedCaseReasoning Dataset

Evaluation on MedCaseReasoning (closed-set, rare and complex cases) demonstrates pronounced gains:

SEA (Qwen-8B) achieves 92.46% accuracy (+19.6% improvement over strongest baseline) and 86.81 macro-F1.
RL with outcome-only rewards yields modest improvements (77.32%), confirming that guided memory optimization is essential.
Naive memory-augmented baselines degrade performance (e.g., ReAct+ShortTerm-Memory drops to 42.51%), underscoring the necessity of structured memory management.

Long-Horizon Evaluation: ER-Reason Dataset

In stream-based deployment settings (ER-Reason), the self-learning agent outperforms all variants:

SEA (Qwen-8B) attains final accuracy of 0.7214 and $\Delta$ Acc@100 of +0.35.
Dual-memory augmentation is plug-and-play and exhibits generalizability, significantly boosting adaptation in GPT-5.2 and SFT-based models, but unregulated memory use causes retrieval-induced noise.
Figure 2 displays accuracy trajectories over 100 rounds.
Figure 2: Accuracy trajectories for representative methods, illustrating sustained improvement and superior final performance of SEA.

Ablation and Expert Evaluation

Ablation experiments on MedCaseReasoning validate the criticality of each module:

Removing dual-memory or memory-management reward results in sharp declines (e.g., accuracy drops to 0.4741 below zero-shot baseline).
Short-term memory enhances alignment with instance context; long-term memory provides generalization, but both require regulated management.

Clinical expert evaluation (Likert scale, 5-point) yields high scores for rule practical utility (3.946), clinical correctness (3.865), and trust (4.216), confirming that induced rules are reliable and applicable. Lower case-relevance (2.595) aligns with the intentional abstraction for generalization.

Theoretical and Practical Implications

This work establishes that structured dual-memory enables continual adaptation in diagnostic agents without parameter updates, challenging the dependency on static context and retraining. Experience-driven memory consolidation is fundamental for sustained learning in deployment environments with distributional drift and sparse supervision. The dual-memory module is externally implementable, facilitating rapid integration with diverse LLM architectures. The reinforcement framework jointly optimizes reasoning and memory, ensuring traceable, auditable decision-making.

Future Directions

Key extensions include broadening the action space (e.g., evidence gathering), integrating multi-modal inputs (lab findings, imaging), establishing safety and privacy protocols for clinical deployment, and formalizing memory consolidation algorithms for richer abstraction. Exploration of hierarchical memory structures and adaptive compression schemes will improve scalability in larger, more heterogeneous environments. Long-horizon RL remains pivotal for robust adaptation under non-stationary clinical streams.

Conclusion

The joint optimization of reasoning and dual-memory mechanisms demonstrably improves both static and deployment-phase diagnostic accuracy, enabling agents to transform experience into reusable diagnostic knowledge. The agent substantially outperforms baselines across multiple metrics and settings, with dual-memory providing modular, generalizable adaptation capabilities and memory-regulated reinforcement driving sustained improvement (2604.07269).