- The paper introduces RSCB-MC, a risk-sensitive memory controller that actively abstains from unsafe memory retrieval to enhance LLM coding agent safety.
- It employs a pattern-variant-episode schema and a 16-feature contextual state for fine-grained compatibility assessment, achieving 0% false-positive injections.
- Empirical results demonstrate improved non-oracle success rates and reduced operational risk, underscoring the need for safety-centric memory usage over conventional retrieval.
Risk-Sensitive Contextual Bandit Memory Control for LLM-Based Coding Agents
LLM-based coding agents rely increasingly on memory systems to retrieve historical debugging experiences and operational knowledge. However, superficial retrieval based on lexical or surface-level similarity often leads to the unsafe injection of past solutions, particularly in complex debugging tasks where symptom overlap masks divergent root causes. The paper "Learning When to Remember: Risk-Sensitive Contextual Bandits for Abstention-Aware Memory Retrieval in LLM-Based Coding Agents" (2604.27283) reframes the issue as a risk-sensitive decision problem rather than a conventional top-k retrieval task. Unsafe memory use can irreversibly bias the agent’s repair trajectory, waste context resources, and amplify cascading errors, a critical safety risk that prior systems do not explicitly address.
RSCB-MC: System Design
The proposed Risk-Sensitive Contextual Bandit Memory Controller (RSCB-MC) operates as an intermediary between a memory retrieval subsystem and agent action selection. It does not simply choose the most similar memory but dynamically decides, per context, whether memory should influence agent reasoning at all. The primary design principles are:
- Pattern-Variant-Episode Schema: Memory is structured at three granularities—patterns (root-cause family), variants (contextualized fix strategies), and episodes (validated instances)—enabling fine-grained compatibility assessment beyond surface similarity.
- 16-Feature Contextual State: Each potential memory-use instance is encoded as a fixed-dimensional vector capturing retrieval certainty, failure family fit, feedback history, operational cost, and prior rejection/acceptance rates.
- Explicit Abstention and Non-Injection Actions: Actions such as abstain and no-memory are first-class, not fallback, and the system is intentionally biased to prefer abstention over potentially hazardous memory injections.
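The three design principles above can be pictured as plain data structures. The following is a minimal sketch, not the paper's implementation: the class and field names (`Pattern`, `Variant`, `Episode`, `fix_strategy`, and so on), the `INJECT` action, and the helper functions are all hypothetical. Only the three-level schema, the 16-dimensional state, and the abstain and no-memory actions come from the text.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Sequence

# The source specifies a 16-feature state; the individual feature
# names are not given here, so only the dimension is enforced.
STATE_DIM = 16


class Action(Enum):
    # Abstention and non-injection are ordinary actions, not fallbacks.
    INJECT = "inject"        # hypothetical name for "use the memory"
    NO_MEMORY = "no_memory"
    ABSTAIN = "abstain"


@dataclass
class Episode:
    """A single recorded instance of a fix."""
    episode_id: str
    validated: bool


@dataclass
class Variant:
    """A contextualized fix strategy within a root-cause family."""
    variant_id: str
    fix_strategy: str
    episodes: List[Episode] = field(default_factory=list)


@dataclass
class Pattern:
    """A root-cause family grouping related variants."""
    pattern_id: str
    root_cause_family: str
    variants: List[Variant] = field(default_factory=list)


def validated_episode_count(pattern: Pattern) -> int:
    """Count validated episodes across all variants of a pattern."""
    return sum(e.validated for v in pattern.variants for e in v.episodes)


def make_state(values: Sequence[float]) -> List[float]:
    """Build a fixed-dimensional contextual state, rejecting wrong sizes."""
    if len(values) != STATE_DIM:
        raise ValueError(f"expected {STATE_DIM} features, got {len(values)}")
    return [float(v) for v in values]
```

Keeping episodes under variants and variants under patterns lets a controller ask compatibility questions at the right level, for example whether a root-cause family matches even when no validated episode exists for the current variant.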
Risk and Reward Modeling
The reward design is explicitly risk-forward: false-positive memory injection incurs a larger penalty than the reward for correct reuse (γ>α>β). The action selection mechanism combines:
- Empirical action quality and adoption priors
- False-positive risk estimates, contextually regularized
- Token and latency costs, reflecting context budget constraints
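The action score is formally

$$S(a, z_t) = R(a, z_t) - \mu\, P_{\mathrm{fp}}(a, z_t) - \alpha\, c(a, z_t)$$

where $R$ is the estimated reward of action $a$ in context $z_t$, $P_{\mathrm{fp}}$ is the estimated false-positive risk, $c$ is the token and latency cost, and $\mu$ and $\alpha$ weight the risk and cost terms. Additional terms cover exploration and calibrated contextual reward/risk modeling. Safety overrides block injection when historically unsafe injection signals or ambiguous evidence are present.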
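To make the scoring rule concrete, here is a minimal sketch of the three-term score with a simple safety override. It is an illustration under stated assumptions: the exploration and calibration terms are omitted, the override is modeled as a hard cap on estimated false-positive risk (`fp_risk_cap`, a hypothetical parameter), and the numeric values are invented for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ActionEstimate:
    name: str
    reward: float   # R(a, z_t): estimated reward
    fp_risk: float  # P_fp(a, z_t): estimated false-positive risk
    cost: float     # c(a, z_t): token and latency cost


def score(est: ActionEstimate, mu: float, alpha: float) -> float:
    """S = R - mu * P_fp - alpha * c (exploration terms omitted)."""
    return est.reward - mu * est.fp_risk - alpha * est.cost


def select_action(estimates: List[ActionEstimate], mu: float,
                  alpha: float, fp_risk_cap: float) -> str:
    """Pick the highest-scoring action that passes the safety override.

    Assumption: the override is a hard cap on estimated false-positive
    risk. Abstain is never filtered out, and it is the fallback when
    nothing else is allowed.
    """
    allowed = [e for e in estimates
               if e.name == "abstain" or e.fp_risk <= fp_risk_cap]
    if not allowed:
        return "abstain"
    return max(allowed, key=lambda e: score(e, mu, alpha)).name


# Invented numbers, for illustration only.
candidates = [
    ActionEstimate("inject", reward=0.8, fp_risk=0.3, cost=0.1),
    ActionEstimate("abstain", reward=0.1, fp_risk=0.0, cost=0.0),
]
```

With these toy numbers, a moderate risk weight lets injection win, while raising `mu` or tightening `fp_risk_cap` flips the choice to abstention. That is the lever the risk term is meant to provide.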
Empirical Evaluation
Retrieval Robustness
On canonical error cases, deterministic retrieval saturates performance metrics (Recall@1/3, MRR, nDCG@3). However, these results overstate real-world robustness: paraphrased or hard-negative cases highlight that static retrieval approaches have high false-positive rates (up to 75%) due to inadequate structural and compatibility reasoning.
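For reference, the ranking metrics named above can be computed as follows. This sketch uses the standard binary-relevance definitions of Recall@k, reciprocal rank (averaged over queries to give MRR), and nDCG@k; the function names and the example memory identifiers are hypothetical, and the paper's exact evaluation code may differ.

```python
import math
from typing import List, Set


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant item, or 0 if none is retrieved."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance nDCG: DCG of the ranking over DCG of an ideal one."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, item in enumerate(ranked[:k], start=1)
              if item in relevant)
    ideal_hits = min(len(relevant), k)
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / ideal if ideal > 0 else 0.0


# One query where the correct memory is ranked second.
ranked = ["m2", "m1", "m3"]
relevant = {"m1"}
```

All three metrics only ask whether the right memory was ranked highly. None of them records whether a highly ranked but incompatible memory was injected, which is why saturated scores can coexist with high false-positive rates on hard negatives.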
Safety and Abstention
RSCB-MC achieves 0.0% false-positive injection on hard-negative sets, in contrast to static retrieval and even to baseline risk-sensitive bandits. Explicit ablation studies show that eliminating abstention increases false positives by 17.5% and sharply reduces cumulative reward. The controller is conservative and tends to over-abstain on ambiguous cases, which underscores how sensitive operational safety is to the choice of risk threshold.
Offline Replay and Hot-Path Validation
In deterministic offline replay, RSCB-MC achieves the highest non-oracle replay success rate of 62.5% with zero false-positive memory usage. In a 200-case live hot-path validation, the system sustains 60.5% proxy success and 0.0% false positives at a p95 latency of 331 microseconds. These results support the claim that memory-use safety, not retrieval accuracy, is the primary challenge in production-like settings.
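The three headline numbers (success rate, false-positive injection rate, p95 latency) can be derived from a replay log along the following lines. This is a sketch under assumptions: the record fields are hypothetical, a false positive is taken to mean "memory was injected although it was incompatible" with all cases as the denominator, and p95 uses the nearest-rank method. The paper may define these quantities differently.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ReplayRecord:
    injected: bool      # did the controller inject memory?
    compatible: bool    # was the retrieved memory actually compatible?
    success: bool       # proxy success label for the case
    latency_us: float   # controller decision latency in microseconds


def summarize(records: List[ReplayRecord]) -> Dict[str, float]:
    """Compute success rate, false-positive injection rate, and p95 latency."""
    n = len(records)
    if n == 0:
        raise ValueError("no records to summarize")
    false_positives = sum(r.injected and not r.compatible for r in records)
    successes = sum(r.success for r in records)
    latencies = sorted(r.latency_us for r in records)
    # Nearest-rank percentile: rank = ceil(0.95 * n), in integer arithmetic.
    rank = (95 * n + 99) // 100
    return {
        "success_rate": successes / n,
        "fp_injection_rate": false_positives / n,
        "p95_latency_us": latencies[rank - 1],
    }


# Ten synthetic records, for illustration only.
demo = [
    ReplayRecord(injected=i < 4, compatible=True, success=i < 6,
                 latency_us=100.0 * (i + 1))
    for i in range(10)
]
```

Calling `summarize(demo)` on the synthetic log gives a 0.6 success rate, a 0.0 false-positive injection rate, and a p95 latency of 1000 microseconds. Real replay logs would replace the synthetic records.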
Noteworthy claims:
- A risk-sensitive controller with abstention raises non-oracle success rates while keeping the false-positive injection rate at 0.0%.
- Removing abstention as a first-class action markedly increases operational risk and collapses cumulative reward.
- Context-budget efficiency is non-monotonic: injecting more retrieved memory does not always increase utility due to token cost and increased opportunity for unsafe reuse.
Comparative Analysis and Implications
Contrast with Prior Retrieval-Augmented Generation (RAG): Existing RAG and knowledge-boundary approaches optimize retrieval for downstream answer quality but do not penalize operational hazards induced by faulty memory injection. The key divergence is that, in code agents, the cost of a wrong decision often manifests as hidden, cumulative operational errors rather than simple answer accuracy degradation. The pattern-variant-episode schema and exposure of operational features (e.g., command, path, stack compatibility) position RSCB-MC to address these coding agent-specific safety requirements.
Relation to Program Repair Systems: Established program-repair systems (e.g., RAP-Gen, ReAPR) assume that sufficiently similar bug-fix pairs are always useful for injection (Zhang et al., 29 Jun 2025, Yan et al., 27 Aug 2025). Empirical evidence from RSCB-MC indicates that this assumption is not safe, which argues for a gating layer that sits in front of any prompt-level or code-editing influence from external recall.
Limitations and Operational Boundaries: All results are based on deterministic, offline, smoke-scale artifacts and proxy labels, not live, large-scale agent deployments. The controller does not address long-term memory management (aging, contradiction, or privacy), multi-turn credit assignment, or adaptive memory writing—these remain open integration points with richer RL-based memory systems (Yu et al., 5 Jan 2026). Over-conservatism in abstention also reveals calibration gaps.
Future Directions
Research should pursue:
- Full-scale, causal online LLM-agent evaluations to validate real-world impact on codebases and debuggability.
- Improved abstention calibration, balancing safety and utility.
- Integration with dynamic, write-path memory management for closed-loop robustness and aging handling.
- Joint context-budget and risk-aware optimization, given findings that risk-adjusted utility does not monotonically track with increased memory usage.
Conclusion
RSCB-MC operationalizes a risk-sensitive, abstention-aware control layer for memory utilization in LLM-based coding agents. By shifting focus from similarity-based injection to selective, safety-prioritized memory use, it addresses a core deficiency in current memory-augmented LLM systems. Critically, the empirical evidence indicates that explicit abstention is essential for robust safety guarantees. Future advances should incorporate online evaluation, more precise risk estimation, and tighter coupling with adaptive memory management to realize fully dependable agent deployments.
Reference:
"Learning When to Remember: Risk-Sensitive Contextual Bandits for Abstention-Aware Memory Retrieval in LLM-Based Coding Agents" (2604.27283)