Learning When to Remember: Risk-Sensitive Contextual Bandits for Abstention-Aware Memory Retrieval in LLM-Based Coding Agents

Published 30 Apr 2026 in cs.CL, cs.AI, and cs.LG | (2604.27283v1)

Abstract: LLM-based coding agents increasingly rely on external memory to reuse prior debugging experience, repair traces, and repository-local operational knowledge. However, retrieved memory is useful only when the current failure is genuinely compatible with a previous one; superficial similarity in stack traces, terminal errors, paths, or configuration symptoms can lead to unsafe memory injection. This paper reframes issue-memory use as a selective, risk-sensitive control problem rather than a pure top-k retrieval problem. We introduce RSCB-MC, a risk-sensitive contextual bandit memory controller that decides whether an agent should use no memory, inject the top resolution, summarize multiple candidates, perform high-precision or high-recall retrieval, abstain, or ask for feedback. The system stores reusable issue knowledge through a pattern-variant-episode schema and converts retrieval evidence into a fixed 16-feature contextual state capturing relevance, uncertainty, structural compatibility, feedback history, false-positive risk, latency, and token cost. Its reward design penalizes false-positive memory injection more strongly than missed reuse, making non-injection and abstention first-class safety actions. In deterministic smoke-scale artifacts, RSCB-MC obtains the strongest non-oracle offline replay success rate, 62.5%, while maintaining a 0.0% false-positive rate. In a bounded 200-case hot-path validation, it reaches 60.5% proxy success with 0.0% false positives and a 331.466 microseconds p95 decision latency. The results show that, for coding-agent memory, the key question is not only which memory is most similar, but whether any retrieved memory is safe enough to influence the debugging trajectory.


Authors (1)

Mehmet Iscan

Summary

The paper introduces RSCB-MC, a risk-sensitive memory controller that actively abstains from unsafe memory retrieval to enhance LLM coding agent safety.
It employs a pattern-variant-episode schema and a 16-feature contextual state for fine-grained compatibility assessment, achieving 0% false-positive injections.
Empirical results demonstrate improved non-oracle success rates and reduced operational risk, underscoring the need for safety-centric memory usage over conventional retrieval.

Risk-Sensitive Contextual Bandit Memory Control for LLM-Based Coding Agents

Problem Formulation and Motivation

LLM-based coding agents increasingly rely on memory systems to retrieve historical debugging experiences and operational knowledge. However, retrieval based on lexical or surface-level similarity often leads to the unsafe injection of past solutions, particularly in complex debugging tasks where symptom overlap masks divergent root causes. The paper "Learning When to Remember: Risk-Sensitive Contextual Bandits for Abstention-Aware Memory Retrieval in LLM-Based Coding Agents" (2604.27283) reframes the issue as a risk-sensitive decision problem rather than a conventional top-$k$ retrieval task. Unsafe memory use can irreversibly bias the agent’s repair trajectory, waste context resources, and amplify cascading errors, a safety risk that, according to the paper, prior systems do not explicitly address.

RSCB-MC: System Design

The proposed Risk-Sensitive Contextual Bandit Memory Controller (RSCB-MC) operates as an intermediary between a memory retrieval subsystem and agent action selection. It does not simply choose the most similar memory but dynamically decides, per context, whether memory should influence agent reasoning at all. The primary design principles are:

Pattern-Variant-Episode Schema: Memory is structured at three granularities—patterns (root-cause family), variants (contextualized fix strategies), and episodes (validated instances)—enabling fine-grained compatibility assessment beyond surface similarity.
16-Feature Contextual State: Each potential memory-use instance is encoded as a fixed-dimensional vector capturing retrieval certainty, failure family fit, feedback history, operational cost, and prior rejection/acceptance rates.
Explicit Abstention and Non-Injection Actions: Actions such as abstain and no-memory are first-class, not fallback, and the system is intentionally biased to prefer abstention over potentially hazardous memory injections.
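The three design principles above can be pictured as plain data structures. The following is a minimal sketch, not the paper's implementation: the seven actions come from the abstract, but every identifier, field name, and the `NON_INJECTING` set are hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class MemoryAction(Enum):
    """The seven controller actions named in the abstract (identifiers are hypothetical)."""
    NO_MEMORY = "no_memory"
    INJECT_TOP = "inject_top_resolution"
    SUMMARIZE_MULTI = "summarize_multiple_candidates"
    RETRIEVE_HIGH_PRECISION = "high_precision_retrieval"
    RETRIEVE_HIGH_RECALL = "high_recall_retrieval"
    ABSTAIN = "abstain"
    ASK_FEEDBACK = "ask_for_feedback"


# Assumption: these are the actions that never place retrieved memory in context.
NON_INJECTING = frozenset({MemoryAction.NO_MEMORY, MemoryAction.ABSTAIN})


@dataclass
class Episode:
    """A validated instance of a fix being applied (fields are illustrative)."""
    episode_id: str
    error_text: str
    resolved: bool


@dataclass
class Variant:
    """A contextualized fix strategy within a pattern."""
    variant_id: str
    fix_summary: str
    episodes: list[Episode] = field(default_factory=list)


@dataclass
class Pattern:
    """A root-cause family that groups related variants."""
    pattern_id: str
    root_cause: str
    variants: list[Variant] = field(default_factory=list)

    def validated_episode_count(self) -> int:
        # Count episodes whose fix was confirmed to resolve the failure.
        return sum(e.resolved for v in self.variants for e in v.episodes)
```

Treating `ABSTAIN` and `NO_MEMORY` as ordinary enum members, rather than as a `None` fallback, mirrors the point that non-injection is a first-class action the controller can score and select.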

Risk and Reward Modeling

The reward design is explicitly risk-averse: false-positive memory injection is penalized more heavily than missed reuse ($\gamma > \alpha > \beta$). The action selection mechanism combines:

Empirical action quality and adoption priors
False-positive risk estimates, contextually regularized
Token and latency costs, reflecting context budget constraints

The action score is formally

$S(a, z_t) = R(a, z_t) - \mu \mathrm{P}_{\text{fp}}(a, z_t) - \alpha c(a, z_t)$

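Here $R$ is the estimated action quality, $\mathrm{P}_{\text{fp}}$ the false-positive risk estimate, and $c$ the token and latency cost, weighted by $\mu$ and $\alpha$ respectively. The full score includes additional terms for exploration and calibrated contextual reward/risk modeling. Safety overrides block injection actions when historically unsafe injection signals or ambiguous evidence are present.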
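To make the scoring rule concrete, here is a minimal sketch of $S(a, z_t) = R - \mu \mathrm{P}_{\text{fp}} - \alpha c$ with a simple safety override. The exploration and calibration terms are omitted, and the weights (`mu=2.0`, `alpha=0.1`), the `risk_cap` threshold, and all class and function names are assumptions chosen for illustration, not values from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActionEstimate:
    name: str
    reward: float   # R(a, z_t): estimated action quality
    fp_risk: float  # P_fp(a, z_t): estimated false-positive risk
    cost: float     # c(a, z_t): token and latency cost
    injects: bool   # True if the action places memory in the agent's context


def score(est: ActionEstimate, mu: float, alpha: float) -> float:
    """Risk- and cost-adjusted score: R - mu * P_fp - alpha * c."""
    return est.reward - mu * est.fp_risk - alpha * est.cost


def select_action(estimates, mu=2.0, alpha=0.1, risk_cap=0.5):
    """Pick the best-scoring action after applying a safety override.

    The override (an assumption here) removes any injecting action whose
    estimated false-positive risk exceeds risk_cap, regardless of its score.
    """
    allowed = [e for e in estimates if not (e.injects and e.fp_risk > risk_cap)]
    if not allowed:
        raise ValueError("no admissible action; include a non-injecting action")
    return max(allowed, key=lambda e: score(e, mu, alpha))


if __name__ == "__main__":
    candidates = [
        ActionEstimate("abstain", reward=0.1, fp_risk=0.0, cost=0.0, injects=False),
        ActionEstimate("inject_top", reward=0.9, fp_risk=0.45, cost=0.2, injects=True),
    ]
    # The risk term outweighs the higher raw reward, so abstain is selected.
    print(select_action(candidates).name)
```

Because $\mu$ multiplies the risk estimate directly, a high-reward injection can still lose to abstention once its false-positive risk is moderately large, even before the hard override applies.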

Empirical Evaluation

Retrieval Robustness

On canonical error cases, deterministic retrieval saturates the standard ranking metrics (Recall@1/3, MRR, nDCG@3). However, these results overstate real-world robustness: on paraphrased or hard-negative cases, static retrieval approaches show high false-positive rates (up to 75%) because they lack structural and compatibility reasoning.
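For readers unfamiliar with the ranking metrics named above, the sketch below computes them under binary relevance using their standard definitions. The function names and the toy data are illustrative and are not taken from the paper's evaluation code.

```python
import math


def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant items that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (ranked, relevant) pairs, one per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance nDCG: discounted gain divided by the ideal ordering's gain."""
    dcg = sum(
        1.0 / math.log2(pos + 1)
        for pos, item in enumerate(ranked[:k], start=1)
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(pos + 1) for pos in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0
```

None of these metrics penalize a confidently retrieved but incompatible memory, which is why a retriever can saturate them and still produce false-positive injections on hard negatives.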

Safety and Abstention

RSCB-MC achieves 0.0% false-positive injection on hard-negative sets, unlike static retrieval and even the baseline risk-sensitive bandits. Ablation studies show that removing abstention increases false positives by 17.5% and sharply reduces cumulative reward. The controller is conservative and tends to over-abstain on ambiguous cases, which shows how sensitive the safety-utility trade-off is to the choice of risk threshold.
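One way to see why an asymmetric penalty produces conservative behavior is a simple expected-value threshold. The sketch below is not the paper's reward function: it assumes a single compatibility probability `p`, a reward for correct reuse, a penalty for false-positive injection, and a fixed reward for abstaining, all with hypothetical values. Injection is worthwhile only when `p * reuse_reward - (1 - p) * fp_penalty > abstain_reward`, which rearranges to the threshold computed here.

```python
def min_confidence_to_inject(reuse_reward, fp_penalty, abstain_reward=0.0):
    """Smallest compatibility probability at which injecting beats abstaining.

    Derived from p * reuse_reward - (1 - p) * fp_penalty > abstain_reward,
    i.e. p > (abstain_reward + fp_penalty) / (reuse_reward + fp_penalty).
    """
    if reuse_reward + fp_penalty <= 0:
        raise ValueError("reuse_reward + fp_penalty must be positive")
    return (abstain_reward + fp_penalty) / (reuse_reward + fp_penalty)


def should_inject(p_compatible, reuse_reward, fp_penalty, abstain_reward=0.0):
    """True if the expected value of injecting exceeds that of abstaining."""
    threshold = min_confidence_to_inject(reuse_reward, fp_penalty, abstain_reward)
    return p_compatible > threshold


if __name__ == "__main__":
    # Symmetric stakes: inject above 50% confidence.
    print(min_confidence_to_inject(reuse_reward=1.0, fp_penalty=1.0))
    # A false positive costs three times the reuse reward: threshold rises to 75%.
    print(min_confidence_to_inject(reuse_reward=1.0, fp_penalty=3.0))
```

Raising the penalty moves the threshold toward 1, so a controller tuned this way abstains on more ambiguous cases, consistent with the over-abstention noted above.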

Offline Replay and Hot-Path Validation

In deterministic offline replay, RSCB-MC achieves the highest non-oracle replay success rate, 62.5%, with zero false-positive memory usage. In a bounded 200-case hot-path validation, the system reaches 60.5% proxy success and 0.0% false positives at a p95 decision latency of 331.466 microseconds. Within these smoke-scale settings, the results suggest that memory-use safety, not only retrieval accuracy, is the central challenge.
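A p95 decision latency like the one reported above can be measured with a small timing harness. The sketch below is a generic illustration, not the paper's benchmark: `decide` stands in for any controller callable, and the nearest-rank percentile definition is an assumption, since the summary does not say which percentile method was used.

```python
import time


def percentile_nearest_rank(samples, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not samples:
        raise ValueError("samples must be non-empty")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be between 1 and 100")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100 avoids floating-point rounding issues.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


def time_decisions(decide, contexts):
    """Return per-call latency in microseconds for decide(ctx) over contexts."""
    latencies_us = []
    for ctx in contexts:
        start = time.perf_counter_ns()
        decide(ctx)
        latencies_us.append((time.perf_counter_ns() - start) / 1_000)
    return latencies_us


if __name__ == "__main__":
    # Hypothetical stand-in controller: any callable taking a context works here.
    latencies = time_decisions(lambda ctx: sum(ctx), [[0.1] * 16 for _ in range(200)])
    print(f"p95 decision latency: {percentile_nearest_rank(latencies, 95):.3f} us")
```

With 200 cases, the nearest-rank p95 is the 190th smallest latency, so the figure reflects the slow tail and not the typical decision.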

Noteworthy claims:

A risk-sensitive controller with abstention raises non-oracle success rates while keeping the false-positive injection rate at 0.0%.
Removing abstention as a first-class action markedly increases operational risk and sharply reduces cumulative reward.
Context-budget efficiency is non-monotonic: injecting more retrieved memory does not always increase utility due to token cost and increased opportunity for unsafe reuse.

Comparative Analysis and Implications

Contrast with Prior Retrieval-Augmented Generation (RAG): Existing RAG and knowledge-boundary approaches optimize retrieval for downstream answer quality but do not penalize operational hazards induced by faulty memory injection. The key divergence is that, in code agents, the cost of a wrong decision often manifests as hidden, cumulative operational errors rather than simple answer accuracy degradation. The pattern-variant-episode schema and exposure of operational features (e.g., command, path, stack compatibility) position RSCB-MC to address these coding agent-specific safety requirements.

Relation to Program Repair Systems: Established program-repair systems (e.g., RAP-Gen, ReAPR) assume that sufficiently similar bug-fix pairs are always useful for injection (Zhang et al., 29 Jun 2025, Yan et al., 27 Aug 2025). Empirical evidence from RSCB-MC demonstrates that this assumption is not safe, advocating for a gating layer preceding any prompt-level or code-editing influence by external recall.

Limitations and Operational Boundaries: All results are based on deterministic, offline, smoke-scale artifacts and proxy labels, not live, large-scale agent deployments. The controller does not address long-term memory management (aging, contradiction, or privacy), multi-turn credit assignment, or adaptive memory writing—these remain open integration points with richer RL-based memory systems (Yu et al., 5 Jan 2026). Over-conservatism in abstention also reveals calibration gaps.

Future Directions

Research should pursue:

Full-scale, causal online LLM-agent evaluations to validate real-world impact on codebases and debuggability.
Improved abstention calibration, balancing safety and utility.
Integration with dynamic, write-path memory management for closed-loop robustness and aging handling.
Joint context-budget and risk-aware optimization, given findings that risk-adjusted utility does not monotonically track with increased memory usage.

Conclusion

RSCB-MC operationalizes a risk-sensitive, abstention-aware control layer for memory use in LLM-based coding agents. By shifting focus from similarity-based injection to selective, safety-prioritized memory use, it addresses a core deficiency in current memory-augmented LLM systems. The reported evidence indicates that explicit abstention is essential to keeping false-positive injection low. Future work should add online evaluation, more precise risk estimation, and tighter integration with adaptive memory management before such controllers can be relied on in deployed agents.

Reference:

"Learning When to Remember: Risk-Sensitive Contextual Bandits for Abstention-Aware Memory Retrieval in LLM-Based Coding Agents" (2604.27283)
