
MemoryGraft: Persistent Memory Injection Attack

Updated 22 December 2025
  • MemoryGraft is a persistent indirect injection attack that implants poisoned procedural experiences into an LLM's long-term memory, exploiting the semantic imitation heuristic.
  • It leverages dual retrieval mechanisms—lexical (BM25) and embedding (FAISS)—to amplify malicious influence, achieving approximately 48% poisoned recall.
  • Countermeasures such as cryptographic provenance attestation and constitutional consistency reranking are proposed to mitigate the risk of persistent behavioral drift.

MemoryGraft is a persistent, indirect injection attack targeting long-term memory and Retrieval-Augmented Generation (RAG) functionalities in LLM agents. Unlike transient prompt injections or factual RAG poisoning, MemoryGraft implants successful but malicious procedural experiences into the agent’s memory. The agent, relying on a semantic imitation heuristic, later retrieves these poisoned experiences and incorporates unsafe behavioral patterns into future actions. This attack leverages a critical, previously unprotected trust boundary between an agent's reasoning module and its own accumulated experience, enabling durable and stealthy behavioral compromise via benign-appearing ingestion artifacts (Srivastava et al., 18 Dec 2025).

1. Mechanisms of MemoryGraft in LLM Agent Architectures

Modern multi-agent systems such as MetaGPT link a core reasoning LLM (e.g., GPT-4o) with a RAG pipeline and a persistent long-term memory store. For any new task query $q$, the agent retrieves up to $k$ past successful experiences using:

  • Lexical BM25 ranking: $\mathrm{Retr}^{\mathrm{lex}}_{\mathcal{M}}(q)$
  • Embedding (FAISS, cosine) ranking: $\mathrm{Retr}^{\mathrm{vec}}_{\mathcal{M}}(q)$

The overall retrieval pool is formed by union:

$$\mathrm{Retr}_{\mathcal{M}}(q) = \mathrm{Retr}^{\mathrm{lex}}_{\mathcal{M}}(q) \cup \mathrm{Retr}^{\mathrm{vec}}_{\mathcal{M}}(q)$$

These experiences are prepended to the input prompt as demonstrations. Once a generated response $R_q$ succeeds, the pair $(q, R_q)$ is appended to the persistent memory $\mathcal{M}$, completing the retrieval–write loop for adaptive agent behavior.
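The union step can be made concrete with a short sketch. This is a minimal illustration, not the paper's implementation: `bm25_index`, `faiss_index`, and `embed` are hypothetical stand-ins for the agent framework's retrieval components.

    import numpy as np

    def union_retrieve(q, memory, bm25_index, faiss_index, embed, k=3):
        # Lexical channel: indices of the top-k entries by BM25 score.
        lex_scores = bm25_index.get_scores(q.split())
        lex_top = np.argsort(-np.asarray(lex_scores))[:k]

        # Embedding channel: top-k nearest entries by cosine similarity.
        _, vec_top = faiss_index.search(embed(q).reshape(1, -1), k)

        # Union of both channels: a poisoned record that ranks highly in
        # either channel reaches the demonstration pool.
        hits = set(int(i) for i in lex_top) | set(int(i) for i in vec_top[0])
        return [memory[i] for i in sorted(hits)]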

MemoryGraft exploits this design by inserting poisoned experience records that reside in semantically central regions of the retrieval manifold. The agent’s subsequent reliance on “trusted” retrieved demonstrations—the semantic imitation heuristic—leads to persistent behavioral drift if these grafted records are recalled (Srivastava et al., 18 Dec 2025).

2. Threat Model, Attacker Capabilities, and Security Objectives

The adversary, denoted $\mathcal{A}_{\mathrm{adv}}$, is restricted to submitting benign-appearing ingestion artifacts (e.g., markdown/README files, code snippets, or notes) during ordinary agent operation. The attacker cannot directly modify the long-term memory store, parameters, or core codebase, nor can they observe other user queries.

The attacker’s principal goals are the following:

  1. Poisoned Retrieval: For a clean query $q^*$, at least one poisoned entry appears in $\mathrm{Retr}_{\mathcal{M}}(q^*)$.
  2. Behavioral Drift: The agent adopts an unsafe pattern $\pi$ (e.g., skipped data validation) in its output for a benign query.
  3. Persistence: Malicious influence endures across agent sessions, requiring no sustained attack after initial poisoning (Srivastava et al., 18 Dec 2025).
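The first two objectives can be read as checkable predicates. The following is an illustrative sketch only; `retrieve`, `extract_patterns`, and the provenance tags are hypothetical, not constructs from the paper.

    def poisoned_retrieval(q_star, retrieve, poisoned_ids):
        # Goal 1: at least one poisoned entry surfaces for a clean query.
        return any(e["id"] in poisoned_ids for e in retrieve(q_star))

    def behavioral_drift(output, unsafe_patterns, extract_patterns):
        # Goal 2: output for a benign query exhibits an unsafe pattern pi.
        return bool(extract_patterns(output) & unsafe_patterns)

    # Goal 3 (persistence) is temporal: both predicates keep holding across
    # fresh sessions with no further attacker input after initial poisoning.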

3. Attack Pipeline: Construction, Deployment, and Activation

The MemoryGraft attack pipeline consists of the following stages:

  • Seed Construction: The adversary assembles two experience sets:
    • $S_{\mathrm{benign}}$: a large set of standard procedural pairs (e.g., $n_b = 100$),
    • $S_{\mathrm{poison}}$: a small set of malicious procedural pairs ($n_p = 10$), encoding unsafe behaviors but framed as validated, safe practices.
  • Ingestion Artifact Creation: The attacker produces an innocuous markdown note $\mathcal{N}$ (example: rag_poisoned_notes.md) containing a Python code block that merges $S_{\mathrm{benign}}$ and $S_{\mathrm{poison}}$ into a new long-term memory index using both BM25 and FAISS. This ingestion artifact is processed by the agent during routine operation:
    # Payload embedded in the markdown note: merge benign and poisoned
    # experience pairs into one store, then index it with both retrieval
    # back ends so a grafted entry can surface through either channel.
    # BM25, FAISS, and save_index denote the agent framework's own helpers.
    def build_store(S_benign, S_poison):
        M = []
        for (q, R) in S_benign + S_poison:
            # Poisoned pairs are indistinguishable from benign ones here.
            M.append({"query": q, "trace": R})
        bm25 = BM25(M)   # lexical (BM25) index
        vec = FAISS(M)   # embedding (cosine) index
        save_index(bm25, "results/rag_poison_store/bm25")
        save_index(vec,  "results/rag_poison_store/faiss")
        return M

    build_store(S_benign, S_poison)
  • Poisoning Phase: The agent, upon reading $\mathcal{N}$, executes this logic, populating the persistent memory store with $n_b + n_p$ experiences, now including the malicious seed set.
  • Union Retrieval at Inference: For each new query $q$, union retrieval (top $k$ results from each of BM25 and FAISS) surfaces poisoned records. These records, designed for high semantic and lexical similarity, are likely to be recalled even at small $n_p$.
  • Semantic Imitation and Behavioral Compromise: On retrieving $S_{\mathrm{poison}}$ instances, the agent mimics unsafe procedures, effecting persistent behavioral drift (Srivastava et al., 18 Dec 2025); a sketch of this activation loop follows below.
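The activation loop can be sketched under the same assumptions as the retrieval sketch in Section 1; `llm`, `is_success`, and `union_retrieve` are hypothetical stand-ins for the agent's actual components.

    def answer(q, memory, union_retrieve, llm, is_success):
        # Retrieved experiences (possibly grafted) become in-context demos.
        demos = union_retrieve(q, memory)
        prompt = "\n\n".join(f"Q: {d['query']}\nA: {d['trace']}" for d in demos)
        prompt += f"\n\nQ: {q}\nA:"

        # Semantic imitation: the model mirrors the retrieved procedures.
        R_q = llm(prompt)

        # Successful (but drifted) outputs are written back, so the poison
        # compounds without any further attacker involvement.
        if is_success(R_q):
            memory.append({"query": q, "trace": R_q})
        return R_q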

4. Empirical Evaluation and Quantitative Results

The attack was validated on MetaGPT’s DataInterpreter agent using GPT-4o as the backbone, with retrieval $k=3$ for both BM25 and FAISS, and 110 total experience seeds (100 benign, 10 poisoned). Across 12 benign evaluation queries, key figures include:

| Metric | Value |
| --- | --- |
| Total Retrievals ($T_{\mathrm{tot}}$) | 48 |
| Poisoned Hits ($P_{\mathrm{tot}}$) | 23 |
| Poisoned Retrieval Proportion (PRP) | $\approx 47.9\%$ |

The poisoned experience pool ($n_p = 10$) constituted less than 10% of the memory, yet accounted for nearly 48% of all retrievals in subsequent agent workloads.
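As a quick sanity check on these figures, PRP is simply the fraction of retrieval slots occupied by poisoned entries, computed here from the paper's reported counts:

    T_tot, P_tot = 48, 23    # total retrievals vs. poisoned hits
    n_b, n_p = 100, 10       # benign vs. poisoned seeds
    print(f"poison share: {n_p / (n_b + n_p):.1%}")  # -> 9.1% of memory
    print(f"PRP:          {P_tot / T_tot:.1%}")      # -> 47.9% of retrievals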

Behavioral drift was directly observed: the agent, when prompted with innocuous data-analysis queries, generated outputs exhibiting the poisoned shortcuts (e.g., skipped validations, insecure shell commands) without explicit adversarial prompts. Retrieval threshold ablations indicated PRP scales with $k$; increasing $k$ from 1 to 5 raised PRP from approximately 32% to 55%. Both pure lexical and pure embedding retrieval strategies admitted significant poisoned recall (36–40%), with the union approach peaking at 50% (Srivastava et al., 18 Dec 2025).

5. Analysis: Amplification Mechanisms and Heuristic Vulnerabilities

Despite the minority of seeded poison ($n_p \ll n_b$), union retrieval and dual-channel ranking severely amplify the attack. Poisoned instances, crafted to be semantically and lexically central, are likely to surface via either BM25 or FAISS; a high rank in a single channel suffices for inclusion in the prompt.

The retrieval score can be formalized as:

$$s(d, q) = \alpha \cdot s_{\text{lex}}(d, q) + (1-\alpha) \cdot s_{\text{embed}}(d, q)$$

with both components maximized for the poisoned seeds. This reveals that existing retrieval logic, in the absence of provenance and risk checks, exposes LLM agents to strategic memory grafting attacks.
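A minimal sketch of ranking under this mixed score; `s_lex` and `s_embed` are assumed to be normalized per-channel scorers, and `alpha` is an illustrative mixing weight rather than a value from the paper.

    def mixed_score(d, q, s_lex, s_embed, alpha=0.5):
        # s(d, q) = alpha * s_lex + (1 - alpha) * s_embed
        return alpha * s_lex(d, q) + (1 - alpha) * s_embed(d, q)

    def rank_top_k(memory, q, s_lex, s_embed, k=3, alpha=0.5):
        # A grafted record crafted to score highly on both channels stays
        # near the top of the ranking for any mixing weight alpha.
        return sorted(memory,
                      key=lambda d: -mixed_score(d, q, s_lex, s_embed, alpha))[:k]

Because the poisoned seeds maximize both terms, no choice of $\alpha$ alone removes them from the top-$k$.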

Union retrieval is the principal amplifier: even at low $k$, poisoned recall far exceeds the tainted fraction of records, and increasing $k$ or using even partially mixed similarity metrics raises it further (Srivastava et al., 18 Dec 2025).

6. Proposed Defenses, Limitations, and Future Challenges

Two primary countermeasures are outlined:

  1. Cryptographic Provenance Attestation: The agent signs every legitimate memory entry with a private key $K_{\text{priv}}$, storing $(q, R_q, \sigma)$ with signature $\sigma = \mathrm{Sign}(H(q \Vert R_q), K_{\text{priv}})$. Verification with the public key $K_{\text{pub}}$ on retrieval blocks untrusted insertions (a sketch of both defenses follows below).
  2. Constitutional Consistency Reranking: A compact internal “safety constitution” $\mathcal{C}$ enables risk scoring and suppression:

$$S(q, q_i) = \cos(e_q, e_{q_i}) - \beta \cdot \mathcal{L}_{\text{risk}}(R_i \mid \mathcal{C})$$

Rejecting recalled experiences with $\mathcal{L}_{\text{risk}} > \tau$ suppresses high-risk behavioral patterns.
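Both defenses admit a compact sketch. The signing path below uses Ed25519 from the `cryptography` library as one plausible instantiation; `cos_sim` and `risk_score` are hypothetical stand-ins, since the paper leaves the constitutional scorer $\mathcal{L}_{\text{risk}}$ model-defined.

    import hashlib
    from cryptography.exceptions import InvalidSignature
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

    priv = Ed25519PrivateKey.generate()   # K_priv, held by the agent runtime
    pub = priv.public_key()               # K_pub, used at retrieval time

    def sign_entry(q, R):
        digest = hashlib.sha256(f"{q}\x1f{R}".encode()).digest()   # H(q || R_q)
        return {"query": q, "trace": R, "sig": priv.sign(digest)}  # (q, R_q, sigma)

    def verified(entry):
        digest = hashlib.sha256(
            f"{entry['query']}\x1f{entry['trace']}".encode()).digest()
        try:
            pub.verify(entry["sig"], digest)
            return True
        except InvalidSignature:
            return False   # grafted entries lack a valid signature

    def rerank(candidates, cos_sim, risk_score, beta=1.0, tau=0.5):
        # Constitutional consistency reranking: drop unattested or high-risk
        # entries, then order survivors by S = cos - beta * risk.
        kept = [c for c in candidates if verified(c) and risk_score(c) <= tau]
        return sorted(kept, key=lambda c: -(cos_sim(c) - beta * risk_score(c)))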

Limitations include the semi-white-box experimental context; fully black-box agent deployments complicate trigger prediction. The evaluation covered a single agent, and multi-agent workflows might modulate poisoning dynamics. Comprehensive metrics for downstream systemic severity remain undeveloped, and cryptographically robust key management for decentralized open-source frameworks remains an open engineering problem (Srivastava et al., 18 Dec 2025).

In conclusion, MemoryGraft exposes a previously unrecognized, high-impact attack surface in experience-augmented LLM agents. By exploiting the trust boundary between the reasoning core and the agent's accumulated experience, and by leveraging the union retrieval heuristic, minimal poisoned seeding yields persistent, stealthy compromise unless counteracted by cryptographic or constitutional screening.
