
Structured Auditable Natural-Language Memory

Updated 4 February 2026
  • Structured Auditable Natural-Language Memory is an architectural paradigm combining explicit, queryable memory with detailed audit logs for traceability.
  • It integrates fact extraction, canonicalization, and semantic retrieval to mitigate hallucinations and support robust evidence tracking.
  • The paradigm supports safe, documented updates and employs layered evaluation metrics that verify efficiency and self-correcting behavior in AI models.

Structured Auditable Natural-Language Memory refers to architectures and methodologies that enable LLMs and agents to store, retrieve, and update factual, procedural, and episodic information in formats that are both structured (machine-parseable, queryable) and fully auditable (recording provenance, operations, and supporting traceability). These systems are designed to address the limitations of flat token-based models, mitigate hallucination, support evidence tracking, and enable safe, efficient, and transparent updates across diverse application domains.

1. Conceptual Foundations and Taxonomy

Structured auditable natural-language memory (SANLM) systems support explicit, externally-addressable memories, augmenting or replacing traditional parametric (weight-based) LLM memory. Under a unified definition, any persistent state S written during pretraining, fine-tuning, or inference that can later be addressed and stably influence outputs is considered "memory" (Zhang et al., 23 Sep 2025).

The "memory quadruple" formalizes memory instances along four axes: location (ℓ\ell), persistence (τ\tau), write/access path (π\pi), and controllability (κ\kappa):

| Type | Location (ℓ) | Persistence (τ) | Write/Access Path & Controllability |
| --- | --- | --- | --- |
| Parametric | Model parameters (FFNs) | Permanent | Gradients; low controllability |
| Contextual | KV cache (model activations) | Single inference turn | Self-attention; read-only |
| External | Store (e.g., vector DB, RDF graph) | Updatable | Retrieval fusion; high controllability |
| Procedural/Episodic | Event/timeline DB (JSON, SQL) | Multi-session | Write-to-log, replay; medium controllability |
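
Concretely, the quadruple can be carried as a typed record. The following is a minimal Python sketch; the class, enum, and field names are illustrative and not drawn from the cited paper:

```python
from dataclasses import dataclass
from enum import Enum

class Persistence(Enum):
    PERMANENT = "permanent"          # parametric memory
    SINGLE_TURN = "single_turn"      # contextual KV cache
    UPDATABLE = "updatable"          # external store
    MULTI_SESSION = "multi_session"  # procedural/episodic log

@dataclass(frozen=True)
class MemoryQuadruple:
    location: str          # e.g. "ffn_weights", "kv_cache", "vector_db", "event_log"
    persistence: Persistence
    access_path: str       # e.g. "gradients", "self_attention", "retrieval_fusion"
    controllability: str   # "low" | "read_only" | "high" | "medium"

external = MemoryQuadruple("vector_db", Persistence.UPDATABLE, "retrieval_fusion", "high")
```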

Fully auditable memory requires instrumentation of every write, read, update, and inhibition event in versioned, machine-readable logs, with schema including action type, timestamp, memory IDs, provenance, and evaluation metrics (Zhang et al., 23 Sep 2025).
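
One such log record could be written as follows; this is a minimal JSON-lines sketch with assumed field names, not the schema of any cited system:

```python
import json, time, uuid

def append_audit_event(log_path, action, memory_ids, provenance, metrics=None):
    """Append one versioned, machine-readable audit record (write/read/update/inhibit)."""
    record = {
        "event_id": str(uuid.uuid4()),
        "action": action,            # "write" | "read" | "update" | "inhibit"
        "timestamp": time.time(),
        "memory_ids": memory_ids,
        "provenance": provenance,    # e.g. {"doc_id": ..., "char_offset": ...}
        "metrics": metrics or {},
    }
    with open(log_path, "a") as f:   # append-only: records are never rewritten
        f.write(json.dumps(record) + "\n")
    return record["event_id"]
```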

2. Structured Fact Extraction and Canonicalization

Systems such as SMART rely on hierarchical, syntax-aware extraction of canonical facts from technical documents. The extractor ("Grammarian") employs a Tree-LSTM to parse sentences or table rows, producing embeddings of subject, relation, and object spans. These are mapped to 128-dimensional vectors and concatenated:

m = [v_s \mid v_r \mid v_o] \in \mathbb{R}^{384}

where each v_x = \mathrm{GELU}(W_x\,h_{\mathrm{span}_x} + b_x) and h_{\mathrm{span}_x} is the span embedding from the Tree-LSTM. These fact vectors, together with precise provenance metadata (e.g., docID, passageID, charOffset), constitute the atomic units in the structured memory matrix M \in \mathbb{R}^{N \times 384} (Dudeja et al., 24 Dec 2025).
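
The projection step can be sketched in a few lines of PyTorch. The Tree-LSTM encoder is omitted, and the span embedding size (here 256) is an assumption; only the 128-dimensional per-span projections and the 384-dimensional concatenation follow the formulas above:

```python
import torch
import torch.nn as nn

class FactVectorHead(nn.Module):
    """Project Tree-LSTM span embeddings for (subject, relation, object) to a 384-d fact vector."""
    def __init__(self, span_dim: int = 256):  # span_dim is an assumed Tree-LSTM hidden size
        super().__init__()
        self.proj_s = nn.Linear(span_dim, 128)
        self.proj_r = nn.Linear(span_dim, 128)
        self.proj_o = nn.Linear(span_dim, 128)
        self.act = nn.GELU()

    def forward(self, h_s, h_r, h_o):
        v_s = self.act(self.proj_s(h_s))   # v_x = GELU(W_x h_span_x + b_x)
        v_r = self.act(self.proj_r(h_r))
        v_o = self.act(self.proj_o(h_o))
        return torch.cat([v_s, v_r, v_o], dim=-1)  # m = [v_s | v_r | v_o], 384-d

head = FactVectorHead()
m = head(torch.randn(1, 256), torch.randn(1, 256), torch.randn(1, 256))
assert m.shape == (1, 384)
```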

Alternative architectures extract and normalize facts as RDF triples (s, p, o), where entities and predicates are canonicalized via unique URIs and lexical resources. All memory operations—INSERT, DELETE, UPDATE, QUERY—are logged in append-only tables, indexed for reconstructable version history (Saha, 7 Jul 2025).
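
A minimal sketch of such an append-only operation log, assuming a SQLite backing store with illustrative table names (not the cited system's schema):

```python
import sqlite3, time

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE triples (id INTEGER PRIMARY KEY, s TEXT, p TEXT, o TEXT, alive INTEGER DEFAULT 1);
CREATE TABLE op_log (id INTEGER PRIMARY KEY, op TEXT, triple_id INTEGER, ts REAL);
""")

def insert_triple(s, p, o):
    """INSERT a canonicalized (s, p, o) triple and log the operation append-only."""
    cur = conn.execute("INSERT INTO triples (s, p, o) VALUES (?, ?, ?)", (s, p, o))
    conn.execute("INSERT INTO op_log (op, triple_id, ts) VALUES ('INSERT', ?, ?)",
                 (cur.lastrowid, time.time()))
    return cur.lastrowid

def delete_triple(triple_id):
    """Soft DELETE: flip a status flag so version history stays reconstructable."""
    conn.execute("UPDATE triples SET alive = 0 WHERE id = ?", (triple_id,))
    conn.execute("INSERT INTO op_log (op, triple_id, ts) VALUES ('DELETE', ?, ?)",
                 (triple_id, time.time()))

tid = insert_triple("urn:pump-7", "urn:hasMaxPressure", "350 bar")
```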

3. Memory Representation, Indexing, and Operations

Structured memory is instantiated as a high-dimensional vector store or graph database to support rapid retrieval, update, and merges. Key primitives (as formalized in Text2Mem (Wang et al., 14 Sep 2025)) include:

  • Encode: Store a new memory (raw text, URL, or structured object) with optional embedding.
  • Retrieve: Select items by IDs, tags, or semantic similarity.
  • Delete: Remove items, with "hard" (irreversible) or "soft" (status-flag) modes; lock invariants prevent hard deletion of locked entries.
  • Merge, Split, Promote, Demote: Manipulate memory structures, preserving lineage and enforcing schema invariants. For example:

\forall i \in \mathrm{scope}(\text{Delete}): \mathrm{locked}(i) \implies \text{mode} \neq \text{"hard"}

\text{Merge}: |\mathrm{scope}| \geq 2 \wedge \text{lineage\_preserved}

  • Lock, Expire: Restrict or automatically retire memory items based on explicit conditions (e.g., expiry TTL, read/write lock status).

All operations must pass schema validation and cross-field invariant checks prior to execution, producing a unified execution contract and a log of side effects for auditability (Wang et al., 14 Sep 2025).
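
The Delete invariant above can be made concrete with a toy in-memory store; this is an illustrative sketch, not the Text2Mem reference implementation:

```python
class MemoryStore:
    """Toy in-memory store illustrating soft/hard Delete with the lock invariant."""
    def __init__(self):
        self.items = {}   # id -> {"text": ..., "locked": bool, "status": "active"|"deleted"}
        self.log = []

    def delete(self, ids, mode="soft"):
        # Invariant check before execution: locked(i) implies mode != "hard" for all i in scope.
        for i in ids:
            if self.items[i]["locked"] and mode == "hard":
                raise ValueError(f"invariant violated: {i} is locked, hard delete refused")
        for i in ids:
            if mode == "hard":
                del self.items[i]                      # irreversible removal
            else:
                self.items[i]["status"] = "deleted"    # reversible status flag
            self.log.append({"op": "Delete", "id": i, "mode": mode})

store = MemoryStore()
store.items["m1"] = {"text": "pump spec", "locked": True, "status": "active"}
store.delete(["m1"])                  # soft delete succeeds
# store.delete(["m1"], mode="hard")   # would raise: lock invariant blocks hard deletion
```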

4. Auditing, Provenance, and Self-Correction

Auditability is enforced through systematic provenance tracking at all stages:

  • Every memory record incorporates detailed metadata such as source identifiers, version numbers, previous IDs, and fine-grained provenance (file, line, prompt) (Ganguli et al., 8 May 2025, Dudeja et al., 24 Dec 2025).
  • Every memory operation appends structured logs encompassing operation type, affected memory IDs, timestamp, and user/agent identity (Saha, 7 Jul 2025, Ganguli et al., 8 May 2025).
  • Correction layers—such as knowledge-aware self-correction—detect inconsistencies between LLM outputs and memory graphs (e.g., RDF stores), automatically substituting correct canonical values where low-confidence errors are detected. All corrections and their provenance (triple ID, timestamp, correction event) are audited (Saha, 7 Jul 2025).

SMART and similar architectures append provenance tables to generated answers, enabling claims to be traced directly to document offsets. Versioned RDF stores and event log replay enable full recoverability of memory state at any past time, supporting external audit and forensic analysis (Dudeja et al., 24 Dec 2025, Saha, 7 Jul 2025).
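
Event-log replay can be sketched as a fold over time-ordered operations; the log format below is illustrative and reuses the INSERT/UPDATE/DELETE vocabulary from Section 2:

```python
def replay_state(op_log, until_ts):
    """Rebuild the memory state as of `until_ts` from an append-only operation log."""
    state = {}
    for event in op_log:                       # events are assumed time-ordered
        if event["ts"] > until_ts:
            break
        if event["op"] in ("INSERT", "UPDATE"):
            state[event["id"]] = event["value"]
        elif event["op"] == "DELETE":
            state.pop(event["id"], None)
    return state

log = [
    {"op": "INSERT", "id": "t1", "value": ("pump-7", "maxPressure", "350 bar"), "ts": 1.0},
    {"op": "UPDATE", "id": "t1", "value": ("pump-7", "maxPressure", "360 bar"), "ts": 2.0},
]
assert replay_state(log, until_ts=1.5)["t1"][2] == "350 bar"  # state at a past time
```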

5. Retrieval, Fusion, and Answer Generation

Upon query, memory systems perform retrieval using semantic similarity (e.g., cosine distance in embedding space) or graph-based selection (e.g., SPARQL or Cypher queries). Retrieved fact vectors or episodic fragments are fused in downstream model heads (e.g., 6-layer Transformer with gated multihead attention in SMART):

H^{(l)}_{\text{fused}} = g^{(l)} H^{(l)}_{\text{self}} + (1 - g^{(l)})\,[\mathbf{1} \otimes c_{\mathrm{mem}}]

where g^{(l)} is a learnable gate and c_{\mathrm{mem}} is the aggregated retrieved memory context (Dudeja et al., 24 Dec 2025). In narrative-endowed memory systems, episodic retrieval is conducted via LLM prompts selecting among plot headlines, while semantic retrieval is mediated through graph queries (Zhou et al., 9 Jan 2026).
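
A minimal PyTorch sketch of cosine-similarity retrieval followed by gated fusion; the scalar per-layer gate and the shapes are assumptions, and the aggregation of retrieved rows into c_mem is simplified to a mean:

```python
import torch
import torch.nn.functional as F

def retrieve(query_vec, memory_matrix, k=4):
    """Top-k memory rows by cosine similarity (memory_matrix is the N x 384 fact matrix)."""
    sims = F.cosine_similarity(query_vec.unsqueeze(0), memory_matrix, dim=-1)
    topk = sims.topk(k)
    return memory_matrix[topk.indices], topk.values

def gated_fuse(h_self, c_mem, gate_logit):
    """H_fused = g * H_self + (1 - g) * (1 ⊗ c_mem), broadcasting c_mem over positions."""
    g = torch.sigmoid(gate_logit)                       # learnable gate, here a scalar
    return g * h_self + (1 - g) * c_mem.expand_as(h_self)

M = torch.randn(1000, 384)                              # structured memory matrix
mem_rows, _ = retrieve(torch.randn(384), M)
c_mem = mem_rows.mean(dim=0)                            # aggregated memory context
h_fused = gated_fuse(torch.randn(16, 384), c_mem, torch.tensor(0.0))
```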

Self-correction layers (e.g., (Saha, 7 Jul 2025)) operate in post-processing, extracting triples from outputs, checking for inconsistencies, and applying targeted corrections. Each correction is logged with original and corrected values and provenance information.
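
A hedged sketch of this post-processing loop against a canonical store; triple extraction from the model output is assumed to have already happened, and all names are illustrative:

```python
def self_correct(answer_triples, store, confidence, threshold=0.5, audit_log=None):
    """Replace low-confidence values that contradict the canonical store, logging each fix."""
    corrections = []
    for (s, p, o) in answer_triples:
        canonical = store.get((s, p))        # canonical object for this subject/predicate
        if (canonical is not None and canonical != o
                and confidence.get((s, p, o), 1.0) < threshold):
            corrections.append({"key": (s, p), "from": o, "to": canonical})
            if audit_log is not None:        # every correction is audited with provenance
                audit_log.append({"op": "CORRECT", "key": (s, p), "old": o, "new": canonical})
    return corrections

store = {("pump-7", "maxPressure"): "350 bar"}
fixes = self_correct([("pump-7", "maxPressure", "370 bar")], store,
                     confidence={("pump-7", "maxPressure", "370 bar"): 0.3})
assert fixes[0]["to"] == "350 bar"
```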

6. Evaluation Metrics and Performance

Evaluation of SANLM systems is protocol-driven and layered by memory type (Zhang et al., 23 Sep 2025); the external-layer retrieval metrics are sketched in code after the list:

  • Parametric: Closed-book recall, edit differential, privacy risk.
  • Contextual: Position-curves, mid-span performance drop.
  • External/Procedural: Recall@k, nDCG@k, FActScore (fraction of supported claims), citation precision/recall, cross-session consistency.
  • Timeliness: Freshness-hit rates, outdated answer rates.
  • Uncertainty: Inter-rater agreement coefficients, risk-coverage curves.
  • Operational: Execution time, latency, schema/command fidelity.
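
Recall@k and nDCG@k can be computed with their standard definitions; this sketch is generic, not code from the cited evaluation protocol:

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant items that appear in the top-k ranking."""
    hits = sum(1 for i in ranked_ids[:k] if i in relevant_ids)
    return hits / len(relevant_ids) if relevant_ids else 0.0

def ndcg_at_k(ranked_ids, relevance, k):
    """nDCG@k with graded relevance (dict: item id -> gain)."""
    dcg = sum(relevance.get(i, 0.0) / math.log2(rank + 2)
              for rank, i in enumerate(ranked_ids[:k]))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

ranked = ["m3", "m1", "m7", "m2"]
print(recall_at_k(ranked, {"m1", "m2"}, k=3))          # 0.5
print(ndcg_at_k(ranked, {"m1": 1.0, "m2": 1.0}, k=3))  # position-discounted score
```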

SMART, using only 45.51M parameters (64% fewer than GPT-2), achieves 21.3% higher question-answering accuracy than GPT-2 on engineering manuals, with marked reductions in unsupported assertions due to explicit memory lookup (Dudeja et al., 24 Dec 2025). Episodic architectures supported by narrative grouping and momentum-based consolidation achieve higher coverage and lower latency than embedding-only retrievers (e.g., >90% memory coverage at k = 4 narratives, a 50% latency reduction, and superior J-scores on the LOCOMO benchmark (Zhou et al., 9 Jan 2026)).

7. Governance, Versioning, and Safe Updates

Change management and governance frameworks are tightly integrated into auditable memory architectures. The DMM-Gov framework (Zhang et al., 23 Sep 2025) enforces:

  • Admission thresholds for updates and edits (Edit Success Rate, Locality, Drawdown).
  • Gray rollout with live monitoring.
  • Automatic rollback if live metrics breach safety or fidelity thresholds.
  • Issuance of audit certificates capturing the version ID, evidence set, update path, and rollback instructions.

All reads, writes, and updates—including re-indexing, PEFT, DAPT, and model-editing—are logged with full context, enabling post-hoc validation, rollback, and provenance reconstruction.
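
The admission-then-rollback flow can be sketched as a single gate function; the threshold names follow the list above, but the orchestration is an assumption rather than DMM-Gov reference code:

```python
def govern_update(candidate_metrics, live_metrics_stream, thresholds, apply, rollback):
    """Admit an update only if offline metrics pass, then watch live metrics and roll back on breach."""
    # Admission gate: every offline metric (e.g. edit success rate, locality) meets its threshold.
    for name, minimum in thresholds["admission"].items():
        if candidate_metrics[name] < minimum:
            return {"admitted": False, "reason": f"{name} below {minimum}"}
    version_id = apply()                                   # gray rollout of the new version
    for live in live_metrics_stream:                       # live monitoring window
        if live["fidelity"] < thresholds["live_fidelity"]:
            rollback(version_id)                           # automatic rollback on breach
            return {"admitted": True, "rolled_back": True, "version": version_id}
    # An audit certificate would capture version_id, evidence set, update path, rollback plan.
    return {"admitted": True, "rolled_back": False, "version": version_id}
```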


In summary, structured auditable natural-language memory systems represent an architectural paradigm enabling LLMs to ground outputs in explicit, traceable, and updatable memory substrates. They support robust fact extraction, canonicalization, memory-oriented reasoning, dynamic correction, operational transparency, and empirical reproducibility across diverse deployment contexts (Dudeja et al., 24 Dec 2025, Zhang et al., 23 Sep 2025, Wang et al., 14 Sep 2025, Ganguli et al., 8 May 2025, Saha, 7 Jul 2025, Gonzalez et al., 1 Nov 2025, Zhou et al., 9 Jan 2026).
