Structured Auditable Natural-Language Memory
- Structured Auditable Natural-Language Memory is an architectural paradigm combining explicit, queryable memory with detailed audit logs for traceability.
- It integrates fact extraction, canonicalization, and semantic retrieval to mitigate hallucinations and support robust evidence tracking.
- The system enables safe, documented updates and employs rigorous evaluation metrics to ensure efficiency and self-correcting behavior in AI models.
Structured Auditable Natural-Language Memory refers to architectures and methodologies that enable LLMs and agents to store, retrieve, and update factual, procedural, and episodic information in formats that are both structured (machine-parseable, queryable) and fully auditable (recording provenance, operations, and supporting traceability). These systems are designed to address the limitations of flat token-based models, mitigate hallucination, support evidence tracking, and enable safe, efficient, and transparent updates across diverse application domains.
1. Conceptual Foundations and Taxonomy
Structured auditable natural-language memory (SANLM) systems support explicit, externally-addressable memories, augmenting or replacing traditional parametric (weight-based) LLM memory. Under a unified definition, any persistent state written during pretraining, fine-tuning, or inference that can later be addressed and stably influence outputs is considered "memory" (Zhang et al., 23 Sep 2025).
The "memory quadruple" formalizes memory instances along four axes: location, persistence, write/access path, and controllability:
| Type | Location | Persistence | Write/Access Path & Controllability |
|---|---|---|---|
| Parametric | Model parameters (FFNs) | Permanent | Gradients, low-controllability |
| Contextual | KV cache (model activations) | Single inference turn | Self-attention, read-only |
| External | Store (e.g. vector DB, RDF graph) | Updatable | Retrieval fusion, high controllability |
| Procedural/Episodic | Event/timeline DB (JSON, SQL) | Multi-session | Write-to-log, replay, medium controllability |
Fully auditable memory requires instrumentation of every write, read, update, and inhibition event in versioned, machine-readable logs, with schema including action type, timestamp, memory IDs, provenance, and evaluation metrics (Zhang et al., 23 Sep 2025).
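The log schema described above can be sketched as a minimal append-only JSON Lines writer. This is an illustrative design, not the schema of any cited system; the field names (`action`, `memory_ids`, `provenance`, `metrics`) simply mirror the requirements listed in the paragraph.

```python
import json
import time
import uuid
from dataclasses import dataclass, field, asdict

@dataclass
class AuditEvent:
    """One machine-readable audit record (illustrative schema)."""
    action: str                  # "write" | "read" | "update" | "inhibit"
    memory_ids: list
    provenance: dict             # e.g. {"doc_id": ..., "char_offset": ...}
    metrics: dict = field(default_factory=dict)
    timestamp: float = field(default_factory=time.time)
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))

def append_event(log_path: str, event: AuditEvent) -> None:
    """Append one versioned log record; the log is never rewritten in place."""
    with open(log_path, "a") as f:
        f.write(json.dumps(asdict(event)) + "\n")

append_event("audit.jsonl", AuditEvent(
    action="write",
    memory_ids=["mem-001"],
    provenance={"doc_id": "manual-7", "passage_id": 12, "char_offset": 440},
))
```

Because records are only ever appended, replaying the file in order reconstructs every intermediate memory state.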
2. Structured Fact Extraction and Canonicalization
Systems such as SMART rely on hierarchical, syntax-aware extraction of canonical facts from technical documents. The extractor ("Grammarian") employs a Tree-LSTM to parse sentences or table rows, producing embeddings of subject, relation, and object spans. Each span is mapped to a 128-dimensional vector, and the three are concatenated into a single fact vector, $f = [e_s; e_r; e_o] \in \mathbb{R}^{384}$, where $e_s$, $e_r$, and $e_o$ are the span embeddings from the Tree-LSTM. These fact vectors, together with precise provenance metadata (e.g., docID, passageID, charOffset), constitute the atomic units in the structured memory matrix (Dudeja et al., 24 Dec 2025).
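The concatenation step can be sketched in a few lines. The Tree-LSTM itself is out of scope here, so random vectors stand in for the span embeddings; the provenance keys mirror those named above.

```python
import numpy as np

def fact_vector(e_s, e_r, e_o):
    """Concatenate subject/relation/object span embeddings (128-d each)
    into one 384-dimensional fact vector."""
    assert all(e.shape == (128,) for e in (e_s, e_r, e_o))
    return np.concatenate([e_s, e_r, e_o])

# Toy stand-ins for Tree-LSTM span embeddings.
rng = np.random.default_rng(0)
e_s, e_r, e_o = (rng.normal(size=128) for _ in range(3))
f = fact_vector(e_s, e_r, e_o)

# The atomic memory unit pairs the vector with its provenance metadata.
record = {
    "vector": f,
    "provenance": {"docID": "pump-manual", "passageID": 3, "charOffset": 210},
}
```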
Alternative architectures extract and normalize facts as RDF triples (subject, predicate, object), where entities and predicates are canonicalized via unique URIs and lexical resources. All memory operations—INSERT, DELETE, UPDATE, QUERY—are logged in append-only tables, indexed for reconstructable version history (Saha, 7 Jul 2025).
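A toy version of this scheme fits in a single SQLite table. The table layout, URI bases, and canonicalization rule below are illustrative assumptions, not the cited system's actual schema; the point is that the log is append-only and ordered, so version history is reconstructable by replay.

```python
import sqlite3
import time

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE op_log (
    op_id INTEGER PRIMARY KEY AUTOINCREMENT,
    op TEXT, s TEXT, p TEXT, o TEXT, ts REAL)""")

def canonical(uri_base: str, label: str) -> str:
    """Map a surface form to a unique URI (toy canonicalization)."""
    return uri_base + label.strip().lower().replace(" ", "_")

def log_op(op: str, s: str, p: str, o: str) -> None:
    """Append an operation record; existing rows are never modified."""
    db.execute("INSERT INTO op_log (op, s, p, o, ts) VALUES (?, ?, ?, ?, ?)",
               (op, s, p, o, time.time()))

E, R = "http://ex.org/entity/", "http://ex.org/rel/"
log_op("INSERT",
       canonical(E, "Ada Lovelace"),
       canonical(R, "born in"),
       canonical(E, "London"))

# Replaying op_log in op_id order reconstructs any past memory state.
rows = db.execute("SELECT op, s FROM op_log ORDER BY op_id").fetchall()
```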
3. Memory Representation, Indexing, and Operations
Structured memory is instantiated as a high-dimensional vector store or graph database to support rapid retrieval, update, and merges. Key primitives (as formalized in Text2Mem (Wang et al., 14 Sep 2025)) include:
- Encode: Store a new memory (raw text, URL, or structured object) with optional embedding.
- Retrieve: Select items by IDs, tags, or semantic similarity.
- Delete: Remove items, with "hard" (irreversible) or "soft" (status-flag) modes; lock invariants prevent hard deletion of locked entries.
- Merge, Split, Promote, Demote: Manipulate memory structures, preserving lineage and enforcing schema invariants.
- Lock, Expire: Restrict or automatically retire memory items based on explicit conditions (e.g., expiry TTL, read/write lock status).
All operations must pass schema validation and cross-field invariant checks prior to execution, producing a unified execution contract and a log of side effects for auditability (Wang et al., 14 Sep 2025).
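The pre-execution checks can be sketched as a validator that returns a list of violations, loosely in the spirit of Text2Mem's unified execution contract. The schema table, field names, and the locked-item invariant below are illustrative assumptions.

```python
# Minimal per-operation schemas (illustrative).
SCHEMAS = {
    "Delete": {"required": {"ids", "mode"}},
    "Encode": {"required": {"payload"}},
}

def validate(op: dict, store: dict) -> list:
    """Return a list of violations; an empty list means the op may execute."""
    errors = []
    schema = SCHEMAS.get(op.get("type"))
    if schema is None:
        return [f"unknown op type: {op.get('type')}"]
    missing = schema["required"] - op.keys()
    if missing:
        errors.append(f"missing fields: {sorted(missing)}")
    # Cross-field invariant: locked items cannot be hard-deleted.
    if op.get("type") == "Delete" and op.get("mode") == "hard":
        locked = [i for i in op.get("ids", [])
                  if store.get(i, {}).get("locked")]
        if locked:
            errors.append(f"hard delete forbidden on locked items: {locked}")
    return errors

store = {"m1": {"locked": True}, "m2": {"locked": False}}
errs = validate({"type": "Delete", "mode": "hard", "ids": ["m1", "m2"]}, store)
```

An executor would run the operation only when `errs` is empty, and in either case append the outcome to the audit log as a side-effect record.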
4. Auditing, Provenance, and Self-Correction
Auditability is enforced through systematic provenance tracking at all stages:
- Every memory record incorporates detailed metadata such as source identifiers, version numbers, previous IDs, and fine-grained provenance (file, line, prompt) (Ganguli et al., 8 May 2025, Dudeja et al., 24 Dec 2025).
- Every memory operation appends structured logs encompassing operation type, affected memory IDs, timestamp, and user/agent identity (Saha, 7 Jul 2025, Ganguli et al., 8 May 2025).
- Correction layers—such as knowledge-aware self-correction—detect inconsistencies between LLM outputs and memory graphs (e.g., RDF stores), automatically substituting correct canonical values where low-confidence errors are detected. All corrections and their provenance (triple ID, timestamp, correction event) are audited (Saha, 7 Jul 2025).
SMART and similar architectures append provenance tables to generated answers, enabling claims to be traced directly to document offsets. Versioned RDF stores and event log replay enable full recoverability of memory state at any past time, supporting external audit and forensic analysis (Dudeja et al., 24 Dec 2025, Saha, 7 Jul 2025).
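The replay mechanism behind this recoverability is simple to demonstrate: fold the event log forward up to a chosen timestamp. Event field names here are assumptions for illustration.

```python
def replay(events, as_of):
    """Fold INSERT/UPDATE/DELETE events with timestamp <= as_of
    into the memory state at that moment."""
    state = {}
    for ev in sorted(events, key=lambda e: e["ts"]):
        if ev["ts"] > as_of:
            break
        if ev["op"] in ("INSERT", "UPDATE"):
            state[ev["id"]] = ev["value"]
        elif ev["op"] == "DELETE":
            state.pop(ev["id"], None)
    return state

log = [
    {"ts": 1, "op": "INSERT", "id": "t1", "value": "v_old"},
    {"ts": 2, "op": "UPDATE", "id": "t1", "value": "v_new"},
    {"ts": 3, "op": "DELETE", "id": "t1", "value": None},
]

before_delete = replay(log, as_of=2)   # state as of ts=2
final = replay(log, as_of=3)           # fully replayed
```

Because deletions are themselves logged rather than destructive, an auditor can recover the state at any timestamp, which is exactly what forensic analysis requires.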
5. Retrieval, Fusion, and Answer Generation
Upon query, memory systems perform retrieval using semantic similarity (e.g., cosine distance in embedding space) or graph-based selection (e.g., SPARQL or Cypher queries). Retrieved fact vectors or episodic fragments are fused in downstream model heads (e.g., a 6-layer Transformer with gated multihead attention in SMART), combining the hidden state $h$ with the aggregated retrieved memory context $m$ via a learnable gate $g$, as in $\tilde{h} = g \odot m + (1 - g) \odot h$ (Dudeja et al., 24 Dec 2025). In narrative-endowed memory systems, episodic retrieval is conducted via LLM prompts selecting among plot headlines, while semantic retrieval is mediated through graph queries (Zhou et al., 9 Jan 2026).
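A toy end-to-end sketch of this retrieve-then-fuse step: top-k selection by cosine similarity, then a scalar gated combination of the aggregated memory context with a hidden state. The fixed gate value stands in for what would be a learned parameter; all shapes and values are illustrative.

```python
import numpy as np

def top_k(query, memory, k=2):
    """Return indices and scores of the k most cosine-similar rows."""
    q = query / np.linalg.norm(query)
    M = memory / np.linalg.norm(memory, axis=1, keepdims=True)
    scores = M @ q
    idx = np.argsort(scores)[::-1][:k]
    return idx, scores[idx]

rng = np.random.default_rng(1)
memory = rng.normal(size=(5, 8))                # 5 stored fact vectors
query = memory[3] + 0.01 * rng.normal(size=8)   # query close to item 3

idx, _ = top_k(query, memory, k=2)
m = memory[idx].mean(axis=0)    # aggregated retrieved memory context
h = rng.normal(size=8)          # downstream hidden state
g = 0.7                         # stand-in for a learned gate
fused = g * m + (1 - g) * h     # gated fusion of memory and hidden state
```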
Self-correction layers (e.g., (Saha, 7 Jul 2025)) operate in post-processing, extracting triples from outputs, checking for inconsistencies, and applying targeted corrections. Each correction is logged with original and corrected values and provenance information.
6. Evaluation Metrics and Performance
Evaluation of SANLM systems is protocol-driven and layered by memory type (Zhang et al., 23 Sep 2025):
- Parametric: Closed-book recall, edit differential, privacy risk.
- Contextual: Position-curves, mid-span performance drop.
- External/Procedural: Recall@k, nDCG@k, FActScore (fraction of supported claims), citation precision/recall, cross-session consistency.
- Timeliness: Freshness-hit rates, outdated answer rates.
- Uncertainty: Inter-rater agreement coefficients, risk-coverage curves.
- Operational: Execution time, latency, schema/command fidelity.
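For the external/procedural retrieval metrics above, the standard binary-relevance formulas are worth stating concretely. The following implements Recall@k and nDCG@k over a ranked list of retrieved memory IDs (the IDs themselves are made up for the example).

```python
import math

def recall_at_k(ranked, relevant, k):
    """Fraction of relevant items appearing in the top k."""
    return len(set(ranked[:k]) & relevant) / len(relevant)

def ndcg_at_k(ranked, relevant, k):
    """Normalized discounted cumulative gain with binary relevance."""
    dcg = sum(1 / math.log2(i + 2)
              for i, d in enumerate(ranked[:k]) if d in relevant)
    ideal = sum(1 / math.log2(i + 2)
                for i in range(min(k, len(relevant))))
    return dcg / ideal

ranked = ["m3", "m7", "m1", "m9"]   # retrieval order
relevant = {"m1", "m3"}             # gold set

r = recall_at_k(ranked, relevant, k=3)
n = ndcg_at_k(ranked, relevant, k=3)
```

Here both relevant items appear in the top 3, so Recall@3 is 1.0, while nDCG@3 is below 1.0 because `m1` is ranked third rather than second.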
SMART—using only 45.51M parameters (64% fewer than GPT-2)—achieves 21.3% higher question-answering accuracy than GPT-2 on engineering manuals, with marked reductions in unsupported assertions due to explicit memory lookup (Dudeja et al., 24 Dec 2025). Episodic architectures supported by narrative grouping and momentum-based consolidation achieve higher coverage and lower latency than embedding-only retrievers (e.g., >90% memory coverage via narrative grouping, 50% latency reduction, and superior J-scores on the LOCOMO benchmark (Zhou et al., 9 Jan 2026)).
7. Governance, Versioning, and Safe Updates
Change management and governance frameworks are tightly integrated into auditable memory architectures. The DMM-Gov framework (Zhang et al., 23 Sep 2025) enforces:
- Admission thresholds for updates and edits (Edit Success Rate, Locality, Drawdown).
- Gray rollout with live monitoring.
- Automatic rollback if live metrics breach safety or fidelity thresholds.
- Issuance of audit certificates capturing the version ID, evidence set, update path, and rollback instructions.
All reads, writes, and updates—including re-indexing, PEFT, DAPT, and model-editing—are logged with full context, enabling post-hoc validation, rollback, and provenance reconstruction.
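The admission-then-rollback loop can be sketched as a threshold gate over the live metrics named above (Edit Success Rate, Locality, Drawdown). The threshold values and metric keys here are illustrative assumptions, not the DMM-Gov framework's actual parameters.

```python
# Illustrative admission thresholds for a candidate memory update.
THRESHOLDS = {"edit_success_rate": 0.95, "locality": 0.90, "drawdown_max": 0.02}

def admit(metrics: dict):
    """Return ("admit" | "rollback", reasons) for a gray-rollout update.
    Any breached threshold triggers automatic rollback."""
    reasons = []
    if metrics["edit_success_rate"] < THRESHOLDS["edit_success_rate"]:
        reasons.append("edit success below threshold")
    if metrics["locality"] < THRESHOLDS["locality"]:
        reasons.append("locality below threshold")
    if metrics["drawdown"] > THRESHOLDS["drawdown_max"]:
        reasons.append("drawdown above ceiling")
    return ("admit" if not reasons else "rollback", reasons)

# Live monitoring during gray rollout feeds measured metrics to the gate.
decision, why = admit({"edit_success_rate": 0.97,
                       "locality": 0.88,
                       "drawdown": 0.01})
```

In a full system, the decision, the metric snapshot, and the rollback instructions would all be bundled into the audit certificate for that version.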
In summary, structured auditable natural-language memory systems represent an architectural paradigm enabling LLMs to ground outputs in explicit, traceable, and updatable memory substrates. They support robust fact extraction, canonicalization, memory-oriented reasoning, dynamic correction, operational transparency, and empirical reproducibility across diverse deployment contexts (Dudeja et al., 24 Dec 2025, Zhang et al., 23 Sep 2025, Wang et al., 14 Sep 2025, Ganguli et al., 8 May 2025, Saha, 7 Jul 2025, Gonzalez et al., 1 Nov 2025, Zhou et al., 9 Jan 2026).