LatentGraphMem: Scalable Graph Memory
- LatentGraphMem is a hybrid memory architecture that combines latent graph embeddings with symbolic subgraph retrieval to support interpretable, low-latency long-horizon reasoning.
- It segments texts into overlapping chunks and converts edge relations into latent embeddings, ensuring stable streaming updates and predictable inference times.
- Empirical evaluations show gains over explicit-graph and latent-memory baselines; open directions include improving extraction accuracy and adapting subgraph budgets.
LatentGraphMem is a memory framework for LLMs designed to enable efficient, interpretable, and robust long-horizon reasoning across vast contexts in question-answering tasks. It combines an implicit graph-structured memory in latent space with an explicit subgraph retrieval interface, addressing the bottlenecks faced by both explicit graph memories (which provide interpretability but are brittle at scale) and latent memory schemes (which are efficient but lack transparency). LatentGraphMem builds and stores the knowledge graph as latent embeddings and retrieves only a fixed-budget, symbolic subgraph relevant to the input query. This architecture supports scalable streaming updates, interpretable memory for downstream reasoning, and parameter-efficient adaptation while maintaining predictable inference latency regardless of the input context length (Zhang et al., 6 Jan 2026).
1. Architectural Paradigm and Motivation
LatentGraphMem was motivated by persistent challenges in long-context question answering, where evidence is sparse and distributed across extended texts. Prior memory approaches fall into two major paradigms: explicit graph-based memories (such as entity-relation stores) and latent vector memories (such as soft tokens or embedding tables). The former are interpretable and externally inspectable but degrade sharply on long documents because of structure-induction and retrieval failures. The latter remain robust over lengthy contexts but forgo interpretability and controllability.
LatentGraphMem reconciles these trade-offs by maintaining all knowledge as a graph in latent embedding space for efficient, stable storage, while exposing an explicit, symbolic subgraph—selected by a retrieval module—to the downstream LLM reasoner. This compact subgraph can be inspected by humans under a fixed evidence budget.
2. Latent Graph Memory Construction
Input documents are segmented into overlapping chunks of at most $L$ tokens with overlap $o$. A graph builder module extracts relational triples (head, relation, tail) from each chunk, incrementally constructing a full explicit graph up to a global capacity of $M$ edges. The explicit graph is merged, canonicalized, filtered for schema compliance, and capped in size.
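The chunking step can be illustrated with a minimal sketch. The function name `chunk_tokens` and the parameters `max_len` and `overlap` are hypothetical, and the sliding-window stride is an assumption; the source does not specify how chunk boundaries are chosen.

```python
from typing import List


def chunk_tokens(tokens: List[str], max_len: int, overlap: int) -> List[List[str]]:
    """Split tokens into windows of at most `max_len` tokens.

    Consecutive windows share `overlap` tokens (assumed sliding-window scheme).
    """
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    stride = max_len - overlap
    chunks: List[List[str]] = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # the last window already reaches the end of the document
        start += stride
    return chunks


tokens = [f"t{i}" for i in range(10)]
chunks = chunk_tokens(tokens, max_len=4, overlap=1)
```

With these toy settings, each chunk begins with the last token of the previous one, so a relation that straddles a boundary still appears whole in at least one chunk, provided it is no longer than the overlap.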
Each retained edge $e_i$ is embedded into a $d$-dimensional latent vector $\mathbf{z}_i \in \mathbb{R}^d$. The latent memory $\mathbf{Z}$ is thus a matrix whose rows are the edge embeddings, supporting stable, streaming memory updates. Although pairwise edge scores and Laplacian-based regularization are possible, the current implementation uses only the budget cap for implicit regularization.
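A rough sketch of a capped, streaming edge memory is shown below. Everything in it is illustrative: the class `GraphMemory`, the whitespace and lowercase canonicalization, the policy of dropping edges once capacity is reached, and the random vectors standing in for the learned edge encoder are all assumptions, not the published method.

```python
import numpy as np


def canonicalize(triple):
    """Lowercase and collapse whitespace in each field (assumed normalization)."""
    return tuple(" ".join(part.lower().split()) for part in triple)


class GraphMemory:
    def __init__(self, capacity: int, dim: int, seed: int = 0):
        self.capacity = capacity              # global cap on stored edges
        self.dim = dim                        # embedding dimension
        self.edges = []                       # explicit (head, relation, tail) triples
        self._seen = set()
        self._rng = np.random.default_rng(seed)
        self.Z = np.zeros((0, dim))           # latent memory: one row per edge

    def _embed(self, triple):
        # Placeholder for a learned edge encoder: a random vector (assumption).
        return self._rng.standard_normal(self.dim)

    def add_chunk_triples(self, triples):
        """Merge triples from one chunk, skipping duplicates and respecting the cap."""
        for t in triples:
            c = canonicalize(t)
            if c in self._seen or len(self.edges) >= self.capacity:
                continue
            self._seen.add(c)
            self.edges.append(c)
            self.Z = np.vstack([self.Z, self._embed(c)])


mem = GraphMemory(capacity=3, dim=4)
mem.add_chunk_triples([("Marie Curie", "born_in", "Warsaw"),
                       ("marie  curie", "Born_In", "warsaw")])
mem.add_chunk_triples([("Warsaw", "capital_of", "Poland"),
                       ("Poland", "part_of", "Europe"),
                       ("Europe", "is_a", "continent")])
```

In this toy run the duplicate triple is merged and the fourth distinct edge is dropped by the cap, so the memory ends with three edges and a 3-by-4 matrix regardless of how many further chunks arrive.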
3. Task-Specific Subgraph Retrieval
At inference, a subgraph retriever encodes the query $q$ into a vector $\mathbf{q}$. Each edge embedding $\mathbf{z}_i$ is scored against the query vector using a bilinear form $s_i = \mathbf{q}^\top \mathbf{W} \mathbf{z}_i$, where the matrix $\mathbf{W}$ is learned. The top-$K$ scoring edges are selected under a fixed retrieval budget:

$$\hat{G}_q = \{\, e_i : i \in \mathrm{TopK}(s_1, \dots, s_{|E|};\, K) \,\}$$

During backpropagation, a softmax-based straight-through estimator enables gradient flow. The selected subgraph $\hat{G}_q$ is serialized into a compact symbolic format (e.g., "Relevant Knowledge: [h|r|t] …"), serving as the only externalized content passed to the frozen LLM reasoner.
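The forward pass of this retrieval step (bilinear scoring, top-k selection, and serialization) can be sketched in NumPy. The function name `retrieve_subgraph`, the argument shapes, and the exact spacing of the serialized string are assumptions made for illustration; the gradient path is not modeled here.

```python
import numpy as np


def retrieve_subgraph(q_vec, Z, W, edges, k):
    """Score each edge with the bilinear form q^T W z_i and keep the top k.

    q_vec: (dq,) query vector; Z: (n, d) edge embeddings; W: (dq, d) matrix.
    """
    scores = Z @ (W.T @ q_vec)                   # shape (n,), one score per edge
    k = min(k, len(edges))
    top = np.argsort(-scores, kind="stable")[:k]  # indices of the k highest scores
    selected = [edges[i] for i in top]
    body = " ".join(f"[{h}|{r}|{t}]" for h, r, t in selected)
    return selected, "Relevant Knowledge: " + body


edges = [("a", "r1", "b"), ("b", "r2", "c"), ("a", "r3", "c")]
Z = np.array([[0.1, 0.0], [0.9, 0.0], [0.5, 0.0]])
W = np.eye(2)
q_vec = np.array([1.0, 0.0])

selected, prompt = retrieve_subgraph(q_vec, Z, W, edges, k=2)
# prompt == "Relevant Knowledge: [b|r2|c] [a|r3|c]"
```

Because `k` is fixed, the serialized string has a bounded number of triples no matter how many edges are stored in `Z`.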
4. Training Regimen
LatentGraphMem is trained in three stages with the reasoner held frozen:
- Stage I (Full-Graph Construction): The builder module extracts the explicit full graph and serializes it. Supervision comes from the cross-entropy loss between the LLM-generated answer and the ground truth, given the entire graph and the query.
- Stage II (Latent Subgraph Retrieval): With graph extraction weights frozen, the retriever is trained to select subgraphs within the retrieval budget $K$, minimizing the same loss but with only the retrieved subgraph provided.
- Stage III (Joint Fine-Tuning): Alternates between optimizing full-graph builder steps and joint builder-retriever steps, balancing extraction quality and retrieval efficacy.
All training utilizes QA-style cross-entropy loss routed through the frozen LLM reasoner, with gradients backpropagated via the straight-through TopK operator.
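One common way to write a softmax-based straight-through top-k selection is given below. This is a generic formulation offered as an assumption, not necessarily the exact estimator used by the authors; the temperature $\tau$ and the stop-gradient operator $\mathrm{sg}(\cdot)$ are notation introduced here, with $s_i$ denoting the retrieval score of edge $i$ and $K$ the retrieval budget.

```latex
\begin{aligned}
p_i &= \frac{\exp(s_i/\tau)}{\sum_j \exp(s_j/\tau)}
      && \text{(soft relevance distribution over edges)} \\
m_i &= \mathbb{1}\!\left[\, i \in \mathrm{TopK}(s, K) \,\right]
      && \text{(hard selection mask)} \\
\tilde{m}_i &= m_i + p_i - \mathrm{sg}(p_i)
      && \text{(straight-through mask)}
\end{aligned}
```

In the forward pass $\tilde{m}_i = m_i$, so the reasoner sees exactly the hard top-$K$ subgraph. In the backward pass the gradient of $\tilde{m}$ with respect to the scores equals that of $p$, which is how a loss signal can reach the retriever despite the discrete selection.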
5. Inference Pipeline and Computational Complexity
At inference, the builder parses the document and forms the latent edge memory. The retriever encodes the query, scores the stored edges, selects the top $K$, and serializes the resulting explicit subgraph as input to the reasoner, which generates the answer. Because only the $K$ selected edges are included in the prompt, the reasoner's inference time depends on the retrieval budget $K$, not on the document length $N$. This yields stable, low-latency inference even for very long contexts.
Complexity per module:
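- Builder: cost grows with the document length, processed as a stream of chunks.
- Retriever: one bilinear score per stored edge for each query, with the number of stored edges capped by the graph capacity.
- LLM Reasoner: scales with the retrieval budget rather than with the document length.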
6. Empirical Evaluation
LatentGraphMem's effectiveness is demonstrated on a diverse suite of long-context QA benchmarks, with training on TriviaQA, QASPER, and QuALITY (about 20K instances), and evaluation on HotpotQA (1K), NarrativeQA (800), and WikiHop (800). Qwen2.5-1.5B, SmolLM3-3B, and Qwen3-8B serve as frozen LLM reasoners. Baselines include retrieval-augmented generation (RAG), explicit-graph models (THEANINE, PREMem, Mem0, A-Mem), and the latent memory MemGen.
Main results (average accuracy over the three evaluation benchmarks, by reasoner backbone size):
| Backbone | Reasoner | MemGen | LatentGraphMem |
|---|---|---|---|
| 1.5B parameters | 52.1 | 44.0 | 56.1 |
| 3B parameters | 50.7 | 49.6 | 58.6 |
| 8B parameters | 54.7 | 54.6 | 63.3 |
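The margins implied by the table can be checked with a few lines of arithmetic. The dictionary below simply transcribes the table; the variable names are illustrative, and the column labels are used as they appear in the table.

```python
# Average accuracies transcribed from the table above.
results = {
    "1.5B": {"Reasoner": 52.1, "MemGen": 44.0, "LatentGraphMem": 56.1},
    "3B":   {"Reasoner": 50.7, "MemGen": 49.6, "LatentGraphMem": 58.6},
    "8B":   {"Reasoner": 54.7, "MemGen": 54.6, "LatentGraphMem": 63.3},
}

gains = {}
for backbone, row in results.items():
    best_other = max(row["Reasoner"], row["MemGen"])
    # Margin of LatentGraphMem over the stronger of the other two columns.
    gains[backbone] = round(row["LatentGraphMem"] - best_other, 1)

print(gains)  # {'1.5B': 4.0, '3B': 7.9, '8B': 8.6}
```

On these numbers the margin over the stronger of the two other columns widens as the backbone grows, from 4.0 points at 1.5B to 8.6 points at 8B.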
LatentGraphMem outperforms both explicit-graph and latent-memory baselines at all scales, with the largest gains on multi-hop (HotpotQA) and wide-coverage (WikiHop) benchmarks. Ablation studies show that removing latent retrieval, or replacing it with heuristic BFS retrieval over the explicit graph, each incurs a 3–7 point drop in accuracy. Varying the graph capacity reveals dataset-dependent trade-offs, with performance saturating beyond a certain number of edges.
Inference latency is nearly flat with respect to context length. Reported timings (1.5B backbone, context 16k tokens) are 12.5s for LatentGraphMem versus 10.6s for MemGen and 20.0s for A-Mem at 6k tokens, and 13.8s for LatentGraphMem versus 41.9s for A-Mem at 10k tokens.
7. Limitations and Prospects
LatentGraphMem's strengths include robust scaling to extremely long contexts, fixed-budget explicit evidence for interpretability, and parameter-efficient LoRA-based adaptation for various LLM reasoners. However, the system depends on the quality of graph extraction, and extraction errors degrade downstream QA performance. The graph capacity and retrieval budget require per-task tuning. The model is presently text-only, lacking support for multi-modal or interactive settings.
Potential future directions include incorporation of node embeddings and adjacency prediction, regularization of the latent space (e.g., using Laplacian or contrastive losses), expansion to dialog, multi-agent, or multi-modal documents (including vision+text), and exploration of dynamic subgraph budgets conditioned on question complexity. A plausible implication is that further integration of graph-regularization and adaptivity could yield additional improvements in interpretability and performance (Zhang et al., 6 Jan 2026).
In summary, LatentGraphMem operationalizes the paradigm “store latent, retrieve explicit”, combining efficient, stable memory management with controlled, interpretable reasoning evidence for long-horizon LLM applications.