
Image-Centric Memory Architectures

Updated 29 December 2025
  • Image-centric memory is a framework where visual features are stored in dedicated banks to drive multi-modal reasoning and perception.
  • It employs mechanisms like ROI-based retrieval, dot-product attention, and kNN search to enable efficient image editing and large-scale retrieval.
  • Integrating these architectures enhances interpretability, scalability, and robustness in applications ranging from navigation to generative tasks.

An image-centric memory framework refers to a category of computational and neural network architectures in which the explicit storage, retrieval, or manipulation of image-derived representations underpins memory-augmented reasoning, perception, generation, or decision-making. In contrast to designs centered on symbolic tokens, language, or object slots, image-centric memory maintains and operates over visual features or visual region embeddings, either in raw, compressed, or learned latent forms. These frameworks can be instantiated in areas including multi-image reasoning in vision-LLMs, lifelong agent navigation, continual learning, few-shot perceptual tasks, image editing, and generative or discriminative tasks. Key developments include test-time retrieval-augmented transformers, database-backed memory for image similarity, attention-based memory access, and structured, class-aware style banks.

1. Formal Structure and Mechanisms

Several recent architectures operationalize image-centric memory as a dedicated read-only or read–write bank, storing visual feature keys and values derived from either convolutional/transformer encoders or latent spaces. A canonical instance is the Retrieval-based Image Feature Reasoning Enhancement Module (RIFREM) from the CMMCoT framework (Zhang et al., 7 Mar 2025). For $N$ input images and $L$ cross-modal decoder layers, RIFREM's memory bank $\mathcal{M}$ comprises tuples

$$\mathcal{M} = \{ (K_i^l, V_i^l) \mid 1 \leq i \leq N,\ 1 \leq l \leq L \},$$

where $K_i^l \in \mathbb{R}^{T \times d_k}$ and $V_i^l \in \mathbb{R}^{T \times d_v}$ represent the cross-attention keys and values for all $T$ visual tokens in image $i$ at layer $l$. The number of memory slots is thus $N \times L$, each containing multi-token representations.

Reading from image-centric memory typically involves identifying a Region-of-Interest (ROI) during sequence generation, then cropping and encoding it to obtain a query vector $Q$. Retrieval is realized by scaled dot-product attention between $Q$ and the relevant subset of memory keys/values $K_M, V_M$:

$$Q' = \text{softmax}\!\left( \frac{Q K_M^\top}{\sqrt{d_k}} \right) V_M.$$

The refined vector $Q'$ is injected into subsequent decoder layers, directly influencing later reasoning steps, as in the multi-modal chain-of-thought formulation. Notably, memory contents are fixed at inference time, with no parameter updates during retrieval; this maintains computational efficiency and reproducibility.
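For concreteness, the following NumPy sketch shows a minimal read-only key/value bank of this kind and its attention-based read. The class name, shapes, and random placeholder contents are illustrative assumptions, not the CMMCoT implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class ImageKVMemory:
    """Hypothetical read-only bank of per-image, per-layer keys/values."""

    def __init__(self):
        # slots[(image_idx, layer_idx)] = (K, V), K: (T, d_k), V: (T, d_v)
        self.slots = {}

    def write(self, image_idx, layer_idx, K, V):
        self.slots[(image_idx, layer_idx)] = (K, V)

    def read(self, Q, image_idx, layer_idx):
        """Scaled dot-product attention of query Q (m, d_k) over one slot."""
        K, V = self.slots[(image_idx, layer_idx)]
        d_k = K.shape[-1]
        attn = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)   # (m, T)
        return attn @ V                                    # (m, d_v), the refined Q'

# Usage sketch: populate the bank from N images and L layers, then read.
N, L, T, d = 2, 4, 16, 64
memory = ImageKVMemory()
for i in range(N):
    for l in range(L):
        memory.write(i, l, np.random.randn(T, d), np.random.randn(T, d))

Q = np.random.randn(1, d)          # stands in for an encoded ROI crop
Q_refined = memory.read(Q, image_idx=0, layer_idx=2)
```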

Other paradigms leverage k-nearest-neighbor (kNN) databases for ultra-large-scale visual retrieval, as in the “visual memory” framework (Geirhos et al., 15 Aug 2024), or layerwise storage of edit histories for iterative image generation (Kim et al., 2 May 2025). In all cases, explicit memory access enables dynamic, data-driven expansion of effective model capacity beyond what is possible via architectural scale alone.

2. Integration with Reasoning, Generation, or Perceptual Pipelines

Image-centric memory is often embedded within larger chain-of-thought, generation, or perception pipelines. In CMMCoT, the memory is closely tied to a multi-modal, chain-of-thought reasoning process. At each reasoning step, production of an image tag (e.g., <IMG>i</IMG>) and bounding box triggers the cropping, encoding, and memory retrieval process. The entity-level features generated in this manner refine subsequent decoder states, enhancing region-specific reasoning while maintaining interpretability (Zhang et al., 7 Mar 2025).
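A single such reasoning step can be pictured as a crop-encode-retrieve routine on top of the hypothetical ImageKVMemory bank from the previous sketch. The encode_crop placeholder stands in for a real vision encoder and is an assumption, not part of the published method.

```python
import numpy as np

def encode_crop(image, bbox, d=64):
    """Placeholder ROI encoder: mean-pool the crop and tile to width d."""
    x0, y0, x1, y1 = bbox
    crop = image[y0:y1, x0:x1]
    return np.full((1, d), crop.mean())   # stand-in for a learned embedding

def roi_retrieval_step(images, memory, image_idx, bbox, layer_idx):
    """One ROI-triggered step: crop -> encode -> attend over the (image, layer) slot."""
    Q = encode_crop(images[image_idx], bbox)
    return memory.read(Q, image_idx, layer_idx)   # refined query Q'

# Usage with the memory built above; images are random stand-ins.
images = [np.random.rand(224, 224, 3) for _ in range(N)]
Q_refined = roi_retrieval_step(images, memory, image_idx=0,
                               bbox=(10, 10, 80, 80), layer_idx=2)
```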

In sequential image editing, layer-wise memory stores historical latent representations and textual prompt embeddings at each diffusion block. This enables both “background consistency guidance”—mask-restricted blending of old and new content at each edit step—and “multi-query disentangled cross-attention,” which performs masked, disentangled attention to past and current edit queries within each block (Kim et al., 2 May 2025).
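A simplified reading of the background-consistency step is a mask-restricted blend of the cached and newly generated latents at each block, as in the sketch below; the function name and tensor shapes are assumptions rather than the exact formulation of Kim et al.

```python
import numpy as np

def background_consistency_blend(stored_latent, new_latent, edit_mask):
    """Keep unedited regions from the stored (historical) latent and take
    edited regions from the newly generated latent.

    stored_latent, new_latent: (C, H, W) latents cached per diffusion block
    edit_mask: (H, W) binary mask, 1 inside the region currently being edited
    """
    m = edit_mask[None, ...]                     # broadcast the mask over channels
    return m * new_latent + (1.0 - m) * stored_latent
```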

In visual memory databases for classification or retrieval, the system decouples representation from storage. A pre-trained embedding function $f: X \to \mathbb{R}^d$ maps any input $x$ to $z = f(x)$, and large-scale indexing/querying structures (e.g., ScaNN, HNSW) enable scalable, interpretable, and highly flexible reasoning by literally searching over codes derived directly from image data (Geirhos et al., 15 Aug 2024).
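A bare-bones version of this decoupling, substituting exact cosine-similarity search for ScaNN or HNSW, might look as follows. The class, its majority-vote readout, and the remove operation are illustrative assumptions.

```python
import numpy as np
from collections import Counter

class VisualMemoryDB:
    """Hypothetical flat feature database: store z = f(x) with labels, query by kNN."""

    def __init__(self):
        self.features = []   # list of (d,) embeddings
        self.labels = []

    def add(self, z, label):
        self.features.append(np.asarray(z, dtype=np.float32))
        self.labels.append(label)

    def query(self, z, k=5):
        """Exact cosine-similarity kNN; real systems swap in ScaNN/HNSW here."""
        Z = np.stack(self.features)
        Z = Z / np.linalg.norm(Z, axis=1, keepdims=True)
        q = np.asarray(z) / np.linalg.norm(z)
        idx = np.argsort(-Z @ q)[:k]
        votes = Counter(self.labels[i] for i in idx)
        return votes.most_common(1)[0][0], idx

    def remove(self, indices):
        """Exact unlearning by deleting entries; no retraining is needed."""
        keep = [i for i in range(len(self.features)) if i not in set(indices)]
        self.features = [self.features[i] for i in keep]
        self.labels = [self.labels[i] for i in keep]
```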

3. Training Strategies, Supervision, and Memory Construction

The construction of image-centric memory can be static (offline), dynamic (online/episodic), or a hybrid of the two. In supervised memory-augmented transformers, memory is built from training-set features and maintained as a persistent resource. In CMMCoT, the two-stage supervision process first trains on a large multi-image dataset (CMMCoT-260K), then mixes in general multimodal data for broader transfer. The explicit next-token cross-entropy loss over generated tokens suffices; there are no auxiliary losses applied to the memory keys/values, and memory refinement is supervised only indirectly via ROI generation and downstream QA accuracy (Zhang et al., 7 Mar 2025).

In memory-augmented visual working memory models such as Hebb–Rosenblatt memory and STAWM, the associative memory weights are updated online using outer-product Hebbian rules and then queried for downstream classification or reconstruction (Harris et al., 2019).
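In outline, such a memory accumulates outer products of paired activity vectors and is queried by matrix multiplication. The sketch below illustrates this update-then-read pattern under assumed shapes and learning rate; it is not the exact STAWM architecture.

```python
import numpy as np

class HebbianMemory:
    """Fast-weight associative memory updated with an outer-product Hebbian rule."""

    def __init__(self, d_in, d_out, lr=0.1):
        self.W = np.zeros((d_out, d_in))
        self.lr = lr

    def write(self, pre, post):
        """Hebbian update: strengthen weights between co-active pre/post vectors."""
        self.W += self.lr * np.outer(post, pre)

    def read(self, pre):
        """Query the memory with a cue and return the associated pattern."""
        return self.W @ pre

# Usage: associate a glimpse embedding with a target, then recall it from the cue.
mem = HebbianMemory(d_in=32, d_out=16)
cue, target = np.random.randn(32), np.random.randn(16)
mem.write(cue, target)
recalled = mem.read(cue)
```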

In kNN-based or database-backed memory frameworks, construction is purely algorithmic: the image features from all training images (or a subset) are stored, and the system is trained to maximize either classification accuracy or retrieval-based objective functions. Operations like hard/soft pruning and perfect unlearning are trivial to achieve via database manipulation (Geirhos et al., 15 Aug 2024).
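Continuing the hypothetical VisualMemoryDB sketch from Section 2, pruning or exact unlearning reduces to deleting rows from the database, for example:

```python
import numpy as np

# Assumes the hypothetical VisualMemoryDB class defined in the earlier sketch.
db = VisualMemoryDB()
for z, y in [(np.random.randn(64), "cat"),
             (np.random.randn(64), "dog"),
             (np.random.randn(64), "cat")]:
    db.add(z, y)

query = np.random.randn(64)
label_before, _ = db.query(query, k=3)
db.remove([1])                      # exact unlearning: the offending entry is simply gone
label_after, _ = db.query(query, k=2)
```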

4. Empirical Outcomes and Experimental Benchmarks

Empirical validation has demonstrated the substantive value of image-centric memory in several domains:

  • CMMCoT’s RIFREM module yields significant performance gains on complex multi-image QA and reasoning tasks: +6.1 pp on Mantis-Eval, +4.7 pp on NLVR2, and consistent gains relative to Qwen2-VL baselines (Zhang et al., 7 Mar 2025).
  • In visual memory databases using DINOv2 ViT-L/14, scaling the memory to 1.28M ImageNet entries pushes kNN accuracy to ~83.6%; scaling to billion-entry memory (JFT pseudo-labels) gives ~87% (Geirhos et al., 15 Aug 2024).
  • Iterative image editing with layer-wise memory achieves higher semantic and visual alignment in multi-step editing tasks (BLEU-2/3/4: 64.99/47.69/36.59 at 1024×1024 resolution), outperforming inpainting, in-layout, and baseline methods (Kim et al., 2 May 2025).
  • In navigation/cognitive mapping, compressed image-centric memory enables an agent to attend over hundreds of historical frames and yields state-of-the-art improvement in navigation success rate and path efficiency on complex embodied tasks (Ren et al., 25 Dec 2025).

These results indicate that, for both reasoning and generation, incorporating explicit image-derived memory structures significantly augments the problem-solving capabilities of vision-centric models, particularly in multi-context and multi-step tasks.

5. Comparative Variants and Operations

Image-centric memory manifests in several architectural regimes, distinguished by memory organization and access patterns:

Framework | Memory Type | Access Pattern | Supervision/Training
CMMCoT (RIFREM) (Zhang et al., 7 Mar 2025) | Read-only key/value bank | ROI-based dot-product retrieval | Stagewise next-token cross-entropy loss
kNN visual memory (Geirhos et al., 15 Aug 2024) | Flat feature database | Approximate NN search | Embedding pretraining only
Layerwise generation (Kim et al., 2 May 2025) | Per-layer edit history | Masked blending and multi-query attention | Diffusion and classifier-free guidance
Hebbian visual memory (Harris et al., 2019) | Fast-weight outer-product matrix | Sequential associative update | Joint cross-entropy + MSE

Notably, all approaches balance parameter efficiency (no/few learned parameters in memory structure itself) with scalable, dynamic extension of model capacity—either at test-time or continuously.

6. Interpretability, Flexibility, and Practical Considerations

Explicit image-centric memory imparts several interpretability and flexibility advantages:

  • Clear attribution: predictions can be traced directly to the specific retrieved images or memory slots that support them (Geirhos et al., 15 Aug 2024).
  • Intervention: easy manual unlearning or editing of stored entries. For example, removing a mislabeled training sample from kNN memory alters downstream predictions immediately, without retraining.
  • Lifelong and continual learning: external memory can be dynamically extended as new classes or domains emerge, without catastrophic forgetting or model retraining bottlenecks.
  • Efficient scaling: Visual memory can be compressed (e.g., via tokenization or latent quantization) to increase context capacity, supporting long-horizon agents and high-resolution multi-image tasks (Ren et al., 25 Dec 2025).

However, there are trade-offs: memory access demands rapid indexing at large scale, interpretability degrades if the underlying encoder changes, and raw kNN accuracy may still lag behind fully end-to-end trained networks unless the memory is massive. Additionally, all architectures are constrained by the quality of the visual features fed into the memory, which necessitates periodically upgrading or recalibrating the image encoder under substantial distribution shift.

7. Applications and Future Directions

Image-centric memory frameworks are foundational in multi-modal reasoning, lifelong navigation, large-scale image storage and retrieval, memory-augmented captioning, and interactive, iterative image editing. Ongoing research targets challenges in learnable memory access policies, visual-semantic information fusion, automatic pruning and curation, and hardware/algorithm co-design for efficient memory management at scale. Generalization to richer modalities (e.g., video-centric memory, cross-modal associative memory) and integration with unsupervised/self-supervised representation learning are key future frontiers.

The rapid evolution of image-centric memory design underscores its centrality in next-generation, interpretable, scalable, and persistent visual intelligence architectures across both multimodal and pure vision tasks (Zhang et al., 7 Mar 2025, Geirhos et al., 15 Aug 2024, Kim et al., 2 May 2025, Ren et al., 25 Dec 2025).
