PCMB: Perceptual–Cognitive Memory Bank

Updated 30 November 2025

PCMB is an architectural memory system that fuses detailed perceptual inputs with cognitive summaries to support temporally extended decision-making.
It employs cross-modal encoding, token banks, and attention-based retrieval mechanisms to consolidate episodic experiences efficiently.
Empirical evaluations in robotic manipulation reveal significant performance gains through joint perceptual–cognitive data fusion.

A Perceptual–Cognitive Memory Bank (PCMB) is an architectural principle for memory systems in autonomous agents and robotic models, integrating fine-grained perceptual details with high-level cognitive summaries for temporally extended decision-making. PCMB structures enable agents to buffer, retrieve, and consolidate episodic experiences across long horizons, capturing both verbatim sensory experiences and semantic insights. Recent frameworks such as MemoryVLA for robotic manipulation operationalize PCMBs via specialized token banks and cross-modal fusion, while universal memory architectures formalize such banks as minimal median-graph models derived from sensory equivalence classes and logical relations among observations.

1. Structural Definition and Mathematical Foundations

A PCMB comprises parallel streams for perceptual and cognitive modalities. In MemoryVLA, the working memory at timestep $t$ is defined as:

$M_{\text{wk}}^{(t)} = \{ p_t \in \mathbb{R}^{N_p \times d_p},\; c_t \in \mathbb{R}^{1 \times d_c} \}$

where $p_t$ are $N_p$ perceptual tokens (dimension $d_p$), and $c_t$ is a cognitive token (dimension $d_c$) summarizing semantic context (Shi et al., 26 Aug 2025).

The memory bank proper factorizes into two episodic stores:

$M_{\text{pcmb}} = \{ M_p,\; M_c \}$

with $M_p = \{ m^p_i \in \mathbb{R}^{N_p \times d_p} \}_{i=1}^L$ and $M_c = \{ m^c_i \in \mathbb{R}^{1 \times d_c} \}_{i=1}^L$, storing $L$ entries per stream.
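The two definitions above can be read as a pair of plain containers. The following is a minimal sketch, not the MemoryVLA implementation: the class names `WorkingMemory` and `MemoryBank`, the `timesteps` list, and the small dimensions used in the demo are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field
import numpy as np


@dataclass
class WorkingMemory:
    p: np.ndarray  # perceptual tokens p_t, shape (N_p, d_p)
    c: np.ndarray  # cognitive token c_t, shape (1, d_c)


@dataclass
class MemoryBank:
    capacity: int                                  # L, maximum entries per stream
    M_p: list = field(default_factory=list)        # perceptual store
    M_c: list = field(default_factory=list)        # cognitive store
    timesteps: list = field(default_factory=list)  # timestep of each entry (assumed bookkeeping)

    def append(self, wm: WorkingMemory, t: int) -> None:
        """Store one episode in both streams."""
        self.M_p.append(wm.p)
        self.M_c.append(wm.c)
        self.timesteps.append(t)

    def is_over_capacity(self) -> bool:
        return len(self.M_p) > self.capacity


# Demo with N_p = 256 (from the text) and illustrative d_p = 8, d_c = 16.
rng = np.random.default_rng(0)
bank = MemoryBank(capacity=4)
for t in range(3):
    bank.append(WorkingMemory(p=rng.normal(size=(256, 8)),
                              c=rng.normal(size=(1, 16))), t)
```

Keeping the two stores as separate lists mirrors the factorization $M_{\text{pcmb}} = \{M_p, M_c\}$ and lets each stream be retrieved from and consolidated independently.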

In symbolic and discrete formulations, the PCMB is represented as a weak poc-set $(S, \le, *)$, where $S$ is a set of Boolean sensors, $*$ is sensor complementation, and $\le$ codes observed logical implications ($a \le b$ if $a$'s firing implies $b$'s firing). The associated dual cubical complex $X(P) = \text{Cube}(P)$ models the combinatorial state space, capturing all coherent sensor selections consistent with agent beliefs (Guralnik et al., 2015).
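To make "coherent sensor selections" concrete, the sketch below enumerates them by brute force for a toy poc-set. It is a simplified illustration under stated assumptions: sensors are encoded as hypothetical `(name, sign)` pairs, any distinguished minimal element of the formal definition is omitted, and a selection is taken to be coherent when it picks exactly one of each sensor/complement pair and is closed under the recorded implications. The example sensors `ge1`, `ge2`, `ge3` ("position is at least k" on a four-cell path) are invented for the demo.

```python
from itertools import product


def comp(s):
    """Complement of a sensor literal (name, sign)."""
    name, positive = s
    return (name, not positive)


def close_implications(pairs):
    """Close a <= b under contraposition (b* <= a*) and transitivity."""
    rel = set(pairs) | {(comp(b), comp(a)) for a, b in pairs}
    changed = True
    while changed:
        changed = False
        for a, b in list(rel):
            for c, d in list(rel):
                if b == c and (a, d) not in rel:
                    rel.add((a, d))
                    changed = True
    return rel


def coherent_selections(names, pairs):
    """All complete selections that respect every implication a <= b."""
    rel = close_implications(pairs)
    out = []
    for signs in product([False, True], repeat=len(names)):
        sel = set(zip(names, signs))
        # Coherent: whenever a is selected and a <= b, b is selected too.
        if all(b in sel for a, b in rel if a in sel):
            out.append(sel)
    return out


# Threshold sensors on a path with positions 0..3: ge3 <= ge2 <= ge1.
names = ["ge1", "ge2", "ge3"]
pairs = [(("ge3", True), ("ge2", True)), (("ge2", True), ("ge1", True))]
vertices = coherent_selections(names, pairs)
print(len(vertices))  # 4
```

Of the eight possible sign assignments only four survive, one per position on the path, which is the sense in which the cubing's vertices track the states the agent can consistently believe it is in. The exhaustive enumeration is exponential in the number of sensors and is meant only for small examples.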

2. Encoding, Retrieval, and Consolidation Mechanisms

In MemoryVLA, each observation is encoded by pretrained vision and language models:

Perceptual encoding: Dense visual features from DINOv2 and SigLIP are pooled and bottlenecked, yielding the perceptual tokens $p_t$ ($N_p = 256$).
Cognitive encoding: Visual features are projected into the language space, concatenated with the tokenized instruction, and processed by LLaMA-7B; the output at the EOS position is taken as the cognitive token $c_t$.

Retrieval runs cross-attention over each stream $x \in \{p, c\}$ in parallel, with a sinusoidal timestep encoding $TE(\cdot)$ added to the keys:

Keys $K_x = [m^x_1 + TE(t_1); \ldots; m^x_L + TE(t_L)]$, values $V_x = [m^x_1; \ldots; m^x_L]$, and query $q_x = p_t$ or $c_t$.
Attention weights over the keys produce a weighted combination of the values; Transformer layers then contextualize the result, giving the retrieved embedding $H_x$.
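The retrieval step for a single stream can be sketched in NumPy as follows. This is an illustrative sketch under assumptions rather than the published implementation: it uses plain scaled dot-product attention with no learned projections, a standard sine/cosine layout for $TE$, and it omits the Transformer layers that follow. The function names `timestep_encoding` and `retrieve` are hypothetical.

```python
import numpy as np


def timestep_encoding(t, d):
    """Sinusoidal encoding of timestep t into d dimensions (d must be even)."""
    i = np.arange(d // 2)
    freqs = 1.0 / (10000.0 ** (2.0 * i / d))
    return np.concatenate([np.sin(t * freqs), np.cos(t * freqs)])


def retrieve(q, memory, timesteps):
    """Cross-attention of query tokens over one memory stream.

    q:         (N, d) current tokens (p_t or c_t)
    memory:    (L, N, d) stored entries m_1..m_L
    timesteps: length-L sequence of timesteps t_1..t_L
    Returns the attended embedding (N, d) and the weights (N, L*N).
    """
    L, N, d = memory.shape
    te = np.stack([timestep_encoding(t, d) for t in timesteps])  # (L, d)
    K = (memory + te[:, None, :]).reshape(L * N, d)  # keys carry time information
    V = memory.reshape(L * N, d)                     # values are the raw entries
    scores = q @ K.T / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V, w


rng = np.random.default_rng(0)
memory = rng.normal(size=(3, 4, 8))  # L=3 entries, N=4 tokens, d=8 (toy sizes)
q = rng.normal(size=(4, 8))
H, w = retrieve(q, memory, timesteps=[0, 1, 2])
```

Adding $TE$ to the keys but not the values lets the attention scores depend on when an entry was stored while the retrieved content stays free of positional offsets.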

Consolidation first merges observed and retrieved tokens via gate fusion:

$g_x = \sigma(\text{MLP}([q_x; H_x])),\quad \tilde{x} = g_x \odot H_x + (1-g_x) \odot q_x$

Then, when a stream exceeds its capacity $L$, a “sliding-window” merge step coalesces the most similar pair of entries (by cosine similarity), keeping the memory compact and non-redundant.
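Both consolidation steps can be sketched together. The code below is a hedged illustration, not the authors' code: the two-layer MLP with a ReLU, the restriction of the merge to temporally adjacent entries, the use of flattened entries for cosine similarity, and merging by simple averaging are all assumptions made for the example, as are the function names.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def gate_fusion(q, H, W1, b1, W2, b2):
    """x_tilde = g * H + (1 - g) * q, with g = sigmoid(MLP([q; H])).

    q, H: (N, d). W1: (2d, h), b1: (h,), W2: (h, d), b2: (d,).
    The two-layer ReLU MLP is an assumed architecture.
    """
    z = np.concatenate([q, H], axis=-1)        # (N, 2d)
    hidden = np.maximum(z @ W1 + b1, 0.0)      # (N, h)
    g = sigmoid(hidden @ W2 + b2)              # (N, d), gate in (0, 1)
    return g * H + (1.0 - g) * q


def merge_most_similar(entries, capacity):
    """While over capacity, average the most similar adjacent pair.

    entries: list of (N, d) arrays in temporal order.
    Adjacency and averaging are assumptions of this sketch.
    """
    entries = list(entries)
    while len(entries) > capacity:
        flat = [e.ravel() for e in entries]
        sims = [
            float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
            for a, b in zip(flat[:-1], flat[1:])
        ]
        i = int(np.argmax(sims))
        entries[i:i + 2] = [0.5 * (entries[i] + entries[i + 1])]
    return entries


# Toy usage: three one-token entries, capacity two.
a = np.array([[1.0, 0.0]])
b = np.array([[1.0, 0.1]])
c = np.array([[0.0, 1.0]])
compact = merge_most_similar([a, b, c], capacity=2)
```

In the toy usage the first two entries point in nearly the same direction, so they are the pair that gets averaged, while the dissimilar third entry is kept as is.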

In universal architectures, empirical updates maintain snapshot weights $w_{ab}$ for sensor coactivation, reconstructing the poc-graph and enforcing coherence through state projection algorithms, at a cost of $O(|S|^2)$ per update cycle (Guralnik et al., 2015).
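A rough sketch of how pairwise coactivation statistics can yield implications is given below. It is a simplified stand-in and not the update rule of Guralnik et al.: here $w_{ab}$ is a raw co-occurrence count over sensors and their complements, and $a \le b$ is inferred whenever $a$ has fired but has never fired together with $b^*$. The class name `Snapshot`, the literal indexing scheme, and this zero-count inference rule are assumptions for illustration.

```python
import numpy as np


class Snapshot:
    """Co-occurrence counts over 2n literals: sensors 0..n-1, complements n..2n-1."""

    def __init__(self, n):
        self.n = n
        self.w = np.zeros((2 * n, 2 * n), dtype=np.int64)

    def comp(self, a):
        """Index of the complementary literal."""
        return (a + self.n) % (2 * self.n)

    def update(self, x):
        """Add one Boolean observation; the outer product costs O(n^2)."""
        x = np.asarray(x, dtype=bool)
        y = np.concatenate([x, ~x]).astype(np.int64)
        self.w += np.outer(y, y)

    def implications(self):
        """Pairs (a, b) with a <= b: a was seen, and a with b* was never seen."""
        out = []
        for a in range(2 * self.n):
            for b in range(2 * self.n):
                if a == b or b == self.comp(a):
                    continue
                if self.w[a, a] > 0 and self.w[a, self.comp(b)] == 0:
                    out.append((a, b))
        return out


# Threshold sensors "position >= k" for k = 1, 2, 3 on a path with positions 0..3.
snap = Snapshot(3)
for pos in range(4):
    snap.update([pos >= 1, pos >= 2, pos >= 3])
learned = set(snap.implications())
```

After one pass over the four positions, the learned relation contains the nesting of the threshold sensors and its contrapositives, and nothing else. A practical system would need thresholds or decay to tolerate noisy observations, which this sketch does not attempt.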

3. Planning, Temporal Context, and Action Generation

PCMBs ground temporally aware control via memory-conditioned action experts. In MemoryVLA, a diffusion pipeline (DiT-DDIM) takes the fused perceptual and cognitive representations and predicts a sequence of future 7-DoF actions. Each denoising step combines “cognition-attention” and “perception-attention” cross-attention with step embeddings, after which feedforward layers yield the action output.

Symbolic PCMBs exploit the CAT(0) geometry of $\text{Cube}(P)$: convex goal regions can be defined, and greedy projection algorithms find nearest states by sensor flipping. Goal-directed planning propagates subgoal signals through the poc-graph, with action selection determined by predictive overlap with these subgoals (Guralnik et al., 2015).

4. Minimality, Topological Recovery, and Duality Properties

Given a sample of observations, the dual cubing $\text{Cube}(P)$ constructed from the associated poc-set $P$ is the minimal representation separating all experienced sensory equivalence classes. Any coarser poc-set would collapse distinctions among observed states, violating faithful modeling.

For sufficiently expressive sensors, the subcomplex of the cubing “witnessed” in data (punctured model) recovers the homotopy type of the true underlying state-space, by virtue of the nerve theorem and correspondence between intersections in sensor logic and topological covers.

A category-level duality exists between weak poc-sets and median graphs, with every finite median graph being the 1-skeleton of the cubing derived from some poc-set. Key lemmas characterize coherent sensor selections, median embeddings, and convex half-spaces (Guralnik et al., 2015).

5. Empirical Performance and Comparative Advantages

MemoryVLA demonstrates PCMB’s practical benefit in robotic manipulation:

Achieves +14.6% on SimplerEnv-Bridge, +4.6% on Fractal, +3.3% on LIBERO-5, and +26% on real-world long-horizon tasks relative to baselines.
Ablations show joint perceptual–cognitive memory usage (71.9%) outperforms isolated streams (64.6% and 63.5%); gate fusion and token-merging contribute +4.2% and +5.2% gains.
Retrieval complexity remains $O(L \cdot d^2)$, and the explicit separation of verbatim and semantic content mimics hippocampal memory traces.

PCMB architectures handle non-Markovian domains robustly: they support “remembrance-of-past” for visually subtle temporal tasks, improve generalization under out-of-distribution sensory conditions, and enable data-efficient planning (Shi et al., 26 Aug 2025).

6. Application Case Studies and Concrete Examples

In discrete path environments, the symbolic PCMB formalism supports effective reactive planning:

State sensors encode position; action sensors trigger transitions.
Learned logical nesting translates to cubing geometry (rectangular prism with action branches).
Target states specify convex sets in cubing, and greedy action selection traverses these regions efficiently.

As learning advances, PCMB encodes not only environmental geometry but also causal action consequences, producing a full pipeline from binary sensory perception to symbolic planning (Guralnik et al., 2015).

7. Contextual Significance and Research Directions

PCMB bridges neurocognitive motivation and computational formalism for memory-based control. Biological inspiration includes analogy to human working memory and hippocampal episodic storage. The technical separation and fusion mechanisms enable both preservation of critical information and efficient consolidation.

This suggests that PCMB research offers a principled route to temporally coherent, generalizable, and adaptable decision-making architectures in autonomous robotics and cognitive systems.