REMem: Multi-Domain Memory and Representation
- REMem is a polysemous research label that denotes varied memory systems, including dynamic, recurrent, and procedural mechanisms.
- It spans diverse domains such as retrieval-augmented generation, vision-language-action, long-running agents, and radio environment mapping.
- In the memory-centric systems, dynamic retention and context-aware retrieval are reported to improve benchmark results in retrieval, long-running agent, multimodal, and robotic settings.
REMem is not a single standardized technical object in the arXiv literature. The label, together with closely related spellings such as ReMem, ReMe, REM, and “Remember Me”, has been used for several distinct systems spanning retrieval-augmented generation, long-running LLM agents, Large Vision-Language Models (LVLMs), Vision-Language-Action control, procedural memory for tool-using agents, video segmentation, medical image registration, and radio-environment mapping (Bursa, 4 Jan 2026, Gao et al., 13 Nov 2025, Li et al., 13 Mar 2026, Cao et al., 11 Dec 2025, Bagchi et al., 2024, Sun et al., 2021, Wei et al., 2013). In most of the recent agent and multimodal work, the name denotes mechanisms for selective retention, contextual reinstatement, or recurrent consolidation; in other fields, it is an acronymic reuse with unrelated semantics.
| Usage | Domain | Core idea |
|---|---|---|
| ARM / “REMem” (Bursa, 4 Jan 2026) | Retrieval-Augmented Generation | Dynamic memory substrate with selective remembrance and decay |
| “Remember Me” / T-DRS (Gao et al., 13 Nov 2025) | RoPE-based LVLMs | Inference-only compensation for long-range attention decay |
| ReMem-VLA (Li et al., 13 Mar 2026) | Vision-Language-Action | Dual-level recurrent memory queries |
| ReMe (Cao et al., 11 Dec 2025) | Tool-using LLM agents | Dynamic procedural memory with distillation, reuse, and refinement |
| RecMem / RaMem / Evo-Memory ReMem (Dai et al., 15 May 2026, Yang et al., 22 Jun 2026, Wei et al., 25 Nov 2025) | Long-running agents | Recurrence-triggered consolidation, contextual reinstatement, or action-think-memory refinement |
| ReMem benchmark (Kwon et al., 5 May 2026) | LVLM memorization and unlearning | Reliable multi-hop and multi-image memorization benchmark |
| REM (Bagchi et al., 2024, Sun et al., 2021, Wei et al., 2013) | Vision and wireless systems | Diffusion-based video segmentation, resolution enhancement, or radio environment maps |
1. Terminological scope and disambiguation
In current usage, REMem is best understood as a polysemous research label rather than a canonical acronym with a fixed expansion. In some papers it denotes explicit memory systems—dynamic RAG memory, recurrent agent memory, contextual reinstatement, or procedural experience pools—whereas in others it appears as REM for unrelated constructs such as a resolution enhancement module or a radio environment map (Bursa, 4 Jan 2026, Sun et al., 2021, Wei et al., 2013). A common misconception is therefore to treat every “REMem” paper as belonging to a single lineage. The literature does not support that interpretation.
The strongest cluster of usages is memory-centric. These systems typically externalize state, update it over time, and make retention or retrieval conditional on usage, recurrence, context, or utility. By contrast, the video-segmentation, medical-imaging, and wireless-networking papers use REM as a compact acronym for domain-specific modules or maps, not as a unified memory formalism (Bagchi et al., 2024, Sun et al., 2021, Turkmen et al., 2020).
2. Dynamic retrieval and long-running agent memory
In retrieval-augmented generation, Adaptive RAG Memory (ARM) is explicitly described as a “REMem” system. ARM replaces a static vector index with a dynamic memory substrate storing, for each item , an embedding , an access count , a last-access time , and a remembrance flag . Retrieved items are consolidated once , while stale, unremembered items decay according to after a grace period . The paper’s balanced configuration is . On a lightweight retrieval benchmark, ARM reports , 0, 1, with a 22M-parameter embedding layer, and the end-to-end study reports that Llama 3.1 with static RAG achieves 67.2% key-term coverage while GPT-4o with dynamic selective RAG reaches the fastest responses at 8.2 s with 58.7% coverage (Bursa, 4 Jan 2026).
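The per-item bookkeeping described above (access count, last-access time, remembrance flag, consolidation, and decay after a grace period) can be illustrated with a minimal Python sketch. Everything named here is hypothetical: `MemoryItem`, `DynamicMemory`, and the parameters `consolidate_after`, `grace_period`, `decay_rate`, and `drop_below` are illustrative, and the multiplicative decay of a `strength` score is an assumed stand-in rather than the paper's actual rule or configuration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class MemoryItem:
    embedding: List[float]
    access_count: int = 0
    last_access: float = 0.0
    remembered: bool = False
    strength: float = 1.0  # assumed retention score, not taken from the paper


class DynamicMemory:
    def __init__(self, consolidate_after=3, grace_period=10.0,
                 decay_rate=0.5, drop_below=0.1):
        # All four thresholds are hypothetical illustration values.
        self.items: Dict[str, MemoryItem] = {}
        self.consolidate_after = consolidate_after
        self.grace_period = grace_period
        self.decay_rate = decay_rate
        self.drop_below = drop_below

    def add(self, key, embedding, now):
        self.items[key] = MemoryItem(embedding=embedding, last_access=now)

    def touch(self, key, now):
        """Record a retrieval; consolidate once the item is used often enough."""
        item = self.items[key]
        item.access_count += 1
        item.last_access = now
        if item.access_count >= self.consolidate_after:
            item.remembered = True

    def decay(self, now):
        """Weaken stale, unremembered items and drop those that fade out."""
        for key in list(self.items):
            item = self.items[key]
            if item.remembered:
                continue  # consolidated items are protected from decay
            if now - item.last_access > self.grace_period:
                item.strength *= self.decay_rate
                if item.strength < self.drop_below:
                    del self.items[key]
```

In this sketch, `touch` would be called for every item returned by a retrieval and `decay` on a periodic maintenance pass, so that the index shrinks toward the items that are actually used.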
For long-running LLM agents, RecMem recasts consolidation as a recurrence-triggered operation. Each interaction unit 2 is embedded as 3 and stored in a subconscious layer 4. Recurrence is detected by retrieving semantically similar past items and checking 5, where 6. Only then are episodic and semantic memories extracted, followed by semantic refinement to recover omitted details. Experiments report memory-construction token reductions of up to 87% relative to Mem0, A-Mem, and MemoryOS while exceeding their accuracy (Dai et al., 15 May 2026).
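A rough Python sketch of a recurrence trigger of this kind follows. It assumes cosine similarity over embeddings and a simple count of similar past items; the class `RecurrenceBuffer` and the parameters `sim_threshold` and `min_recurrences` are hypothetical names and values, not the paper's, and the downstream episodic and semantic extraction is represented only by the returned group of items.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


class RecurrenceBuffer:
    """Cheap store of raw interaction units; consolidation fires only on recurrence."""

    def __init__(self, sim_threshold=0.8, min_recurrences=2):
        # Both thresholds are hypothetical illustration values.
        self.sim_threshold = sim_threshold
        self.min_recurrences = min_recurrences
        self.store = []  # list of (text, embedding) pairs

    def observe(self, text, embedding):
        """Store the unit; return the recurring group if consolidation should run."""
        similar = [t for t, e in self.store
                   if cosine(e, embedding) >= self.sim_threshold]
        self.store.append((text, embedding))
        if len(similar) >= self.min_recurrences:
            return similar + [text]  # hand this group to the expensive extractor
        return None
```

The design point the sketch tries to capture is that storing a unit costs only an embedding, while the costly extraction step runs only for content that has already shown up several times.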
RaMem addresses a different failure mode, termed context collapse: retrieved memory fragments may be topically relevant yet invalid as evidence for the current query. Its four stages are evidence anchoring, recall condition induction, validity-aware retrieval, and context-preserved synthesis. Each memory is represented as 7, where 8 carries event time, mention time, session span, participants, location, entities, and topic. At query time, a recall frame 9 is induced and used to prioritize context-compatible memories. On long-term memory benchmarks, the paper reports average F1 gains of more than 10% across several backbones (Yang et al., 22 Jun 2026).
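To make the idea of prioritizing context-compatible memories concrete, here is a small Python sketch. It assumes the recall frame has already been induced and is given as a plain dictionary of constraints; the `Memory` class, the `compatibility` score, and the ranking rule are hypothetical simplifications, not the paper's procedure.

```python
from dataclasses import dataclass


@dataclass
class Memory:
    text: str
    context: dict      # e.g. {"participants": {"Ann"}, "location": "Paris"}
    relevance: float   # topical score from an upstream retriever (assumed given)


def compatibility(context, frame):
    """Fraction of recall-frame constraints that the memory's context satisfies."""
    if not frame:
        return 1.0
    hits = 0
    for key, wanted in frame.items():
        have = context.get(key)
        if isinstance(have, (set, list, tuple)):
            hits += wanted in have
        else:
            hits += have == wanted
    return hits / len(frame)


def validity_aware_rank(memories, frame):
    """Order memories by context compatibility first, topical relevance second."""
    return sorted(memories,
                  key=lambda m: (compatibility(m.context, frame), m.relevance),
                  reverse=True)
```

Under this ordering, a memory that is topically close but about the wrong participants or place ranks below a slightly less similar memory whose context matches the query.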
Within the Evo-Memory benchmark, ReMem is an explicit action–think–memory refine loop. At each internal step the agent chooses 0, allowing memory reasoning to become part of the action space rather than a passive RAG component. Under the benchmark’s streaming setup, ReMem reaches an average of 0.65 on the single-turn benchmarks with Gemini 2.5 Flash and 0.50/0.64 average success/progress across the multi-turn environments; with Claude 3.7 Sonnet it reaches 0.58 on the single-turn average and 0.78/0.91 average success/progress on the multi-turn environments, consistently outperforming history-only baselines and improving step efficiency (Wei et al., 25 Nov 2025).
3. Procedural memory and experience-driven agent evolution
In tool-using LLM agents, ReMe (“Remember Me, Refine Me”) is a procedural memory framework built around three mechanisms: multi-faceted distillation, context-adaptive reuse, and utility-based refinement. Memory entries are represented as 1, where 2 is a usage scenario, 3 is experience content, 4 are keywords, 5 is a confidence score, and 6 records tools used. Distillation extracts success patterns, failure analysis, and comparative insights from trajectories; reuse retrieves memories by embeddings of the usage scenario; refinement performs selective addition and deletion based on observed utility (Cao et al., 11 Dec 2025).
The deletion rule is explicitly utility-based: 7 with 8 the retrieval count and 9 the number of successful uses. In the dynamic configuration on BFCL-V3 and AppWorld, Qwen3-8B improves from Avg@4/Pass@4 0 without memory to 1 with dynamic ReMe. On BFCL-V3 alone, the same model improves from 2 to 3, and on AppWorld from 4 to 5. The paper further reports a memory-scaling effect: Qwen3-8B with dynamic ReMe slightly exceeds memoryless Qwen3-14B on Pass@4, and Qwen3-14B with dynamic ReMe exceeds memoryless Qwen3-32B on both Avg@4 and Pass@4 (Cao et al., 11 Dec 2025).
This line of work treats memory as an evolving procedural substrate rather than a trajectory log. The refinement stage is central: full addition from all trajectories yields only 6 on BFCL-V3 for Qwen3-8B, selective addition reaches 7, and selective addition plus reflection plus deletion reaches 8, indicating that memory quality control, not only memory quantity, is the operative variable (Cao et al., 11 Dec 2025).
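The utility-based deletion described above can be sketched in Python as follows. The sketch assumes a rule of the form "delete an entry once it has been retrieved often enough and its success ratio is low"; the exact rule and thresholds are not reproduced here, so `min_retrievals` and `min_utility` are hypothetical, as are the `Experience` class and the `prune` helper.

```python
from dataclasses import dataclass


@dataclass
class Experience:
    scenario: str
    content: str
    retrievals: int = 0   # how often the entry was retrieved
    successes: int = 0    # how often a retrieval led to a successful task


def should_delete(exp, min_retrievals=5, min_utility=0.5):
    """Assumed rule: enough evidence collected, and the success ratio is too low."""
    if exp.retrievals < min_retrievals:
        return False  # too few uses to judge the entry
    return exp.successes / exp.retrievals < min_utility


def prune(pool, **thresholds):
    """Return the experience pool without low-utility entries."""
    return [e for e in pool if not should_delete(e, **thresholds)]
```

Requiring a minimum number of retrievals before judging an entry keeps freshly distilled experiences from being removed before they have had a chance to prove useful.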
4. Multimodal long-range memory, robotics, and unlearning
In RoPE-based Large Vision-Language Models (LVLMs), “Remember Me” denotes T-DRS, a training-free, inference-only modification to attention logits. The method inserts three components between RoPE attention computation and the softmax: 9 where SD-DRS derives a semantic scale from cosine similarity, DC-DRS applies a Gaussian-like distance-aware control term, and reRD-DRS adds a heavy-tailed long-range reinforcement term. On VQA benchmarks, LLaVA1.5-7B improves from 67.9 to 69.2 on ScienceQA, 62.0 to 63.1 on GQA, and 58.2 to 59.0 on TextVQA; analogous gains are reported for InternVL2-8B and Qwen2.5-VL-7B, all without retraining (Gao et al., 13 Nov 2025).
For embodied control, ReMem-VLA adds memory to Vision-Language-Action models through two sets of recurrent queries: frame-level 0 for short-term memory and chunk-level 1 for long-term memory. The frame-level state is updated every step by
2
while the chunk-level state is updated only every 3 frames. A bidirectional connector allows action queries and hindsight queries to read from these recurrent memory slots, and an auxiliary Past Observation Prediction loss 4 strengthens visual memory. On MemoryBench plus a long-horizon task, ReMem-VLA reaches 93, 99, 100, and 86 success, averaging 94.5, versus 0.75 for OpenVLA-OFT, 8.25 for 5, and 1.5 for MemoryVLA. In four real-world robot tasks it reports 82.5% average success, compared with 11% for 6 and 8% for MemoryVLA (Li et al., 13 Mar 2026).
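The two update rates described above (a frame-level state refreshed every step and a chunk-level state refreshed only once per chunk) can be illustrated with a toy Python sketch. This is not the model's update rule: an exponential moving average stands in for the learned recurrent queries, and `TwoRateMemory`, `chunk_len`, `frame_rate`, and `chunk_rate` are hypothetical names and values.

```python
def ema(state, observation, rate):
    """Exponential moving average, used here as a stand-in for a learned update."""
    return [(1 - rate) * s + rate * o for s, o in zip(state, observation)]


class TwoRateMemory:
    def __init__(self, dim, chunk_len=4, frame_rate=0.5, chunk_rate=0.5):
        self.frame_state = [0.0] * dim   # short-term memory, updated every frame
        self.chunk_state = [0.0] * dim   # long-term memory, updated every chunk
        self.chunk_len = chunk_len
        self.frame_rate = frame_rate
        self.chunk_rate = chunk_rate
        self.step_count = 0

    def step(self, observation):
        self.frame_state = ema(self.frame_state, observation, self.frame_rate)
        self.step_count += 1
        if self.step_count % self.chunk_len == 0:
            # Fold the current short-term summary into the slower long-term state.
            self.chunk_state = ema(self.chunk_state, self.frame_state,
                                   self.chunk_rate)
        return self.frame_state, self.chunk_state
```

Because the chunk-level state changes only once per chunk, it forgets more slowly than the frame-level state, which is the property the slow memory path is meant to provide.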
A different multimodal use of the name appears in ReMem, the Reliable Multi-hop and Multi-image Memorization Benchmark for LVLM unlearning. Its premise is that existing LVLM unlearning benchmarks often fail at stage 1: the model never robustly memorizes the target fictitious identities, so unlearning results are unreliable. ReMem therefore scales each identity to 100 images and 100 QA pairs, with an empirically chosen 70% single-hop / 30% multi-hop split. After fine-tuning on ReMem, LLaVA-1.5-7B reaches ROUGE 97.19, GPT-score 95.18, EM 91.50, and held-out 7 81.33; LLaVA-1.5-13B reaches ROUGE 98.92, GPT-score 98.05, EM 96.37, and 8 87.98. The benchmark also introduces Exposure,
9
a normalized rank-based measure of how highly the model internally scores the true sensitive attribute among plausible alternatives (Kwon et al., 5 May 2026).
5. Vision, representation, and medical uses of REM
Not every REM-labeled system is a memory architecture. In video understanding, REM in “ReferEverything” is a framework for referral video segmentation that repurposes a pre-trained text-to-video diffusion model. It retains the original U-Net denoiser, VAE encoder/decoder, and CLIP text encoder, but changes the objective from denoising to mask-latent prediction. Inference is written as
0
REM matches or slightly exceeds state of the art on Ref-DAVIS, reaches 40.4 1 on BURST, 15.2 2 on VSPW stuff categories, and 49.56 3 on the Ref-VPS process benchmark, outperforming VD-IT’s 37.58 by about 12 points (Bagchi et al., 2024).
In knowledge distillation for vision transformers, ReMem denotes a teacher-side modification that couples mutual-information-aware fine-tuning with MLP reweighting. The modified transformer block is
4
which downweights top MLP blocks that the paper identifies as major sources of mutual-information loss. Combined with SAM-based fine-tuning, this turns very strong pretrained ViTs into better teachers: averaged over 16 datasets, a ViT-B teacher improves student performance from 74.0 to 78.3 while teacher accuracy changes from 86.7 to 85.7, and similar reversals of the “stronger teacher, worse student” trend are reported for ViT-Ti, ViT-S, and ViT-L (Dong et al., 29 Jun 2025).
In medical imaging, REM denotes a Resolution Enhancement Module: a lightweight 3D CNN super-resolution front-end plugged into deformable registration networks. The selected design is REM-Variant-I with global image-domain residual learning and configuration 5. In the ReFDRN cascade, the main registration loss is
6
with an auxiliary Huber-based loss on REM outputs. On LPBA40 at 4× upscaling, ReFDRN improves Dice/NCC from 0.6676/0.9920 for trilinear-upsampled FDRN to 0.6736/0.9962, while ReVoxelMorph improves from 0.6593/0.9916 to 0.6676/0.9932 (Sun et al., 2021).
6. Radio environment maps and monitoring
In wireless systems, REM has a much older and unrelated meaning: Radio Environment Map. A foundational formulation partitions a region into 7 meshes and assigns each location a bit-packed radio parameter
8
where 9 indicates whether network 0 is detected. The resulting radio parameter error 1 decreases with mesh density, and the paper derives the scaling law
2
together with a linked notion of geographic entropy and deployment analyses for one-mesh-one-sensor and random sensor placement (Wei et al., 2013).
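As an illustration of a bit-packed radio parameter, the following Python sketch stores one detection flag per network in a single integer. The bit order (network i mapped to bit i, least significant bit first) and the function names are assumptions made for the example, not the paper's notation.

```python
def pack_detections(detected):
    """Pack per-network detection flags into one integer (bit i = network i)."""
    value = 0
    for i, flag in enumerate(detected):
        if flag:
            value |= 1 << i
    return value


def unpack_detections(value, n_networks):
    """Recover the list of detection flags from a packed parameter."""
    return [bool((value >> i) & 1) for i in range(n_networks)]


# A toy map with three mesh cells and three networks (made-up values).
mesh_map = {
    (0, 0): pack_detections([True, False, True]),
    (0, 1): pack_detections([True, True, False]),
    (1, 0): pack_detections([False, False, False]),
}
```

With this encoding, two mesh cells have the same radio parameter exactly when the same set of networks is detected in both, so comparing cells reduces to comparing integers.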
Later work generalizes this idea into G-REM, or generalized radio environment monitoring, which broadens classical REM from spectrum occupancy and interference maps to a multi-dimensional framework including CSI, localization, mobility, network state, device state, and external context. G-REM explicitly integrates sensing modes, sensing methods, mapping methods, external information sources, and applications such as beam management, CoMP, mobility-aware handover, physical-layer security, and RIS deployment (Turkmen et al., 2020).
A concrete autonomous instantiation is the UAV-supported generation of fine-grained 3D indoor REMs. In that system, Crazyflie 2.1 UAVs carrying Wi-Fi scanning receivers visit 72 waypoints inside a 3 volume, collect 2696 samples from 73 MAC addresses, and train an ML regressor to predict RSS at unsampled points. The best reported predictor is a kNN regressor with MAC one-hot features scaled by a factor of 3 and 4, achieving RMSE 5 (Mendes et al., 2021).
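The role of the scaled one-hot MAC features can be seen in a small pure-Python kNN regression sketch. The scale factor of 3 follows the text above; the neighbor count `k`, the feature layout (position followed by the one-hot block), and all sample values are assumptions made for illustration, not the study's data or code.

```python
import math


def featurize(position, mac, mac_vocab, mac_scale=3.0):
    """Concatenate (x, y, z) with a one-hot MAC encoding scaled by mac_scale."""
    one_hot = [mac_scale if mac == m else 0.0 for m in mac_vocab]
    return list(position) + one_hot


def knn_predict(train_X, train_y, query, k=3):
    """Predict RSS as the mean over the k nearest training samples."""
    nearest = sorted(zip(train_X, train_y),
                     key=lambda pair: math.dist(pair[0], query))[:k]
    return sum(y for _, y in nearest) / len(nearest)


# Made-up samples: (position in metres, MAC address, RSS in dBm).
macs = ["aa", "bb"]
samples = [((0, 0, 0), "aa", -40.0), ((1, 0, 0), "aa", -50.0),
           ((0, 0, 0), "bb", -70.0), ((1, 0, 0), "bb", -80.0)]
X = [featurize(p, m, macs) for p, m, _ in samples]
y = [rss for _, _, rss in samples]
prediction = knn_predict(X, y, featurize((0.4, 0, 0), "aa", macs), k=2)
```

Scaling the one-hot block pushes samples from different access points far apart in feature space, so the nearest neighbors of a query are drawn mainly from measurements of the same MAC address.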
Across these literatures, REMem therefore denotes a family of names rather than a single method. In the memory-centric strand, it typically refers to mechanisms for selective retention, contextual verification, or recurrent consolidation under long-horizon inference (Bursa, 4 Jan 2026, Dai et al., 15 May 2026, Yang et al., 22 Jun 2026). In multimodal and robotic work, it extends to long-range attention repair, recurrent latent state, and memorization benchmarking (Gao et al., 13 Nov 2025, Li et al., 13 Mar 2026, Kwon et al., 5 May 2026). In several other fields, however, REM is simply an acronym reused for unrelated modules and maps (Bagchi et al., 2024, Sun et al., 2021, Wei et al., 2013). The term is therefore best treated as a cross-domain label whose meaning depends on the specific paper and application domain.