MemSinks: Memory & Isolation in Video, LLMs, MEMS

Updated 26 May 2026

MemSinks are mechanisms that anchor or isolate information across domains, enabling long-range context retention in video, selective memorization in LLMs, and precise density sensing in MEMS.
In video generation, dynamic MemSinks like DySink utilize a retrieval-based framework and anomaly gating to maintain temporal consistency and prevent sink collapse.
In LLMs, MemSinks confine memorized content to dedicated neurons so that it can be removed after training; in MEMS sensors, they provide density measurements that are largely independent of viscosity.

MemSinks refers to a family of mechanisms, architectures, or device paradigms that serve as context-anchoring, information-localization, or decoupled-sensing elements in computational models and MEMS-based sensors. Three distinct usages of the term appear in recent literature: as long-range memory anchors in autoregressive video generation, as neuron blocks isolating memorized content in LLM training, and as viscosity-decoupled readout structures in MEMS ultrasonic density sensors.

1. MemSinks in Autoregressive Video Generation

In streaming autoregressive diffusion transformer models for long video synthesis, MemSinks (also termed "frame sinks") are explicit allocations of KV-cache state corresponding to previously generated frames that are retained as long-range anchors throughout the generation process. Standard workflows, as in CausVid, Self-Forcing, and LongLive, employ static early-frame MemSinks: an initial block of $S$ frames is kept in memory throughout generation, augmented by a short-term sliding window of the $W$ most recent frames for continuity (Ye et al., 20 May 2026).

The limitation of this static allocation is twofold: (1) as the video progresses and diverges from the starting scene, the relevance of these early frames as context declines rapidly, yet they persist in memory while intermediate (potentially more semantically aligned) frames are evicted under a fixed memory budget $S+W$; (2) repeated attention to these outdated sinks, exacerbated by RoPE-induced phase realignment, can induce "sink collapse", where multiple heads converge in attention to the static sinks, causing generative content to regress to earlier states and resulting in periodic, cyclic, or abruptly resetting motion artifacts.
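The static policy described above can be illustrated with a minimal sketch. The function name `retained_frames` and its parameters are hypothetical and chosen only for illustration; the sketch tracks frame indices rather than real KV tensors.

```python
def retained_frames(t, num_sink, window):
    """Indices of frames kept in the cache when generating frame t.

    Static policy (illustrative): the first `num_sink` frames are always
    kept, plus the `window` most recent frames before t. Every frame in
    between is evicted, no matter how relevant it is to the current scene.
    """
    sinks = list(range(min(num_sink, t)))
    recent_start = max(num_sink, t - window)
    recent = list(range(recent_start, t))
    return sinks + recent


if __name__ == "__main__":
    # With S = 3 and W = 5, frame 100 only sees frames 0-2 and 95-99.
    print(retained_frames(t=100, num_sink=3, window=5))
```

However long the video grows, the cache never exceeds $S+W$ entries, and the sink portion never changes, which is the behavior the two limitations above refer to.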

2. DySink: Dynamic Retrieval-Based MemSinks Framework

The DySink framework addresses static-sink limitations by replacing fixed early-frame caches with a dynamic, retrieval-based memory bank that stores blockwise descriptors and key-value (KV) caches, supporting adaptive context selection (Ye et al., 20 May 2026).

2.1 Memory Structure and Update

Each entry in the DySink memory bank $M = \{ (f_i, K_i) \}$ consists of:

$f_i \in \mathbb{R}^d$: a normalized latent visual descriptor for block $i$, constructed by mean-pooling over $N_i$ frames after encoding with a frozen visual encoder and applying $L_2$ normalization.
$K_i$ : a per-layer set of KV-cache states.

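A deduplication threshold $T_{\text{dedup}}$ (e.g., 0.98) governs admission to $M$ and prevents near-duplicates: a new block is retained only if the similarity between its descriptor and every descriptor already stored in $M$ is below $T_{\text{dedup}}$.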
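The memory structure and admission rule can be sketched as follows. This is a simplified illustration, not the DySink implementation: it assumes the similarity is cosine similarity between unit-norm descriptors, uses plain Python lists in place of encoder features, and treats the KV cache as an opaque object. The names `block_descriptor` and `try_admit` are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def block_descriptor(frame_features):
    """Mean-pool per-frame feature vectors, then L2-normalize the result."""
    n = len(frame_features)
    dim = len(frame_features[0])
    mean = [sum(f[j] for f in frame_features) / n for j in range(dim)]
    return l2_normalize(mean)


def cosine(a, b):
    """Dot product; equals cosine similarity because inputs are unit-norm."""
    return sum(x * y for x, y in zip(a, b))


def try_admit(bank, descriptor, kv_cache, t_dedup=0.98):
    """Append (descriptor, kv_cache) unless a near-duplicate is stored.

    Assumed rule: reject when similarity to any stored descriptor
    reaches the deduplication threshold.
    """
    if any(cosine(descriptor, f) >= t_dedup for f, _ in bank):
        return False
    bank.append((descriptor, kv_cache))
    return True
```

A block whose descriptor points in almost the same direction as a stored one is rejected, so the bank spends its capacity on visually distinct history.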

2.2 Retrieval and Injection

At each generation block, the descriptors of the blocks in the current sliding window are used as queries. Historical blocks in $M$ are scored for relevance by their mean cosine similarity to these queries. The top-$k$ blocks are selected, and their KV caches are concatenated with the sliding-window caches for each transformer layer. This enables conditioning on contextually salient (rather than temporally oldest) history.
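The scoring and selection step can be sketched as below, assuming unit-norm descriptors so that a dot product equals cosine similarity. The function name `retrieve_top_k` and the tie-breaking rule (lower index first) are illustrative assumptions; the KV concatenation itself is omitted.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def retrieve_top_k(bank_descriptors, query_descriptors, k):
    """Indices of the k stored blocks most similar to the queries.

    Each stored descriptor is scored by its mean cosine similarity to
    all query descriptors (inputs are assumed to be unit-norm).
    """
    scores = []
    for idx, f in enumerate(bank_descriptors):
        mean_sim = sum(dot(f, q) for q in query_descriptors) / len(query_descriptors)
        scores.append((mean_sim, idx))
    # Highest score first; ties broken by lower index (an arbitrary choice).
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [idx for _, idx in scores[:k]]


if __name__ == "__main__":
    bank = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
    print(retrieve_top_k(bank, [[0.0, 1.0]], k=2))
```

Selection depends only on descriptor similarity, so a block from far back in the video can be chosen over a more recent but less relevant one.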

3. Stability: Sink Anomaly Gate and Collapse Avoidance

To suppress failure modes arising from collective inter-head convergence onto retrieved context (i.e., sink collapse), DySink employs a per-layer, per-block binary anomaly gate (Ye et al., 20 May 2026). For each layer, the affinity of each attention head to the local and retrieved KV sets is computed, and the fraction of heads preferring retrieval over locality is tallied for each block. If this fraction exceeds a threshold (typ. 0.5), the problematic context is dropped (gate closed); otherwise it is retained. This low-cost, hard gating scheme prevents homogenization of attention and cyclic resets.
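The gating decision for a single layer and block can be sketched as follows. The affinity scores are taken as given inputs (for example, attention mass per head), and both the strict comparison between them and the direction of the threshold test are assumptions made for illustration; `sink_anomaly_gate` is a hypothetical name.

```python
def sink_anomaly_gate(local_affinity, retrieved_affinity, tau=0.5):
    """Decide whether retrieved context is kept for one layer and block.

    local_affinity[h] and retrieved_affinity[h] are per-head affinity
    scores. Returns True to keep the retrieved KV entries and False to
    drop them.
    """
    prefer_retrieved = sum(
        1 for loc, ret in zip(local_affinity, retrieved_affinity) if ret > loc
    )
    fraction = prefer_retrieved / len(local_affinity)
    # Too many heads locking onto retrieved context is treated as an anomaly.
    return fraction <= tau


if __name__ == "__main__":
    # One of four heads prefers retrieval: context is kept.
    print(sink_anomaly_gate([0.7, 0.6, 0.8, 0.2], [0.3, 0.4, 0.2, 0.8]))
    # Three of four heads prefer retrieval: context is dropped.
    print(sink_anomaly_gate([0.1, 0.2, 0.3, 0.9], [0.9, 0.8, 0.7, 0.1]))
```

Because the gate is a single comparison per layer and block, it adds little cost on top of the attention computation.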

Ablation confirms the necessity of the anomaly gate: disabling it drops dynamic degree from 63.52 to 37.98 at 50s despite preserved framewise metrics.

4. Experimental Protocols and Quantitative Outcomes in Video Models

DySink is trained in two stages: short-horizon distillation via Distribution Matching Distillation (DMD) from a causal generator, followed by long-horizon LoRA fine-tuning enabling full retrieval and gating behavior (Ye et al., 20 May 2026). Benchmarks use VBench and MovieGen for multi-pronged assessment: text alignment, temporal quality, dynamic degree.

Key improvements:

At 50s: Dynamic Degree increases from 42.40 (LongLive static sinks) to 63.52 (+21.12), Temporal Quality rises from 89.29 to 91.63.
At 75–100s: Dynamic Degree remains higher for DySink than for static sinks, with higher temporal stability.
Benefits persist beyond the fine-tuned 60s horizon, demonstrating robust generalization.

Table summarizing key metrics at 50s:

Model	Dynamic Degree	Temporal Quality
LongLive (static)	42.40	89.29
DySink	63.52	91.63

5. MemSinks for Memorization Isolation in LLMs

In LLM training, "Memorization Sinks" (MemSinks) refer to a subnetwork design in which a fixed pool of neurons per transformer MLP layer is allocated for sequence-specific memory (Ghosal et al., 14 Jul 2025). During pretraining, each sequence is tagged with an integer sequence ID. The hidden dimension of each MLP layer is split into a shared component and a sink-pool component. A deterministic binary mask, sampled as a function of the sequence ID, is applied to the sink pool in each layer, activating only a unique set of sink neurons for each sequence.

Under this regime:

Repeated passages trigger memorization to accumulate in the associated sink neurons.
At inference, zeroing out these neurons (i.e., dropping the corresponding subset by mask) erases the memorized sequence from the model, mitigating both privacy and copyright concerns.
The primary method of isolation is sequence-tied dropout; no explicit memorization regularizer is required.
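Sequence-tied dropout can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' code: the mask is derived by seeding a pseudo-random generator with the sequence ID, `keep_prob` is a hypothetical parameter, and a Python list stands in for an MLP hidden activation.

```python
import random


def sink_mask(sequence_id, sink_size, keep_prob=0.5):
    """Deterministic 0/1 mask over the sink-pool neurons of one sequence."""
    # Seeding with the ID means a repeated sequence always gets the same mask.
    rng = random.Random(sequence_id)
    return [1 if rng.random() < keep_prob else 0 for _ in range(sink_size)]


def apply_mlp_mask(hidden, shared_size, sequence_id=None, keep_prob=0.5):
    """Mask one hidden activation vector.

    Shared neurons always pass through. Sink neurons pass only where the
    sequence's mask is 1. With sequence_id=None (the inference case in
    this sketch), every sink neuron is zeroed.
    """
    shared = hidden[:shared_size]
    sink = hidden[shared_size:]
    if sequence_id is None:
        mask = [0] * len(sink)
    else:
        mask = sink_mask(sequence_id, len(sink), keep_prob)
    return shared + [h * m for h, m in zip(sink, mask)]
```

Because a repeated passage always activates the same sink neurons, its sequence-specific signal has a consistent place to accumulate, while the shared neurons see every sequence.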

6. Learning, Forgetting Dynamics, and Empirical Results in MemSinks-LLM

Standard training under entangled representations induces irreversible mixing of memorization and generalization; post-hoc removal of memorized material damages general performance. The MemSinks mechanism enforces separation: the sink blocks accumulate sequence-specific logits while the shared block maintains generalization, as formalized in Theorem H.3 of Ghosal et al. (14 Jul 2025).

Quantitative highlights:

For TinyStories (344M), after dropping MemSink neurons, the cross-entropy loss on repeated docs rises from ≈0.1 to ≈0.95 (near “never-seen” baseline), while held-out val loss shift is <0.02.
At billion-token scale (SmolLM 360M), dropping the sink neurons roughly doubles the loss on memorized sequences (from 1.2 to 2.3), with validation perplexity nearly unchanged (4.05 vs 4.2).

Empirical ablation shows standard gradient-based forgetting cannot disentangle memory from generic language ability; MemSinks achieves selective erasure without collateral degradation.

7. MemSinks in MEMS Ultrasonic Density Sensing

In MEMS-based fluid-density sensors, "MemSinks" designates a paired structure of AlN-on-SOI PMUTs (Piezoelectric Micromachined Ultrasonic Transducers) operating in a transmit–receive modality (Fu et al., 3 Jan 2026). One PMUT transmits a continuous-wave (CW) ultrasonic signal into the liquid; the wave reflects at the liquid–air interface and is detected by the receiver PMUT. The receiver output amplitude is directly modeled as a function of fluid density $\rho$ via the "added mass" principle, with negligible dependence on viscosity for typical operation. Three frequency intervals are used to map amplitude to density monotonically over 0–100% glycerol concentration without requiring frequency sweeps.
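Because the amplitude-to-density relation is monotonic within a calibrated interval, a readout can in principle be inverted by table lookup. The sketch below assumes piecewise-linear interpolation between calibration points, which is a generic technique and not necessarily the calibration used in the cited work. The function name is hypothetical, and the amplitude values in the usage example are made-up placeholders in arbitrary units, not measured data.

```python
from bisect import bisect_left


def density_from_amplitude(amplitude, calibration):
    """Piecewise-linear lookup of density from receiver amplitude.

    calibration: list of (amplitude, density) pairs that is monotonic,
    so each amplitude corresponds to exactly one density.
    """
    pts = sorted(calibration)
    amps = [a for a, _ in pts]
    if not amps[0] <= amplitude <= amps[-1]:
        raise ValueError("amplitude outside calibrated range")
    i = bisect_left(amps, amplitude)
    if amps[i] == amplitude:
        return pts[i][1]
    (a0, d0), (a1, d1) = pts[i - 1], pts[i]
    return d0 + (d1 - d0) * (amplitude - a0) / (a1 - a0)


if __name__ == "__main__":
    # Placeholder calibration: amplitudes are invented for illustration.
    table = [(0.20, 1.261), (0.35, 0.998), (0.50, 0.789)]
    print(density_from_amplitude(0.275, table))
```

A fixed-frequency amplitude reading followed by a lookup of this kind avoids the frequency sweep that resonance-shift densitometers need.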

Performance summary:

Measurable density range: 0.789–1.261 g/cm³
Error rate: <0.125% in low-viscosity fluids after cross-liquid calibration; <2.5% in high-viscosity media (80–100% glycerol), or one-fifth the error of frequency-based densitometers.
Sensing is rapid (sub-millisecond), uses microliter volumes, and is unaffected by viscosity variations of up to 11× between different solutes.

The MemSinks device thus exemplifies robust, amplitude-driven, viscosity-independent MEMS density measurements suitable for embedded and miniaturized applications.

8. Comparative Advantages, Limitations, and Prospective Developments

Across domains, MemSink paradigms share the core function of anchoring, isolating, or robustly decoupling system memory for accuracy, safety, or efficiency:

In video modeling, dynamic, relevance-based sinks with anomaly gating decisively outperform static anchors, enabling sustained temporal dynamics and scene evolution (Ye et al., 20 May 2026).
In LLMs, MemSinks enable selective, post-hoc unlearning of memorized sequences without impairing distributional generalization, provided consistent sequence IDs can be maintained (Ghosal et al., 14 Jul 2025).
In MEMS sensing, the amplitude-only, frequency-fixed MemSinks architecture achieves high accuracy and range with minimal power, volume, and complexity, outperforming resonance-shift approaches in viscous media (Fu et al., 3 Jan 2026).

Noted limitations include dependence on metadata for MemSink assignment in LLMs, possible capacity tradeoffs in neuron allocation, scaling untested beyond billions of tokens/parameters, and, for sensor devices, mode limitations set by acoustic geometry and damping.

Future research directions encompass fine-grained or multi-level sink design (per topic, author, or subsegment), adversarially robust sequence isolation, and broader application of sink-block paradigms for editable or differential memory in neural architectures. In MEMS applications, further miniaturization, integration with CMOS readout, and expansion to multi-solute discrimination present logical extensions.


References (3)

Ye et al. (2026). DySink: Dynamic Frame Sinks for Autoregressive Long Video Generation.

Ghosal et al. (2025). Memorization Sinks: Isolating Memorization during LLM Training.

Fu et al. (2026). A Liquid Density Sensor Based On AlN Piezoelectric Micromachined Ultrasonic Transmitter Insensitive to Liquid Viscosity.
