Multi-Scale Frequency Memory
- Multi-scale Frequency Memory (MFM) is a neural architecture that employs frequency-domain encoding to represent information over varied temporal scales.
- It integrates parallel memory modules that update at different rates using biologically inspired heuristics to balance short-term detail retention with long-term summarization.
- Empirical evidence in sequence modeling and vision tasks demonstrates that MFM improves efficiency and accuracy compared to conventional memory approaches.
Multi-scale Frequency Memory (MFM) is an architectural principle and memory module design that enables neural models to capture, represent, and retrieve information across a broad range of temporal frequencies and timescales. By leveraging frequency-domain encoding, modular updates at multiple rates, and biologically inspired information-selection heuristics, MFM architectures aim to maintain high-fidelity short-term memory while robustly consolidating salient details and long-term context. MFM has been successfully applied both in sequence modeling (notably, recurrent neural networks) and in streaming visual perception with multimodal LLMs, offering a general framework for efficient long-horizon sequence understanding (Li et al., 2 Feb 2026; Carta et al., 2020).
1. Theoretical Motivation and Biological Inspiration
Multi-scale Frequency Memory draws inspiration from psychophysical observations, particularly the Weber–Fechner Law, which posits that perceived information density for an event $t$ seconds in the past decays as $\rho(t) \propto \alpha/(t+\epsilon)$, so that cumulative resolution over the past grows logarithmically, for scaling constant $\alpha$ and small $\epsilon$ to avoid a singularity at $t=0$ (Li et al., 2 Feb 2026). This suggests a memory structure where high fidelity is retained for recent inputs, while increasingly compressed, "gist-like" representations are maintained for the distant past.
Neural architectures face similar constraints: time-domain memory that treats all past states equally either incurs excessive computational cost or must periodically truncate context, leading to irreversible loss. Frequency-domain representations provide a mathematically grounded approach for summarization, in which low-frequency components encode long-term structure (scene/motion "gist") and high-frequency bands encode sharp, transient changes. This decomposition reflects the brain’s strategy of memory consolidation—episodic details fade while semantic contours persist.
2. Core Architectural Principles
MFM modules implement memory using a set of parallel, hierarchically structured memory tracks or modules, each operating at a distinct temporal frequency. In RNNs, this manifests as a collection of sub-memories $m^{(1)}, \dots, m^{(g)}$, each sampled and updated at exponentially increasing intervals $T_k = 2^{k-1}$. The MFM state at time $t$ is the concatenation $m_t = [m_t^{(1)}; \dots; m_t^{(g)}]$ (Carta et al., 2020). Each module updates only when $t \bmod T_k = 0$ and otherwise retains its previous value, guaranteeing that faster modules (short timescales) have higher update rates while slower modules robustly store long-range summaries.
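This multi-rate update schedule can be sketched in a few lines. The leaky-integrator mixing and the class and parameter names below are illustrative assumptions, not the paper's exact recurrence; only the exponential update periods and the frozen-between-updates behavior follow the text.

```python
import numpy as np

class MultiScaleMemory:
    """Parallel memory tracks updated at exponentially spaced intervals.

    Track k refreshes only when t % 2**k == 0, so track 0 follows the
    input closely while higher tracks hold slowly changing summaries.
    """

    def __init__(self, n_tracks, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.periods = [2 ** k for k in range(n_tracks)]    # 1, 2, 4, ...
        self.tracks = [np.zeros(dim) for _ in range(n_tracks)]
        # simple per-track leaky-integrator mixing (an assumption,
        # standing in for the learned recurrence)
        self.mix = rng.uniform(0.3, 0.7, size=n_tracks)

    def step(self, t, h):
        for k, T in enumerate(self.periods):
            if t % T == 0:                                  # update gate
                a = self.mix[k]
                self.tracks[k] = a * self.tracks[k] + (1 - a) * h
            # otherwise the track retains its previous value
        return np.concatenate(self.tracks)                  # MFM state m_t
```

Stepping this memory over a sequence shows the slower tracks changing only at multiples of their periods while the concatenated state keeps a fixed size.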
In vision applications, such as FreshMem, MFM operates by projecting overflowing (evicted) video frames onto a basis of frequency coefficients via the Discrete Fourier Transform (DFT), yielding a compact and expressive global summary. The module manages memory using:
- Frequency coefficient buffer: Stores a rolling set of DFT coefficients, with update and selection guided by energy or log-uniform (Weber–Fechner) spacings.
- Residual buffer: Retains only "salient" frames, as identified by a proxy such as the $\ell_2$-norm, analogous to amygdala-inspired selection of highly informative events.
- Reconstruction and reading: Past context is reconstituted during retrieval as $\hat{x}_t = \sum_k c_k e^{i\omega_k t} + r_t$, where $c_k$ are the frequency coefficients at frequencies $\omega_k$ and $r_t$ denotes the sparse residuals (Li et al., 2 Feb 2026).
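The three buffers above can be sketched together as follows. This is a minimal illustration: the energy-based coefficient selection and the reconstruction-error saliency proxy are assumptions standing in for the paper's exact selection rules, and the function names (`summarize`, `read_memory`) are invented for the example.

```python
import numpy as np

def summarize(frames, n_coeffs, n_residuals):
    """Compress a (T, D) frame sequence into a few temporal DFT
    coefficients plus a small set of high-saliency residual frames."""
    spec = np.fft.fft(frames, axis=0)              # temporal DFT per feature
    energy = np.abs(spec).sum(axis=1)              # per-bin energy
    keep = np.argsort(energy)[-n_coeffs:]          # highest-energy bins
    coeffs = {k: spec[k] for k in keep}

    # gist reconstruction from the kept coefficients only
    sparse = np.zeros_like(spec)
    for k, c in coeffs.items():
        sparse[k] = c
    recon = np.fft.ifft(sparse, axis=0).real

    # residuals: frames the frequency summary explains worst (l2 proxy)
    err = np.linalg.norm(frames - recon, axis=1)
    top = np.argsort(err)[-n_residuals:]
    residuals = {t: frames[t] for t in top}
    return coeffs, residuals, recon

def read_memory(coeffs, residuals, recon):
    """Retrieval: frequency gist, overwritten by exact residual frames."""
    out = recon.copy()
    for t, frame in residuals.items():
        out[t] = frame
    return out
```

With all coefficients kept the reconstruction is exact; with only a few, the residual buffer restores the sharpest frames exactly while the rest remain a blurred gist.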
3. Frequency-Domain Operations and Memory Updates
Across both settings, MFM memory tracks are constructed and updated incrementally:
- Incremental update (vision frequency memory): On each overflow event (e.g., when a frame $x_t$ leaves the short-term buffer), update each of the $K$ frequency coefficients using $c_k \leftarrow \gamma\, c_k + x_t e^{-i\omega_k t}$, where $\omega_k$ are log-uniformly spaced frequencies and $\gamma < 1$ enforces memory decay.
- Multi-scale RNN modules: Each module’s hidden state is updated using both current hidden activation and feedback from slower modules, with block upper-triangular interconnections ensuring lower-frequency modules summarize longer dependencies (Carta et al., 2020).
- Residual selection: For vision, top-$K$ salient frames are extracted as residuals, with $r_t = x_t$ if $t$ falls in the TopK set ranked by $\ell_2$-norm saliency. This hybrid approach balances fidelity and compression.
- Coefficient selection: $K$ frequencies are chosen to maximize reconstruction fidelity at minimal memory cost. Empirical findings indicate that a modest coefficient budget retains the bulk of the signal energy, and that inclusion of the top-10% residuals recovers critical semantic details (Li et al., 2 Feb 2026).
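The decayed incremental update above can be written directly. The band-spacing endpoints and the decay value $\gamma$ below are illustrative assumptions; the phase convention mirrors a standard decayed incremental DFT rather than a verified transcription of the paper.

```python
import numpy as np

def make_bands(n_bands, w_min=0.01, w_max=np.pi):
    """Log-uniformly spaced angular frequencies (Weber-Fechner-style
    spacing: dense near zero, sparse at high frequencies)."""
    return np.logspace(np.log10(w_min), np.log10(w_max), n_bands)

def update_coeffs(coeffs, x, t, omegas, gamma=0.99):
    """One overflow event: fold frame x_t into every coefficient.

    coeffs : (K, D) complex array of running frequency coefficients
    x      : (D,) evicted frame/feature vector
    Implements c_k <- gamma * c_k + x_t * exp(-i * omega_k * t).
    """
    phases = np.exp(-1j * omegas * t)              # (K,) complex phases
    return gamma * coeffs + phases[:, None] * x[None, :]
```

At $t = 0$ the phases are all 1, so the first update simply deposits the frame into every band; later frames arrive rotated by their band's phase while $\gamma$ shrinks older contributions.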
4. Integration in Streaming and Sequence Modeling Pipelines
In streaming video understanding, MFM is deployed in tandem with other memory modules such as Space Thumbnail Memory (STM). The canonical pipeline consists of:
- Sliding-window buffer: Maintains most recent short-term context.
- Multi-scale Frequency Memory: Receives evicted frames/vectors and projects them into frequency coefficients and residuals.
- STM: Provides episode-level compression orthogonal to frequency-based summary.
- Aggregation and querying: On LLM queries, the system concatenates all memory banks and projects them to a fixed token length for downstream processing (Li et al., 2 Feb 2026).
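A toy end-to-end version of this pipeline (sliding window, eviction into the frequency memory, fixed-budget query) might look like the following. The class name, the random projection standing in for the learned token projection, and the decay/band constants are all assumptions for illustration.

```python
import numpy as np

class StreamingMemory:
    """Sliding window that evicts old frames into a decayed frequency
    memory; queries concatenate window + frequency summary and project
    them to a fixed token budget."""

    def __init__(self, window, n_bands, dim, tokens, seed=0):
        rng = np.random.default_rng(seed)
        self.window, self.dim = window, dim
        self.buf = []                                   # short-term frames
        self.omegas = np.logspace(-2, np.log10(np.pi), n_bands)
        self.coeffs = np.zeros((n_bands, dim), dtype=complex)
        self.t = 0
        # fixed random projection (assumption; a learned projector in practice)
        self.proj = rng.standard_normal((tokens, window + n_bands))
        self.proj /= np.sqrt(window + n_bands)

    def push(self, frame):
        self.buf.append(frame)
        if len(self.buf) > self.window:                 # overflow event
            old = self.buf.pop(0)
            phase = np.exp(-1j * self.omegas * self.t)
            self.coeffs = 0.99 * self.coeffs + phase[:, None] * old[None, :]
            self.t += 1

    def query(self):
        # zero-pad the window if the stream is still short
        pad = self.buf + [np.zeros(self.dim)] * (self.window - len(self.buf))
        feats = np.vstack(pad + [self.coeffs.real])     # (window + K, D)
        return self.proj @ feats                        # fixed (tokens, D)
```

However long the stream runs, `query` always returns the same fixed-size token matrix, which is the property the aggregation step relies on.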
For recurrent architectures, MFM modules are added incrementally through a constructive process, with new modules initialized to optimally encode subsampled hidden activations using a linear autoencoder for sequences (LAES), followed by end-to-end stochastic gradient descent optimization (Carta et al., 2020).
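As a rough sketch of this initialization step, a linear autoencoder fit by SVD (equivalent to PCA on the centered activations) can stand in for the LAES procedure; the function name and the transpose-decoder simplification are assumptions, not the paper's exact method.

```python
import numpy as np

def init_module(hidden_states, period, mem_dim):
    """Initialize a new memory track from subsampled hidden activations.

    hidden_states : (T, H) array of hidden activations from training
    period        : the new track's update interval (subsampling rate)
    mem_dim       : dimensionality of the new memory track
    Returns (encoder, decoder) weights of a linear autoencoder.
    """
    H = hidden_states[::period]          # subsample at the track's rate
    H = H - H.mean(axis=0)               # center before the linear fit
    _, _, Vt = np.linalg.svd(H, full_matrices=False)
    enc = Vt[:mem_dim]                   # top principal directions (mem_dim, H)
    return enc, enc.T                    # orthonormal rows: decoder = transpose
```

The resulting weights give the new module a least-squares-optimal linear encoding of the subsampled activations, after which the whole network is fine-tuned end to end with SGD as described above.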
5. Empirical Performance and Hyperparameter Sensitivity
Empirical results demonstrate the effectiveness of MFM modules:
| Benchmark/Task | Baseline | +MFM | Combined (FreshMem, etc.) | Metric |
|---|---|---|---|---|
| OVO-Bench (Video QA) (Li et al., 2 Feb 2026) | Qwen2-VL-7B: 52.19% | MFM only: 53.54% | SW+STM+MFM: 54.53% (+2.34%) | Avg. Acc. (%) |
| OV-Bench | 46.30% | | 50.82% (+4.52%) | Avg. Acc. (%) |
| StreamingBench | 69.00% | | 74.20% (+5.20%) | Avg. Acc. (%) |
| Audio Sequence (Carta et al., 2020) | LSTM: | MFM: 0.000116 | | NMSE |
| TIMIT (Speech, 25k params) | LSTM: % | MFM: % | +pretrain: % | Test Acc. (%) |
| IAM-OnDB (Handwriting) | LSTM: 77.3% | MFM w/ pretrain: 66.8% | | Accuracy (%) |
Hyperparameter analysis identifies:
- An optimal MFM coefficient capacity and STM episode budget (values reported in the ablations).
- Frequency bands spaced log-uniformly in radians.
- A sliding-window length that balances motion retention against noise.
- A residual pool set to the top-10% of frames by $\ell_2$-norm saliency.
- Update complexity of $O(Kd)$ per frame, with $d$ the feature dimension and $K$ the number of coefficients (Li et al., 2 Feb 2026).
6. Qualitative Analyses and Organizational Implications
Qualitative visualizations of MFM-based reconstructions reveal that DFT coefficients alone capture blurred but structurally correct global shapes; the addition of residuals sharpens object contours, enabling the recovery of "key events" from the deep past. In t-SNE projections, STM centroids form compact clusters for semantically contiguous episodes, even with substantial temporal distance.
These findings underscore MFM’s capacity to create a fixed-size memory bank with adaptive information density: recent context is stored with fine detail, while distant history is progressively compressed but not lost. This significantly alleviates the limitations of fixed short-context buffers or hard truncation strategies, and enables efficient querying and memory management in long-horizon streaming applications as well as sequence modeling in RNNs.
7. Significance, Extensions, and Perspectives
MFM provides a principled, biologically motivated solution to long-horizon information retention and retrieval, uniting frequency-domain summarization with adaptive, saliency-aware memory curation (Li et al., 2 Feb 2026, Carta et al., 2020). Its modularity and constructive training paradigm facilitate flexible adaptation across domains, including speech, vision, and online sequential decision processes.
A plausible implication is the extension of frequency-space memory architectures to transformers and other non-recurrent paradigms, as well as the incorporation of learnable frequency bases and more sophisticated saliency proxies for residual selection. The paradigm offers a bridge between raw experiential memory and semantic abstraction, mirroring theorized neural strategies for balancing short-term fidelity and long-term coherence.