Memory-Based Wavelet Convolution

Updated 24 December 2025

Memory-Based Wavelet Convolution is a feature extraction method combining wavelet transforms with memory modules to enhance fine-grained ultrasound video segmentation.
It isolates high-frequency details using cascaded discrete wavelet transforms while leveraging a hybrid memory bank with ConvLSTM and cross-attention for temporal consistency.
The approach improves boundary preservation and small-object tracking, demonstrating significant performance gains on benchmark medical segmentation datasets.

Memory-Based Wavelet Convolution defines a feature extraction and fusion paradigm within deep neural architectures—most notably illustrated by MWNet—for the fine-grained segmentation of medical ultrasound videos. The approach leverages discrete wavelet transforms to isolate and integrate high-frequency (HF) spatial features at multilayer resolutions, while temporal coherence across frames is captured using a hybrid memory bank mechanism combining short-range ConvLSTM and long-range cross-attention. This design systematically addresses the fundamental challenges in longitudinal ultrasound video object segmentation: precise boundary location preservation under low-contrast, speckled conditions, and robust small-object tracking over extended temporal windows (Zhang et al., 17 Dec 2025).

1. Structural Overview of MWNet

MWNet implements a U-Net–inspired encoder–decoder topology composed of three encoder stages, a bottleneck memory bank, and three fusion-decoder stages. Each encoder stage is based on a Memory-Based Wavelet Convolution (MWConv) block, which hierarchically decomposes spatial features via cascaded discrete wavelet transform (WT), then fuses and reconstructs them through pointwise convolutions and inverse WT recursions.

Temporal modeling per encoder scale is realized through a lightweight ConvLSTM, providing short-range inter-frame smoothing. The deepest encoder output is concatenated and processed through a two-tier memory bank at the bottleneck: local ConvLSTM features are fused with compressed global key features, the latter governed by a cross-attention mechanism and a redundancy-controlled update rule. The decoder utilizes cascaded High-Frequency–Aware Feature Fusion (HFF) modules, which upsample and integrate HF content from the matching encoder stage using adaptive wavelet filtering.

A summary table of the pipeline components follows.

Module	Function	Key Mechanism
MWConv	Multiscale spatial detail extraction & fusion	Cascaded WT/IWT, pointwise conv, depthwise conv
Temporal Fusion	Short-range temporal smoothing	ConvLSTM over input clip
Memory Bank (Bottleneck)	Long/short-term global sequence tracking	Cross-attention, similarity-based update
HFF Decoder	HF-preserving upsampling and feature refinement	Adaptive wavelet filters, SE weighting

2. Memory-Based Wavelet Convolution (MWConv)

At encoder stage $l$ , given input features $X^{(l)}\in\mathbb{R}^{C\times H\times W}$ , a 2D one-level DWT generates sub-bands:

$(X^{(l,1)}_{LL}, X^{(l,1)}_{LH}, X^{(l,1)}_{HL}, X^{(l,1)}_{HH}) = \mathrm{WT}(X^{(l)})$

$X^{(l,1)}_H = [X^{(l,1)}_{LH}, X^{(l,1)}_{HL}, X^{(l,1)}_{HH}]$

Subsequent cascaded WT operations on $LL$ produce deeper multiresolution features, controlled so the coarsest $LL$ is $1/4$ input resolution. Convolutions (pointwise or $k\times k$ ) expand the receptive field and reduce channels for both low-frequency (LF) and HF paths:

$X^{(l,k+1)}_{LL^1} = \mathrm{Conv}(X^{(l,k+1)}_{LL}), \quad X^{(l,k+1)}_{H^1} = \mathrm{Conv}(X^{(l,k+1)}_{H})$

Feature merging back to full resolution is performed by recursively applying inverse WT:

$X^{(l,k-1)}_{LL^2} = \mathrm{IWT}\bigl(X^{(l,k)}_{LL^2}, X^{(l,k)}_{H^1}\bigr) + X^{(l,k-1)}_{LL^1}$

with

$X^{(l)}_1 = \mathrm{IWT}\bigl(X^{(l,1)}_{LL^2}, X^{(l,1)}_{H^1}\bigr) + \mathrm{DWConv}(X^{(l)})$

This enables explicit capture and fusion of multiscale HF and LF features and augments the effective receptive field at each encoder stage with minimal additional memory overhead. The “wavelet compression” mechanism thus achieves rigorous spatial feature decomposition and efficient multi-resolution aggregation (Zhang et al., 17 Dec 2025).

3. Temporal and Long-Term Memory Fusion

Each encoder output leverages temporal fusion through ConvLSTM applied to recent feature sequences, updating hidden state $h_{t+1,l}$ :

$(h_{t-k+1,l}, c_{t-k+1,l}) = \mathrm{ConvLSTM}(f_{t-k,l}, h_{t-k,l}, c_{t-k,l}), \quad k=t_0, \ldots, 0$

At the bottleneck, features are aggregated into two memory banks: a short-term queue ( $T_s$ ) and a long-term queue ( $T_\ell$ ), each storing compressed featureframes. Cross-attention is computed as:

$f_\ell = \mathrm{Softmax}(Q_t K_\ell^T/\sqrt{C}) V_\ell$

with $Q_t = W_Q f_{t,4}$ , $K_\ell = W_K M_\ell$ , $V_\ell = W_V M_\ell$ .

This produces a refined bottleneck representation incorporating both recent and salient global context, critical for mitigating drift and error propagation in long-sequence tracking. The long-term memory update leverages pairwise cosine similarity to maintain feature diversity and relevance, discarding newly redundant keys above a predefined threshold.

4. High-Frequency–Aware Feature Fusion in Decoding

Decoder stage $l$ receives a coarser upsampled feature $Z^{(l+1)}$ and the corresponding encoder skip $X^{(l)}$ , first aligning them in spatial size and channels. Adaptive low-pass ( $\mathcal{F}^L$ ) and high-pass ( $\mathcal{F}^H$ ) wavelet filters are separately applied and fused:

$\tilde Z = \mathcal F^{\mathrm{UP}}(\mathrm{Conv}(Z^{(l+1)})), \quad \tilde X = \mathrm{Conv}(X^{(l)})$

$Y^{(l)} = \mathcal F^L(\tilde Z) + \mathcal F^H(\tilde X)$

$Z_l = \mathrm{Conv}(\mathcal F^L(Y^{(l)})) + \mathrm{Conv}(\mathcal F^H(Y^{(l)}))$

The adaptive 2D wavelet filtering is realized via a lifting scheme with learned SE-based sub-band reweighting, designed to favor $LL$ in $\mathcal{F}^L$ and $LH/HL/HH$ in $\mathcal{F}^H$ . The process yields robust HF preservation for boundary structures, markedly improving against speckle-induced degradations and enhancing delineation of small objects.

5. Training Protocols and Dataset Evaluation

MWNet is trained with a composite loss:

$\mathcal{L} = \lambda_\mathrm{Dice}\, \mathcal{L}_\mathrm{Dice}(P, G) + \lambda_\mathrm{BCE}\, \mathcal{L}_\mathrm{BCE}(P, G)$

where

$\mathcal{L}_\mathrm{Dice} = 1 - \frac{2\sum_i P_i G_i + \varepsilon}{\sum_i P_i + \sum_i G_i + \varepsilon}$

$\mathcal{L}_\mathrm{BCE} = -\frac{1}{N} \sum_i \left[G_i \log P_i + (1-G_i)\log(1-P_i)\right]$

and $\lambda_\mathrm{Dice} = \lambda_\mathrm{BCE} = 1$ .

The optimizer is AdamW with learning rate $10^{-4}$ ; PolyLR scheduler and LinearLR warm-up are employed with layer-wise decay for a total of 180,000 iterations (batch = 2). Pretraining uses an ImageNet-1k pretrained WTConv backbone. Data augmentation consists of random blur, flips, and color jitter with the same seed across each 10-frame clip. Input sequences are 10 frames, resized to $512\times512$ .

Datasets span four benchmarks:

Thyroid Nodule (64 videos, 5,611 frames, split 40/10/14)
VTUS (100 videos, 9,342 frames, 7:3 split)
TG3K Thyroid Gland (16 videos, 3,585 frames, 12/2/2 split)
CAMUS Echocardiography (1,000 sequences, 700/100/200 split)

Evaluation metrics include Dice (DSC), IoU, MAE, precision, and recall. Quantitative improvements over state-of-the-art video segmentation methods are observed in all categories. Notably, on the Thyroid Nodule set, MWNet achieves DSC 88.03% (+2.45%), IoU 78.61% (+3.82%), and MAE 0.0101 (−0.0008). On the CAMUS set, it reaches DSC 94.34% (+0.15%) and IoU 89.29% (+0.27%). Qualitative assessment shows sharper and more contiguous lesion boundaries and reduced false positives on speckle artifacts, with stable mask predictions over tens of frames.

6. Significance and Core Innovations

The Memory-Based Wavelet Convolution approach establishes several key methodological contributions:

The wavelet-based encoder (WTConv) integrates explicit HF detail extraction with progressive receptive field enlargement through hierarchical $LL$ decompositions and inverse merges.
The combination of local ConvLSTM and global cross-attention in the memory bank, with redundancy-controlled updates, enables robust small-object tracking through long, unannotated video sequences.
The HFF decoding mechanism, built around adaptive wavelet sub-band fusion and channel attention, yields resilience against boundary noise and surpasses conventional skip or upsampling techniques for HF detail retention.
The foundational premise—wavelets' natural separation of spatial frequencies—underpins targeted feature fusion and boundary preservation, while the hybrid memory model addresses the nontrivial challenges of camera motion and tissue deformation intrinsic to long ultrasound clips.

Empirically, the framework demonstrates superior segmentation accuracy (IoU improvements of 1.5–3.8% over SOTA ConvNet/Transformer variants on thyroid datasets), maintaining near real-time inference speeds on contemporary hardware (~28 ms/frame, RTX-4090) (Zhang et al., 17 Dec 2025).

A plausible implication is that this design’s fusion of frequency-domain analysis and hierarchical memory architectures may extend to other domains with challenging HF structure and temporal dependencies, where object persistence under distortion or occlusion is required.

PDF Markdown Chat (Pro)

References (1)

Tracking spatial temporal details in ultrasound long video via wavelet analysis and memory bank (2025)

Whiteboard

Generate a whiteboard explanation of this topic.

Follow Topic

Get notified by email when new papers are published related to Memory-Based Wavelet Convolution.