Memory Bank Wavelet Filtering and Fusion Network
- The paper introduces MWNet, a network that combines multi-stage wavelet decompositions, ConvLSTM temporal fusion, and a long-short-term memory bank for robust ultrasound video segmentation.
- The architecture replaces standard convolutions with WTConv blocks and adaptive high-frequency filtering to accurately capture spatial details and mitigate speckle noise.
- Experimental results demonstrate significant improvements in Dice and IoU metrics on clinical datasets, underscoring the network’s effectiveness in challenging segmentation tasks.
Memory bank-based wavelet filtering and fusion networks constitute a class of encoder-decoder architectures specifically designed for temporally and spatially resolved segmentation of medical ultrasound videos, with robustness to low contrast and heavy speckle noise. MWNet, a state-of-the-art instantiation, couples cascaded discrete wavelet filtering, memory-based spatial-temporal aggregation, and high-frequency–aware feature fusion, yielding enhanced performance on challenging long video segmentation and object tracking benchmarks in clinical imaging contexts (Zhang et al., 17 Dec 2025).
1. Architectural Foundations
MWNet adopts a U-Net–inspired encoder–decoder structure but systematically replaces standard convolutions in both encoding and decoding paths with wavelet-based modules termed WTConv. Each stage in the encoder leverages WTConv to perform multi-resolution wavelet decompositions, followed by fine-grained spatial convolutions and iterative inverse transforms. Temporal dependencies are further aggregated via ConvLSTM blocks, and a specialized long-short-term memory bank is introduced at the bottleneck for advanced temporal feature reuse and cross-frame consistency. The decoder integrates adaptive wavelet filtering at each upsampling stage to reinforce boundary-sensitive high-frequency (HF) details, critical for accurate segmentation in noisy and low-contrast video sequences.
2. Encoder: Memory-Based Wavelet Convolution and Temporal Aggregation
The encoding path comprises four hierarchical stages (indexed by $s \in \{1, 2, 3, 4\}$), with stage $s$ processing frame-wise feature tensors $F_t^{s} \in \mathbb{R}^{C_s \times H_s \times W_s}$ for each frame $t$.
2.1 WTConv Block
- Multi-Stage Wavelet Decomposition: The input feature map undergoes a cascaded 2D discrete wavelet transform (DWT), yielding multiscale subbands $\{X_{LL}^{(\ell)}, X_{LH}^{(\ell)}, X_{HL}^{(\ell)}, X_{HH}^{(\ell)}\}$ for levels $\ell = 1, \dots, L$. Small-kernel convolutions are applied individually to the low- and high-frequency subbands at each scale, facilitating localized frequency-domain feature extraction.
- Progressive Inverse Reconstruction: Features are iteratively fused via inverse wavelet transforms (IWTs), and skip connections sum the reconstructed subbands with intermediate projections.
- Final Fusion: The last-level outputs are merged via a depthwise convolution; the result becomes the spatial representation exposed to both temporal and memory-based operations. A minimal single-level sketch follows.
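The PyTorch sketch below illustrates the WTConv idea under stated assumptions: a single Haar DWT level (the paper cascades several), 3×3 per-subband kernels, and a plain residual sum standing in for the paper's intermediate skip projections. It is illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def haar_filters(channels: int) -> torch.Tensor:
    """Fixed orthonormal 2x2 Haar analysis filters (LL, LH, HL, HH) per channel."""
    ll = torch.tensor([[0.5, 0.5], [0.5, 0.5]])
    lh = torch.tensor([[0.5, 0.5], [-0.5, -0.5]])
    hl = torch.tensor([[0.5, -0.5], [0.5, -0.5]])
    hh = torch.tensor([[0.5, -0.5], [-0.5, 0.5]])
    bank = torch.stack([ll, lh, hl, hh])              # (4, 2, 2)
    # Depthwise layout: input channel i owns filters 4i..4i+3.
    return bank.repeat(channels, 1, 1).unsqueeze(1)   # (4C, 1, 2, 2)

class WTConvBlock(nn.Module):
    """Single-level WTConv sketch: DWT -> per-subband convs -> IWT -> fuse."""
    def __init__(self, channels: int):
        super().__init__()
        self.c = channels
        self.register_buffer("analysis", haar_filters(channels))
        self.low_conv = nn.Conv2d(channels, channels, 3, padding=1)       # LL path
        self.high_conv = nn.Conv2d(3 * channels, 3 * channels, 3, padding=1)
        # Last-stage merge via depthwise convolution, as in the paper.
        self.fuse = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)

    def dwt(self, x):
        # Stride-2 depthwise conv with Haar filters = one DWT level (H, W even).
        y = F.conv2d(x, self.analysis, stride=2, groups=self.c)
        b, _, h, w = y.shape
        y = y.view(b, self.c, 4, h, w)
        return y[:, :, 0], y[:, :, 1:]                # LL, stacked (LH, HL, HH)

    def iwt(self, low, high):
        b, c, _, h, w = high.shape
        y = torch.cat([low.unsqueeze(2), high], dim=2).reshape(b, 4 * c, h, w)
        # The filter bank is orthogonal, so its transpose inverts the DWT exactly.
        return F.conv_transpose2d(y, self.analysis, stride=2, groups=c)

    def forward(self, x):
        low, high = self.dwt(x)
        b, c, s, h, w = high.shape
        low = self.low_conv(low)
        high = self.high_conv(high.reshape(b, c * s, h, w)).view(b, c, s, h, w)
        # Residual sum stands in for the paper's intermediate skip projections.
        return self.fuse(self.iwt(low, high)) + x
```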
2.2 Temporal Fusion via ConvLSTM
At each encoder stage, a ConvLSTM aggregates features from the current and prior frames over a sliding temporal window (sized to the 10-frame input clips), producing temporally fused features that encode both spatial and local temporal context. This configuration robustly captures rapid motion, e.g., cardiac pulsation, and preserves the spatiotemporal fidelity of object boundaries. A minimal cell is sketched below.
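A minimal ConvLSTM cell and window-fusion loop, assuming the standard four-gate formulation; the hidden width, kernel size, and zero-initialized state are assumptions rather than the paper's settings:

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Minimal ConvLSTM cell with the standard four-gate formulation."""
    def __init__(self, in_ch: int, hid_ch: int, k: int = 3):
        super().__init__()
        self.hid_ch = hid_ch
        # One conv computes all four gates from [input, hidden] jointly.
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = f.sigmoid() * c + i.sigmoid() * g.tanh()   # update cell state
        h = o.sigmoid() * c.tanh()                     # emit hidden state
        return h, c

def fuse_window(cell: ConvLSTMCell, frames: torch.Tensor) -> torch.Tensor:
    """Roll the cell over a (B, T, C, H, W) window; the final hidden state is
    the temporally fused feature for the current frame."""
    b, t_len, _, hgt, wid = frames.shape
    h = frames.new_zeros(b, cell.hid_ch, hgt, wid)
    c = torch.zeros_like(h)
    for t in range(t_len):
        h, c = cell(frames[:, t], (h, c))
    return h
```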
3. Long-Short-Term Memory Bank and Fusion
After the deepest encoder stage, MWNet integrates a composite memory mechanism:
- Short-Term Memory Bank ($\mathcal{M}_S$): A FIFO queue of fixed capacity $N_s$ retains the most recent compressed bottleneck features.
- Long-Term Memory Bank ($\mathcal{M}_L$): A memory pool of capacity $N_l$ houses historical feature vectors. Redundant entries are evicted using pairwise cosine similarity, $\mathrm{sim}(m_i, m_j) = \frac{m_i^{\top} m_j}{\lVert m_i \rVert\,\lVert m_j \rVert}$, culling highly similar states to maintain feature diversity: when $|\mathcal{M}_L| > N_l$, entries whose mean similarity to the remaining bank exceeds a threshold $\tau$ are removed.
- Cross-Attention Reading: Current features $F$ (queries), the concatenated banks $\mathcal{M} = [\mathcal{M}_S; \mathcal{M}_L]$ (keys and values), and learned projections $W_Q, W_K, W_V$ parameterize scaled dot-product cross-attention: $\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\big(QK^{\top}/\sqrt{d}\big)V$, where $Q = F W_Q$, $K = \mathcal{M} W_K$, $V = \mathcal{M} W_V$.
- Feature Fusion: The short-term read, the long-term read, and the current features are projected and concatenated, then passed through a two-layer MLP (an FFN with ReLU activation), yielding a memory-augmented representation that is reintroduced into the decoder path. A compact sketch of the bank's mechanics follows this list.
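The following sketch illustrates the bank's update and read mechanics under stated assumptions: the capacities, similarity threshold, and (N, D) token layout are placeholders, not the paper's values.

```python
from collections import deque

import torch
import torch.nn.functional as F

class LongShortMemoryBank:
    """Illustrative long/short-term memory bank."""
    def __init__(self, short_cap: int = 4, long_cap: int = 16,
                 sim_thresh: float = 0.9):
        self.short = deque(maxlen=short_cap)  # FIFO; oldest evicted when full
        self.long: list[torch.Tensor] = []    # similarity-pruned history
        self.long_cap = long_cap
        self.sim_thresh = sim_thresh

    def write(self, feat: torch.Tensor) -> None:   # feat: (N, D) frame tokens
        self.short.append(feat)
        self.long.append(feat)
        if len(self.long) > self.long_cap:
            self._prune()

    def _prune(self) -> None:
        # Mean pairwise cosine similarity per entry; evict the most redundant
        # one if it exceeds the threshold, otherwise fall back to FIFO.
        keys = F.normalize(torch.stack([f.mean(0) for f in self.long]), dim=-1)
        sim = keys @ keys.T
        sim.fill_diagonal_(0.0)
        mean_sim = sim.mean(dim=1)
        idx = int(mean_sim.argmax())
        if mean_sim[idx] < self.sim_thresh:
            idx = 0
        del self.long[idx]

    def read(self, query: torch.Tensor, w_q: torch.Tensor,
             w_k: torch.Tensor, w_v: torch.Tensor) -> torch.Tensor:
        # Scaled dot-product cross-attention of current features (queries)
        # against the concatenated short- and long-term banks (keys/values).
        bank = torch.cat(list(self.short) + self.long, dim=0)  # (M, D)
        q, k, v = query @ w_q, bank @ w_k, bank @ w_v
        attn = torch.softmax(q @ k.T / k.shape[-1] ** 0.5, dim=-1)
        return attn @ v                                        # (N, D)
```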
4. Decoder and High-Frequency-Aware Feature Fusion
Each decoder stage receives the upsampled output of the previous layer and the spatially aligned encoder skip connection. The HF-aware feature fusion (HFF) module comprises:
- Adaptive Wavelet Filters: Channel projection and bilinear upsampling precede the application of adaptive low-pass (AWLF) and high-pass (AWHF) filters, implemented via the lifting scheme. For a feature map $X$, the decomposition $X \mapsto \{X_{LL}, X_{LH}, X_{HL}, X_{HH}\}$ is followed by squeeze-and-excitation channel attention, emphasizing the subbands that match the targeted spatial frequency (LL for AWLF; LH/HL/HH for AWHF).
- Fusion & Refinement: The filtered skip features are fused with the upsampled decoder features and refined by convolution, enabling the decoder to recover complex high-frequency boundaries and fine anatomy (an illustrative sketch follows this list).
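A hypothetical sketch of the HFF idea: a Haar lifting split into subbands, SE reweighting of the high-frequency bands, and convolutional fusion with the decoder stream. The module layout, channel counts, and fusion order are assumptions.

```python
import torch
import torch.nn as nn

def haar_lift(x: torch.Tensor, dim: int):
    """One Haar lifting step along `dim`: split -> predict -> update."""
    even = x.index_select(dim, torch.arange(0, x.shape[dim], 2, device=x.device))
    odd = x.index_select(dim, torch.arange(1, x.shape[dim], 2, device=x.device))
    detail = odd - even               # predict: high-pass residual
    approx = even + detail / 2        # update: low-pass average
    return approx, detail

class SEGate(nn.Module):
    """Squeeze-and-excitation channel attention over wavelet subbands."""
    def __init__(self, ch: int, r: int = 4):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(ch, ch // r), nn.ReLU(inplace=True),
                                nn.Linear(ch // r, ch), nn.Sigmoid())

    def forward(self, x):             # x: (B, C, H, W)
        return x * self.fc(x.mean(dim=(2, 3)))[:, :, None, None]

class HFFSketch(nn.Module):
    """Lifting-based subband split, SE reweighting of HF bands, and
    convolutional fusion with the decoder stream."""
    def __init__(self, ch: int):
        super().__init__()
        self.se_high = SEGate(3 * ch)
        self.refine = nn.Conv2d(2 * ch, ch, 3, padding=1)
        self.up = nn.Upsample(scale_factor=2, mode="bilinear",
                              align_corners=False)

    def forward(self, skip, dec):     # skip: (B,C,2H,2W), dec: (B,C,H,W)
        lo, hi = haar_lift(skip, dim=2)                 # split rows
        ll, lh = haar_lift(lo, dim=3)                   # split cols (low)
        hl, hh = haar_lift(hi, dim=3)                   # split cols (high)
        bands = self.se_high(torch.cat([lh, hl, hh], dim=1))
        lh, hl, hh = torch.chunk(bands, 3, dim=1)       # reweighted HF bands
        enhanced = ll + lh + hl + hh                    # reinject HF into LL
        return self.up(self.refine(torch.cat([enhanced, dec], dim=1)))
```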
A final convolution projects the fused features to a full-resolution segmentation mask, with a sigmoid activation yielding per-pixel probabilities.
5. Mathematical Principles and Loss Functions
The network is grounded in multiscale wavelet theory and temporal memory mechanisms:
- Wavelet Decomposition: Employs classical DWT and IWT following Mallat’s filter-bank formalism for multiresolution analysis.
- Memory Update Rules: As above, the short-term memory bank functions as a FIFO buffer, while the long-term memory applies redundancy pruning by similarity thresholding.
- Loss Functions: Training uses a balanced sum of Dice and binary cross-entropy losses, $\mathcal{L} = \lambda_1 \mathcal{L}_{\mathrm{Dice}} + \lambda_2 \mathcal{L}_{\mathrm{BCE}}$, with $\lambda_1$ and $\lambda_2$ weighting the two terms (a reference sketch follows this list).
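A minimal implementation of this combined objective; the unit weights are placeholders for the paper's balancing coefficients:

```python
import torch
import torch.nn.functional as F

def dice_bce_loss(logits: torch.Tensor, target: torch.Tensor,
                  w_dice: float = 1.0, w_bce: float = 1.0,
                  eps: float = 1e-6) -> torch.Tensor:
    """Weighted Dice + BCE over (B, 1, H, W) logits and float binary targets.
    The weights are placeholders for the paper's balancing coefficients."""
    prob = torch.sigmoid(logits)
    inter = (prob * target).sum(dim=(1, 2, 3))
    denom = prob.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    dice = 1.0 - (2.0 * inter + eps) / (denom + eps)
    bce = F.binary_cross_entropy_with_logits(
        logits, target, reduction="none").mean(dim=(1, 2, 3))
    return (w_dice * dice + w_bce * bce).mean()
```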
6. Implementation and Training Protocols
Key implementation specifics:
- The WTConv encoder is pretrained on ImageNet-1k for 300 epochs.
- Each input comprises a 10-frame stack, with every frame resized to a common fixed resolution.
- Temporal fusion is performed via ConvLSTM over the preceding frames of the input window.
- Long- and short-term memory bank capacities ($N_l$, $N_s$) are fixed hyperparameters.
- Optimizer: AdamW with a PolyLR schedule, linear warmup (1,000 iterations), and layer-wise learning-rate decay of 0.9 (see the configuration sketch after this list).
- Batch size: 2, 180,000 iterations, single RTX 4090; inference speed: ~28 ms/frame.
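A sketch of this optimizer and schedule setup. The base learning rate, weight decay, and poly power are placeholders (the published values are not reproduced here), and the layer-wise learning-rate decay is omitted for brevity; the 1,000-iteration warmup and 180,000-iteration budget follow the protocol above.

```python
import torch

def build_optimizer(model: torch.nn.Module,
                    base_lr: float = 1e-4,       # placeholder value
                    weight_decay: float = 1e-2,  # placeholder value
                    warmup_iters: int = 1_000,
                    total_iters: int = 180_000,
                    power: float = 0.9):         # poly power; placeholder
    """AdamW with a poly decay schedule and linear warmup. Layer-wise LR
    decay (0.9 per layer, per the protocol above) is omitted for brevity."""
    opt = torch.optim.AdamW(model.parameters(), lr=base_lr,
                            weight_decay=weight_decay)

    def poly_with_warmup(it: int) -> float:
        if it < warmup_iters:
            return (it + 1) / warmup_iters                        # linear warmup
        frac = (it - warmup_iters) / max(1, total_iters - warmup_iters)
        return (1.0 - frac) ** power                              # poly decay

    sched = torch.optim.lr_scheduler.LambdaLR(opt, poly_with_warmup)
    return opt, sched  # call sched.step() once per training iteration
```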
7. Experimental Evaluation
7.1 Quantitative Results
MWNet establishes superior performance on four ultrasound datasets (thyroid nodule, VTUS, TG3K, CAMUS echocardiography). Key metrics include Dice, IoU, and MAE. Representative results:
| Dataset | Dice (%) | IoU (%) | MAE | Params (M) | FPS |
|---|---|---|---|---|---|
| Thyroid Nodule | 88.03 | 78.61 | 0.0101 | 72.99 | 28 |
| VTUS | 87.68 | 78.06 | 0.0208 | — | — |
| TG3K | 87.90 | 78.41 | 0.0258 | — | — |
| CAMUS (Echo) | 94.34 | 89.29 | 0.0109 | — | — |
Paired t-tests confirm statistically significant IoU improvements over state-of-the-art comparators.
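A minimal illustration of this significance test, assuming matched per-case IoU lists for MWNet and a comparator (the function name and sample values are hypothetical):

```python
from scipy.stats import ttest_rel

def paired_iou_test(iou_ours, iou_baseline):
    """Paired t-test over matched per-case IoU scores from two models."""
    result = ttest_rel(iou_ours, iou_baseline)
    return result.statistic, result.pvalue

# e.g., paired_iou_test([0.79, 0.81, 0.76], [0.74, 0.78, 0.73])
```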
7.2 Ablation Study
An incremental module-insertion ablation is performed:
| WTConv | Memory | HFF | Dice (%) | IoU (%) |
|---|---|---|---|---|
| ✗ | ✗ | ✗ | 82.63 | 70.40 |
| ✓ | ✗ | ✗ | 84.46 | 73.10 |
| ✓ | ✓ | ✗ | 87.21 | 77.32 |
| ✓ | ✓ | ✓ | 88.03 | 78.61 |
Each architectural innovation confers a measurable gain over the 82.63% Dice baseline: WTConv adds +1.83 Dice (+2.70 IoU), the memory bank a further +2.75 Dice (+4.22 IoU), and HFF a final +0.82 Dice (+1.29 IoU).
7.3 Qualitative and Robustness Results
Visualization demonstrates improved boundary adherence and speckle denoising, with the long-term memory bank reducing inter-frame segmentation jitter and enhancing tracking stability over sequences of up to 60 frames. Performance is stable across a range of sequence lengths and memory bank sizes, peaking at an intermediate bank capacity.
8. Significance and Implications
The memory bank-based wavelet filtering and fusion framework leverages multiscale frequency-domain decomposition, distributed temporal memory, and adaptive HF fusion to address key segmentation challenges in ultrasonography: low contrast, speckle-dominated backgrounds, and small-object tracking over extended video intervals. MWNet demonstrates robust tracking and boundary precision exceeding existing video and image SOTA architectures, particularly under the demands of small lesion segmentation and persistent object identification in lengthy medical sequences (Zhang et al., 17 Dec 2025). A plausible implication is the suitability of this design in broader video segmentation domains that require explicit frequency- and memory-aware mechanisms.