Memory Bank Wavelet Filtering and Fusion Network
- The paper introduces MWNet, a network that combines multi-stage wavelet decompositions, ConvLSTM temporal fusion, and a long-short-term memory bank for robust ultrasound video segmentation.
- The architecture replaces standard convolutions with WTConv blocks and adaptive high-frequency filtering to accurately capture spatial details and mitigate speckle noise.
- Experimental results demonstrate significant improvements in Dice and IoU metrics on clinical datasets, underscoring the network’s effectiveness in challenging segmentation tasks.
Memory bank-based wavelet filtering and fusion networks constitute a class of encoder-decoder architectures specifically designed for temporally and spatially resolved segmentation of medical ultrasound videos, with robustness to low contrast and heavy speckle noise. MWNet, a state-of-the-art instantiation, couples cascaded discrete wavelet filtering, memory-based spatial-temporal aggregation, and high-frequency–aware feature fusion, yielding enhanced performance on challenging long video segmentation and object tracking benchmarks in clinical imaging contexts (Zhang et al., 17 Dec 2025).
1. Architectural Foundations
MWNet adopts a U-Net–inspired encoder–decoder structure but systematically replaces standard convolutions in both encoding and decoding paths with wavelet-based modules termed WTConv. Each stage in the encoder leverages WTConv to perform multi-resolution wavelet decompositions, followed by fine-grained spatial convolutions and iterative inverse transforms. Temporal dependencies are further aggregated via ConvLSTM blocks, and a specialized long-short-term memory bank is introduced at the bottleneck for advanced temporal feature reuse and cross-frame consistency. The decoder integrates adaptive wavelet filtering at each upsampling stage to reinforce boundary-sensitive high-frequency (HF) details, critical for accurate segmentation in noisy and low-contrast video sequences.
2. Encoder: Memory-Based Wavelet Convolution and Temporal Aggregation
The encoding path comprises four hierarchical stages (indexed by $s \in \{1, 2, 3, 4\}$), with stage $s$ processing frame-wise feature tensors $F_t^{s} \in \mathbb{R}^{C_s \times H_s \times W_s}$ for each frame $t$.
2.1 WTConv Block
- Multi-Stage Wavelet Decomposition: The input feature map undergoes a cascaded 2D discrete wavelet transform (DWT), yielding multiscale subbands $\{X_{LL}^{(\ell)}, X_{LH}^{(\ell)}, X_{HL}^{(\ell)}, X_{HH}^{(\ell)}\}$ for levels $\ell = 1, \dots, L$. Small-kernel convolutions are applied individually to the low- and high-frequency subbands at each scale, facilitating localized frequency-domain feature extraction.
- Progressive Inverse Reconstruction: Features are iteratively fused via inverse wavelet transforms (IWTs), and skip connections sum the reconstructed subbands with intermediate projections.
- Final Fusion: The last-level outputs are merged via a depthwise convolution; the result becomes the spatial representation exposed to both temporal and memory-based operations. A minimal single-level sketch follows.
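The PyTorch sketch below illustrates the WTConv idea under stated assumptions: a single Haar DWT level (the paper cascades several), 3×3 per-subband kernels, and a plain residual sum standing in for the paper's intermediate skip projections. It is illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def haar_filters(channels: int) -> torch.Tensor:
    """Fixed orthonormal 2x2 Haar analysis filters (LL, LH, HL, HH) per channel."""
    ll = torch.tensor([[0.5, 0.5], [0.5, 0.5]])
    lh = torch.tensor([[0.5, 0.5], [-0.5, -0.5]])
    hl = torch.tensor([[0.5, -0.5], [0.5, -0.5]])
    hh = torch.tensor([[0.5, -0.5], [-0.5, 0.5]])
    bank = torch.stack([ll, lh, hl, hh])              # (4, 2, 2)
    # Depthwise layout: input channel i owns filters 4i..4i+3.
    return bank.repeat(channels, 1, 1).unsqueeze(1)   # (4C, 1, 2, 2)

class WTConvBlock(nn.Module):
    """Single-level WTConv sketch: DWT -> per-subband convs -> IWT -> fuse."""
    def __init__(self, channels: int):
        super().__init__()
        self.c = channels
        self.register_buffer("analysis", haar_filters(channels))
        self.low_conv = nn.Conv2d(channels, channels, 3, padding=1)       # LL path
        self.high_conv = nn.Conv2d(3 * channels, 3 * channels, 3, padding=1)
        # Last-stage merge via depthwise convolution, as in the paper.
        self.fuse = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)

    def dwt(self, x):
        # Stride-2 depthwise conv with Haar filters = one DWT level (H, W even).
        y = F.conv2d(x, self.analysis, stride=2, groups=self.c)
        b, _, h, w = y.shape
        y = y.view(b, self.c, 4, h, w)
        return y[:, :, 0], y[:, :, 1:]                # LL, stacked (LH, HL, HH)

    def iwt(self, low, high):
        b, c, _, h, w = high.shape
        y = torch.cat([low.unsqueeze(2), high], dim=2).reshape(b, 4 * c, h, w)
        # The filter bank is orthogonal, so its transpose inverts the DWT exactly.
        return F.conv_transpose2d(y, self.analysis, stride=2, groups=c)

    def forward(self, x):
        low, high = self.dwt(x)
        b, c, s, h, w = high.shape
        low = self.low_conv(low)
        high = self.high_conv(high.reshape(b, c * s, h, w)).view(b, c, s, h, w)
        # Residual sum stands in for the paper's intermediate skip projections.
        return self.fuse(self.iwt(low, high)) + x
```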
2.2 Temporal Fusion via ConvLSTM
At each encoder stage, a ConvLSTM aggregates features from the current and prior frames over a sliding temporal window (sized to the 10-frame input clips), producing temporally fused features that encode both spatial and local temporal context. This configuration robustly captures rapid motion, e.g., cardiac pulsation, and preserves the spatiotemporal fidelity of object boundaries. A minimal cell is sketched below.
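A minimal ConvLSTM cell and window-fusion loop, assuming the standard four-gate formulation; the hidden width, kernel size, and zero-initialized state are assumptions rather than the paper's settings:

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Minimal ConvLSTM cell with the standard four-gate formulation."""
    def __init__(self, in_ch: int, hid_ch: int, k: int = 3):
        super().__init__()
        self.hid_ch = hid_ch
        # One conv computes all four gates from [input, hidden] jointly.
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = f.sigmoid() * c + i.sigmoid() * g.tanh()   # update cell state
        h = o.sigmoid() * c.tanh()                     # emit hidden state
        return h, c

def fuse_window(cell: ConvLSTMCell, frames: torch.Tensor) -> torch.Tensor:
    """Roll the cell over a (B, T, C, H, W) window; the final hidden state is
    the temporally fused feature for the current frame."""
    b, t_len, _, hgt, wid = frames.shape
    h = frames.new_zeros(b, cell.hid_ch, hgt, wid)
    c = torch.zeros_like(h)
    for t in range(t_len):
        h, c = cell(frames[:, t], (h, c))
    return h
```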
3. Long-Short-Term Memory Bank and Fusion
After the deepest encoder stage, MWNet integrates a composite memory mechanism:
- Short-Term Memory Bank ($\mathcal{M}_S$): A FIFO queue of fixed capacity $N_s$ retains the most recent compressed bottleneck features.
- Long-Term Memory Bank ($\mathcal{M}_L$): A memory pool of capacity $N_l$ houses historical feature vectors. Redundant entries are evicted using pairwise cosine similarity, $\mathrm{sim}(m_i, m_j) = \frac{m_i^{\top} m_j}{\lVert m_i \rVert\,\lVert m_j \rVert}$, culling highly similar states to maintain feature diversity: when $|\mathcal{M}_L| > N_l$, entries whose mean similarity to the remaining bank exceeds a threshold $\tau$ are removed.
- Cross-Attention Reading: Current features $F$ (queries), the concatenated banks $\mathcal{M} = [\mathcal{M}_S; \mathcal{M}_L]$ (keys and values), and learned projections $W_Q, W_K, W_V$ parameterize scaled dot-product cross-attention: $\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\big(QK^{\top}/\sqrt{d}\big)V$, where $Q = F W_Q$, $K = \mathcal{M} W_K$, $V = \mathcal{M} W_V$.
- Feature Fusion: The short-term read, the long-term read, and the current features are projected and concatenated, then passed through a two-layer MLP (an FFN with ReLU activation), yielding a memory-augmented representation that is reintroduced into the decoder path. A compact sketch of the bank's mechanics follows this list.
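The following sketch illustrates the bank's update and read mechanics under stated assumptions: the capacities, similarity threshold, and (N, D) token layout are placeholders, not the paper's values.

```python
from collections import deque

import torch
import torch.nn.functional as F

class LongShortMemoryBank:
    """Illustrative long/short-term memory bank."""
    def __init__(self, short_cap: int = 4, long_cap: int = 16,
                 sim_thresh: float = 0.9):
        self.short = deque(maxlen=short_cap)  # FIFO; oldest evicted when full
        self.long: list[torch.Tensor] = []    # similarity-pruned history
        self.long_cap = long_cap
        self.sim_thresh = sim_thresh

    def write(self, feat: torch.Tensor) -> None:   # feat: (N, D) frame tokens
        self.short.append(feat)
        self.long.append(feat)
        if len(self.long) > self.long_cap:
            self._prune()

    def _prune(self) -> None:
        # Mean pairwise cosine similarity per entry; evict the most redundant
        # one if it exceeds the threshold, otherwise fall back to FIFO.
        keys = F.normalize(torch.stack([f.mean(0) for f in self.long]), dim=-1)
        sim = keys @ keys.T
        sim.fill_diagonal_(0.0)
        mean_sim = sim.mean(dim=1)
        idx = int(mean_sim.argmax())
        if mean_sim[idx] < self.sim_thresh:
            idx = 0
        del self.long[idx]

    def read(self, query: torch.Tensor, w_q: torch.Tensor,
             w_k: torch.Tensor, w_v: torch.Tensor) -> torch.Tensor:
        # Scaled dot-product cross-attention of current features (queries)
        # against the concatenated short- and long-term banks (keys/values).
        bank = torch.cat(list(self.short) + self.long, dim=0)  # (M, D)
        q, k, v = query @ w_q, bank @ w_k, bank @ w_v
        attn = torch.softmax(q @ k.T / k.shape[-1] ** 0.5, dim=-1)
        return attn @ v                                        # (N, D)
```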
4. Decoder and High-Frequency-Aware Feature Fusion
Each decoder stage receives the upsampled output of the previous layer and the spatially aligned encoder skip connection. The HF-aware feature fusion (HFF) module comprises:
- Adaptive Wavelet Filters: Channel projection and bilinear upsampling precede the application of adaptive low-pass (AWLF) and high-pass (AWHF) filters, implemented via the lifting scheme. For a feature map $X$, the decomposition $X \mapsto \{X_{LL}, X_{LH}, X_{HL}, X_{HH}\}$ is followed by squeeze-and-excitation channel attention, emphasizing the subbands that match the targeted spatial frequency (LL for AWLF; LH/HL/HH for AWHF).
- Fusion & Refinement: The filtered skip features are fused with the upsampled decoder features and refined by convolution, enabling the decoder to recover complex high-frequency boundaries and fine anatomy (an illustrative sketch follows this list).
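A hypothetical sketch of the HFF idea: a Haar lifting split into subbands, SE reweighting of the high-frequency bands, and convolutional fusion with the decoder stream. The module layout, channel counts, and fusion order are assumptions.

```python
import torch
import torch.nn as nn

def haar_lift(x: torch.Tensor, dim: int):
    """One Haar lifting step along `dim`: split -> predict -> update."""
    even = x.index_select(dim, torch.arange(0, x.shape[dim], 2, device=x.device))
    odd = x.index_select(dim, torch.arange(1, x.shape[dim], 2, device=x.device))
    detail = odd - even               # predict: high-pass residual
    approx = even + detail / 2        # update: low-pass average
    return approx, detail

class SEGate(nn.Module):
    """Squeeze-and-excitation channel attention over wavelet subbands."""
    def __init__(self, ch: int, r: int = 4):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(ch, ch // r), nn.ReLU(inplace=True),
                                nn.Linear(ch // r, ch), nn.Sigmoid())

    def forward(self, x):             # x: (B, C, H, W)
        return x * self.fc(x.mean(dim=(2, 3)))[:, :, None, None]

class HFFSketch(nn.Module):
    """Lifting-based subband split, SE reweighting of HF bands, and
    convolutional fusion with the decoder stream."""
    def __init__(self, ch: int):
        super().__init__()
        self.se_high = SEGate(3 * ch)
        self.refine = nn.Conv2d(2 * ch, ch, 3, padding=1)
        self.up = nn.Upsample(scale_factor=2, mode="bilinear",
                              align_corners=False)

    def forward(self, skip, dec):     # skip: (B,C,2H,2W), dec: (B,C,H,W)
        lo, hi = haar_lift(skip, dim=2)                 # split rows
        ll, lh = haar_lift(lo, dim=3)                   # split cols (low)
        hl, hh = haar_lift(hi, dim=3)                   # split cols (high)
        bands = self.se_high(torch.cat([lh, hl, hh], dim=1))
        lh, hl, hh = torch.chunk(bands, 3, dim=1)       # reweighted HF bands
        enhanced = ll + lh + hl + hh                    # reinject HF into LL
        return self.up(self.refine(torch.cat([enhanced, dec], dim=1)))
```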
A final convolution projects the fused features to a full-resolution segmentation mask, with a sigmoid activation yielding per-pixel probabilities.
5. Mathematical Principles and Loss Functions
The network is grounded in multiscale wavelet theory and temporal memory mechanisms:
- Wavelet Decomposition: Employs classical DWT and IWT following Mallat’s filter-bank formalism for multiresolution analysis.
- Memory Update Rules: As above, the short-term memory bank functions as a FIFO buffer, while the long-term memory applies redundancy pruning by similarity thresholding.
- Loss Functions: Training uses a balanced sum of Dice and binary cross-entropy losses, $\mathcal{L} = \lambda_1 \mathcal{L}_{\mathrm{Dice}} + \lambda_2 \mathcal{L}_{\mathrm{BCE}}$, with $\lambda_1$ and $\lambda_2$ weighting the two terms (a reference sketch follows this list).
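A minimal implementation of this combined objective; the unit weights are placeholders for the paper's balancing coefficients:

```python
import torch
import torch.nn.functional as F

def dice_bce_loss(logits: torch.Tensor, target: torch.Tensor,
                  w_dice: float = 1.0, w_bce: float = 1.0,
                  eps: float = 1e-6) -> torch.Tensor:
    """Weighted Dice + BCE over (B, 1, H, W) logits and float binary targets.
    The weights are placeholders for the paper's balancing coefficients."""
    prob = torch.sigmoid(logits)
    inter = (prob * target).sum(dim=(1, 2, 3))
    denom = prob.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    dice = 1.0 - (2.0 * inter + eps) / (denom + eps)
    bce = F.binary_cross_entropy_with_logits(
        logits, target, reduction="none").mean(dim=(1, 2, 3))
    return (w_dice * dice + w_bce * bce).mean()
```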
6. Implementation and Training Protocols
Key implementation specifics:
- The WTConv encoder is pretrained on ImageNet-1k for 300 epochs.
- Each input comprises a 10-frame stack, with every frame resized to a common fixed resolution.
- Temporal fusion is performed via ConvLSTM over the preceding frames of the input window.
- Long- and short-term memory bank capacities ($N_l$, $N_s$) are fixed hyperparameters.
- Optimizer: AdamW with a PolyLR schedule, linear warmup (1,000 iterations), and layer-wise learning-rate decay of 0.9 (see the configuration sketch after this list).
- Batch size: 2, 180,000 iterations, single RTX 4090; inference speed: ~28 ms/frame.
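A sketch of this optimizer and schedule setup. The base learning rate, weight decay, and poly power are placeholders (the published values are not reproduced here), and the layer-wise learning-rate decay is omitted for brevity; the 1,000-iteration warmup and 180,000-iteration budget follow the protocol above.

```python
import torch

def build_optimizer(model: torch.nn.Module,
                    base_lr: float = 1e-4,       # placeholder value
                    weight_decay: float = 1e-2,  # placeholder value
                    warmup_iters: int = 1_000,
                    total_iters: int = 180_000,
                    power: float = 0.9):         # poly power; placeholder
    """AdamW with a poly decay schedule and linear warmup. Layer-wise LR
    decay (0.9 per layer, per the protocol above) is omitted for brevity."""
    opt = torch.optim.AdamW(model.parameters(), lr=base_lr,
                            weight_decay=weight_decay)

    def poly_with_warmup(it: int) -> float:
        if it < warmup_iters:
            return (it + 1) / warmup_iters                        # linear warmup
        frac = (it - warmup_iters) / max(1, total_iters - warmup_iters)
        return (1.0 - frac) ** power                              # poly decay

    sched = torch.optim.lr_scheduler.LambdaLR(opt, poly_with_warmup)
    return opt, sched  # call sched.step() once per training iteration
```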
7. Experimental Evaluation
7.1 Quantitative Results
MWNet establishes superior performance on four ultrasound datasets (thyroid nodule, VTUS, TG3K, CAMUS echocardiography). Key metrics include Dice, IoU, and MAE. Representative results:
| Dataset | Dice (%) | IoU (%) | MAE | Params (M) | FPS |
|---|---|---|---|---|---|
| Thyroid Nodule | 88.03 | 78.61 | 0.0101 | 72.99 | 28 |
| VTUS | 87.68 | 78.06 | 0.0208 | — | — |
| TG3K | 87.90 | 78.41 | 0.0258 | — | — |
| CAMUS (Echo) | 94.34 | 89.29 | 0.0109 | — | — |
Paired t-tests confirm statistically significant IoU improvements over state-of-the-art comparators.
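A minimal illustration of this significance test, assuming matched per-case IoU lists for MWNet and a comparator (the function name and sample values are hypothetical):

```python
from scipy.stats import ttest_rel

def paired_iou_test(iou_ours, iou_baseline):
    """Paired t-test over matched per-case IoU scores from two models."""
    result = ttest_rel(iou_ours, iou_baseline)
    return result.statistic, result.pvalue

# e.g., paired_iou_test([0.79, 0.81, 0.76], [0.74, 0.78, 0.73])
```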
7.2 Ablation Study
An incremental module-insertion ablation is performed:
| WTConv | Memory | HFF | Dice (%) | IoU (%) |
|---|---|---|---|---|
| ✗ | ✗ | ✗ | 82.63 | 70.40 |
| ✓ | ✗ | ✗ | 84.46 | 73.10 |
| ✓ | ✓ | ✗ | 87.21 | 77.32 |
| ✓ | ✓ | ✓ | 88.03 | 78.61 |
Each architectural innovation confers a measurable gain over the 82.63% Dice baseline: WTConv adds +1.83 Dice (+2.70 IoU), the memory bank a further +2.75 Dice (+4.22 IoU), and HFF a final +0.82 Dice (+1.29 IoU).
7.3 Qualitative and Robustness Results
Visualization demonstrates improved boundary adherence and speckle denoising, with the long-term memory bank reducing inter-frame segmentation jitter and enhancing tracking stability over sequences of up to 60 frames. Performance is stable across a range of sequence lengths and memory bank sizes, peaking at an intermediate bank capacity.
8. Significance and Implications
The memory bank-based wavelet filtering and fusion framework leverages multiscale frequency-domain decomposition, distributed temporal memory, and adaptive HF fusion to address key segmentation challenges in ultrasonography: low contrast, speckle-dominated backgrounds, and small-object tracking over extended video intervals. MWNet demonstrates robust tracking and boundary precision exceeding existing video and image SOTA architectures, particularly under the demands of small lesion segmentation and persistent object identification in lengthy medical sequences (Zhang et al., 17 Dec 2025). A plausible implication is the suitability of this design in broader video segmentation domains that require explicit frequency- and memory-aware mechanisms.