MUSSE-Net: Unsupervised Strain Estimation
- The paper introduces MUSSE-Net’s multi-stage architecture, combining a USSE-Net backbone with a residual refinement stage to achieve precise, unsupervised strain estimation in ultrasound elastography.
- It leverages advanced attention-based feature fusion (CACFF and TCA) and sequential ConvLSTM decoding to enhance temporal consistency and suppress decorrelation noise.
- Experimental evaluations demonstrate significantly improved SNR, CNR, and sharper lesion boundaries on both simulated and in vivo datasets compared to baseline methods.
MUSSE-Net (Multi-Stage Residual-Aware Unsupervised Strain Estimation Network) is an end-to-end unsupervised deep learning framework specifically developed for robust, consistent strain estimation in quasi-static ultrasound strain elastography (USE). USE is a noninvasive imaging technique for assessing tissue mechanical properties, but its clinical utility remains limited by decorrelation noise, absence of ground truth, and instability of strain estimates across deformation levels. MUSSE-Net addresses these challenges via a multi-stage architecture that leverages advanced attention-based feature fusion and sequential modeling, yielding state-of-the-art results in both simulation and in vivo datasets (Joarder et al., 19 Nov 2025).
1. Network Architecture
MUSSE-Net consists of two principal components: an initial USSE-Net backbone and a subsequent residual refinement stage. The USSE-Net module is a single-stage, multi-stream encoder–decoder design featuring attention mechanisms and sequential decoding to produce dense displacement fields and axial strain maps. The secondary residual stage further refines these estimates by predicting displacement residuals following warping of the post-compression frame.
USSE-Net Backbone
Each USSE-Net stage processes one pre-compression RF frame and a sequence of post-compression RF frames, all of a fixed input size. For each pre/post frame pair, the network predicts a two-channel (axial and lateral) displacement field; axial strain maps are then computed via least-squares differentiation of the axial component.
USSE-Net is structured around:
- Context-Aware Complementary Feature Fusion (CACFF) multi-stream encoder,
- Tri-Cross Attention (TCA) bottleneck,
- Cross-Attentive Fusion (CAF) sequential ConvLSTM decoder.
CACFF Multi-Stream Encoder
At each down-sampling level, CACFF deploys three parallel branches:
- Two unimodal streams with shared weights extract motion-specific features from the pre- and post-compression frames through four residual down-sampling blocks (each block: two 3×3 convolutions + ReLU + skip connection).
- A "mid" stream fuses the two unimodal streams' features through CACFF blocks, producing contextual, complementary cross-frame feature representations.
Tri-Cross Attention (TCA) Bottleneck
At the coarsest spatial scale, TCA fuses the feature maps from the three streams, projecting each to key, query, and value tensors. TCA computes global pairwise correlations between all stream pairs, concatenates the outputs, projects back to the original dimension, and additively fuses the result into the mid-stream value. This captures long-range dependencies, reduces decorrelation noise, and suppresses lateral artifacts.
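As a rough sketch of the pairwise cross-attention inside TCA, the operation can be illustrated with plain matrix arithmetic over spatially flattened features. The projection matrices `Wq`, `Wk`, `Wv`, the shapes, and the fusion arrangement below are illustrative assumptions, not the paper's exact design:

```python
import numpy as np

def cross_attention(query_feat, kv_feat, Wq, Wk, Wv):
    """One cross-attention pass: queries from one stream, keys/values from another.

    query_feat, kv_feat: (N, C) feature maps flattened over space (N = H*W).
    Wq, Wk, Wv: (C, D) illustrative projection matrices.
    """
    Q, K, V = query_feat @ Wq, kv_feat @ Wk, kv_feat @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])        # global pairwise correlations
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)       # softmax over key positions
    return attn @ V                               # (N, D) attended values

# TCA-style fusion sketch: attend the "mid" stream to each unimodal stream,
# then additively fuse the attended values into the mid-stream values.
rng = np.random.default_rng(0)
N, C, D = 16, 8, 8
mid, pre, post = rng.standard_normal((3, N, C))
Wq, Wk, Wv = rng.standard_normal((3, C, D)) / np.sqrt(C)
fused = mid @ Wv \
    + cross_attention(mid, pre, Wq, Wk, Wv) \
    + cross_attention(mid, post, Wq, Wk, Wv)
```

Because the attention is global over all spatial positions, each output location aggregates evidence from the entire frame pair, which is what lets the bottleneck compensate for local decorrelation.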
CAF Sequential ConvLSTM Decoder
Decoding unfolds in reverse order over the encoder levels, with a ConvLSTM enforcing temporal consistency across the post-compression frames. At each level and time step:
- The upsampled hidden state is combined via attention over the encoder's skip features.
- The resulting attention weights form a fused skip connection.
- The concatenated features are processed through a 3×3 convolution + ReLU and passed, together with the recurrent state, to the ConvLSTM, yielding an updated hidden state.
A final 3×3 convolution maps the hidden state to the displacement field. Displacements across all levels are aggregated and differentiated to obtain strain.
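A minimal ConvLSTM-style recurrent update can be sketched as follows. For brevity the gates use a 1×1-convolution equivalent (pure channel mixing); the actual decoder uses spatial convolutions and attention-fused skip features, so this is an assumption-laden simplification:

```python
import numpy as np

def convlstm_step(x, h, c, W, b):
    """One recurrent update. x, h, c: (H, W, C); W: (2C, 4C); b: (4C,).

    Gates are computed with a 1x1-convolution equivalent (channel mixing);
    a real ConvLSTM would use spatial kernels such as 3x3.
    """
    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))
    z = np.concatenate([x, h], axis=-1) @ W + b       # all four gates at once
    i, f, o, g = np.split(z, 4, axis=-1)
    c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)  # cell-state update
    h_new = sigmoid(o) * np.tanh(c_new)               # new hidden state
    return h_new, c_new

# Roll the cell over a short sequence of post-compression feature maps,
# carrying (h, c) forward so later frames see earlier ones.
rng = np.random.default_rng(1)
H_, W_, C = 4, 4, 6
W = rng.standard_normal((2 * C, 4 * C)) * 0.1
b = np.zeros(4 * C)
h = c = np.zeros((H_, W_, C))
for _ in range(3):                                    # e.g. T = 3 post-frames
    x = rng.standard_normal((H_, W_, C))
    h, c = convlstm_step(x, h, c, W, b)
```

Carrying the recurrent state across frames is what makes consecutive strain predictions mutually consistent rather than independent.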
Residual Refinement Stage
MUSSE-Net stacks two USSE-Net stages, the empirically optimal depth. Each stage receives the original pre-compression frame together with a post-compression frame warped by the accumulated displacement (the unwarped post-frame at the first stage). The stage predicts a residual displacement, which is added to the previous estimate to form the refined displacement, and the warped image is updated accordingly. Warping is preceded by 4× upsampling and followed by downsampling for sub-pixel accuracy. Only the current stage is updated during training; earlier stages are frozen.
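The stage-wise warp-and-refine loop can be sketched as below. Here `predict_residual` is a stand-in for a trained USSE-Net stage, and the axial-only linear interpolation is a simplification of the paper's upsample/warp/downsample scheme:

```python
import numpy as np

def warp_axial(frame, disp):
    """Warp each column of `frame` along the axial (row) axis by `disp`."""
    H, W = frame.shape
    rows = np.arange(H, dtype=float)
    out = np.empty_like(frame, dtype=float)
    for j in range(W):
        # sample the frame at the displaced axial positions (edges clamp)
        out[:, j] = np.interp(rows + disp[:, j], rows, frame[:, j])
    return out

def refine(pre, post, predict_residual, n_stages=2):
    """Multi-stage residual refinement: each stage predicts a correction to
    the accumulated displacement, then the post-frame is re-warped."""
    d_total = np.zeros_like(pre, dtype=float)
    warped = post.astype(float)
    for _ in range(n_stages):
        d_total = d_total + predict_residual(pre, warped)
        warped = warp_axial(post, d_total)
    return d_total, warped

# Toy check: a constant-shift "network" recovers a constant axial shift.
rng = np.random.default_rng(2)
pre = rng.standard_normal((32, 8))
post = np.roll(pre, -2, axis=0)            # post-frame shifted by 2 rows
stub = lambda a, b: np.full_like(a, -1.0)  # each stage corrects by 1 px
d, warped = refine(pre, post, stub, n_stages=2)
```

After two stages the accumulated displacement equals the true shift and the warped post-frame matches the pre-frame away from the boundary, which mirrors how each residual stage tightens the alignment left over from the previous one.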
2. Unsupervised Loss Functions
MUSSE-Net is trained entirely without ground-truth displacement, leveraging three unsupervised, physically motivated loss terms:
- Photometric (Similarity) Loss: Based on local normalized cross-correlation (LNCC) between pre-compression and warped post-compression frames.
- Displacement Smoothness Loss: Penalizes second-order spatial gradients to enforce tissue continuity.
- Temporal Consistency Loss: Minimizes LNCC difference between consecutive predicted strain maps, stabilizing estimates across deformation.
The total loss is the weighted sum $\mathcal{L}_{\text{total}} = \lambda_{\text{sim}}\mathcal{L}_{\text{sim}} + \lambda_{\text{smooth}}\mathcal{L}_{\text{smooth}} + \lambda_{\text{temp}}\mathcal{L}_{\text{temp}}$, with empirically chosen weights. The same losses are applied in both stages, and each residual stage is trained independently.
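A self-contained sketch of the LNCC-based photometric term follows; the window size and epsilon are illustrative choices, and the smoothness and temporal terms would follow the same pattern applied to displacement gradients and consecutive strain maps:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def lncc(a, b, win=9, eps=1e-8):
    """Mean local normalized cross-correlation over win x win patches."""
    A = sliding_window_view(a, (win, win)).astype(float)
    B = sliding_window_view(b, (win, win)).astype(float)
    A = A - A.mean(axis=(-2, -1), keepdims=True)   # zero-mean each patch
    B = B - B.mean(axis=(-2, -1), keepdims=True)
    num = (A * B).sum(axis=(-2, -1))
    den = np.sqrt((A * A).sum(axis=(-2, -1)) * (B * B).sum(axis=(-2, -1)))
    return (num / (den + eps)).mean()

def photometric_loss(pre, warped_post):
    """Loss decreases as the warped post-frame matches the pre-frame."""
    return 1.0 - lncc(pre, warped_post)

# Perfectly aligned frames give LNCC ~ 1 and a near-zero loss.
rng = np.random.default_rng(4)
frame = rng.standard_normal((32, 32))
loss_aligned = photometric_loss(frame, frame)
```

Using local (windowed) rather than global correlation makes the term robust to depth-dependent gain and amplitude variation in RF data, which is why LNCC is a common choice for unsupervised registration-style objectives.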
3. Strain Calculation
Axial strain is computed from the axial displacement field using a least-squares strain estimator (LSQSE). Within a sliding axial window, the estimator fits a line to the displacement samples,
$$\min_{a,b} \sum_{i} \big(u_a(z_i) - a z_i - b\big)^2,$$
and takes the slope $a$ as the local strain. Discretized over the whole field, this yields $\hat{\varepsilon} = (A^\top A)^{-1} A^\top u_a$, where $A$ is a finite-difference integration matrix relating strain to displacement. The network implements this as a convolutional least-squares layer for efficient strain map generation.
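Because the centered least-squares slope has the closed form $\hat a = \sum_i z_i u_i / \sum_i z_i^2$, the estimator reduces to a fixed 1-D correlation along the axial axis, which is exactly what makes a convolutional implementation possible. A sketch (the window length is an illustrative choice):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def lsq_strain(u_axial, k=15):
    """Axial strain via sliding-window least-squares slope (LSQSE sketch).

    u_axial: (H, W) axial displacement map; k: odd window length in samples.
    Returns an (H - k + 1, W) strain map (valid region only).
    """
    z = np.arange(k) - (k - 1) / 2.0          # centered sample coordinates
    w = z / (z @ z)                           # closed-form LS slope weights
    U = sliding_window_view(u_axial, k, axis=0)   # (H-k+1, W, k) windows
    return U @ w                              # slope of each local linear fit

# A linear displacement field u = 0.02 * depth has uniform strain 0.02.
depth = np.arange(64, dtype=float)[:, None]
u = 0.02 * depth * np.ones((1, 8))
strain = lsq_strain(u, k=15)
```

Averaging the slope over a window, rather than taking a two-sample finite difference, trades axial resolution for noise suppression; larger `k` smooths the strain map further.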
4. Experimental Protocols and Quantitative Evaluation
Datasets
- Field II simulation: 23 phantoms with inclusions (18–23 kPa) and backgrounds (40–60 kPa), across 10 strain levels (0.5–4.5 %) and 10 scatterer realizations; 19 phantoms for training, 4 for validation and testing, with multiple post-compression frames per reference frame.
- Public in vivo dataset: 310 sequences, each 19–127 frames; 6 consecutive frames used for sequential decoding; 20 sequences for testing.
- Private BUET dataset: 23 subjects for training and 5 for testing, acquired with a 10 MHz probe (40 MHz sampling); includes a tissue-mimicking phantom, with multiple post-compression frames per reference frame.
Networks were trained with the Adam optimizer and stepwise learning-rate scheduling, batch size 1, for 150 epochs (stage 1) plus 100 epochs (residual stage).
Metrics
| Metric | Formula/Interpretation |
|---|---|
| Target SNR (SNR_t) | μ_t/σ_t (mean/std of strain in lesion) |
| Background SNR (SNR_b) | μ_b/σ_b (mean/std of strain in background) |
| Contrast-to-Noise Ratio (CNR) | √(2(μ_b − μ_t)²/(σ_b² + σ_t²)), lesion-to-background strain contrast relative to noise |
| Elastographic SNR (SNR_e) | μ/σ (global mean/std of the strain map) |
| NRMSE | Normalized root-mean-square error relative to the reference strain |
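Using these standard elastography definitions (SNR as mean over standard deviation within a region, and the contrast-to-noise form shown in the table), the metrics can be computed as follows; the mask conventions and toy values are illustrative:

```python
import numpy as np

def region_snr(strain, mask):
    """SNR of a region: mean over standard deviation of its strain values."""
    vals = strain[mask]
    return vals.mean() / vals.std()

def cnr(strain, lesion_mask, background_mask):
    """Standard elastographic CNR: sqrt(2 (mu_b - mu_t)^2 / (s_b^2 + s_t^2))."""
    t = strain[lesion_mask]
    b = strain[background_mask]
    return np.sqrt(2 * (b.mean() - t.mean()) ** 2 / (b.var() + t.var()))

# Toy strain map with distinct lesion and background strain levels.
rng = np.random.default_rng(3)
strain = 0.03 + 0.001 * rng.standard_normal((64, 64))
lesion = np.zeros((64, 64), dtype=bool)
lesion[24:40, 24:40] = True
strain[lesion] = 0.01 + 0.001 * rng.standard_normal(lesion.sum())
```

High CNR requires both a large strain contrast between lesion and background and low variance within each region, which is why it rewards the noise suppression the multi-stage design targets.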
Main Results
| Model | SNR_t | SNR_b | CNR | NRMSE | SNR_e |
|---|---|---|---|---|---|
| USENet (baseline) | 13.66 ± 1.75 | 48.15 ± 7.27 | 20.98 ± 3.62 | 29.35 ± 0.77 % | 4.43 ± 0.35 |
| ReUSENet (ConvLSTM) | 14.64 ± 2.23 | 68.43 ± 18.36 | 30.33 ± 8.06 | 2.03 ± 0.04 % | 7.50 ± 0.58 |
| USSE-Net | 16.69 ± 3.52 | 102.25 ± 33.36 | 43.11 ± 14.15 | 1.12 ± 0.12 % | 9.16 ± 0.80 |
| MUSSE-Net | 24.54 ± 3.66 | 132.76 ± 45.63 | 59.81 ± 20.38 | 1.31 ± 0.06 % | 9.73 ± 1.08 |
For the public in vivo test set, mean SNR values were 0.81 (USENet), 0.96 (ReUSENet), 0.97 (USSE-Net), and 0.99 (MUSSE-Net). On the private BUET dataset and phantom, MUSSE-Net demonstrated consistently sharper lesion boundaries, higher contrast, and better-suppressed decorrelation noise than previous methods. Stability analyses confirm that MUSSE-Net maintains SNR_t, SNR_b, and CNR even at high strain (4.5 %), where earlier models degrade.
5. Clinical Significance and Interpretability
The integration of CACFF (for contextual fusion), TCA (for cross-modal attention), and CAF–ConvLSTM (for temporal coherence) enables MUSSE-Net to generate axial strain maps closely reflecting true tissue mechanics. Lesion areas appear with well-defined, reproducible boundaries and contrast, while background noise and artifacts are significantly diminished. The temporal consistency loss ensures that strain estimates remain stable under varying deformations, a requirement for clinical USE applications such as breast lesion assessment and liver fibrosis staging.
6. Limitations and Future Directions
Despite its superior quantitative and qualitative performance, MUSSE-Net exhibits increased training and inference costs due to its multi-stage construction and use of ConvLSTM cells, and currently requires batch size 1 for training. Future research aims include developing lightweight variants, leveraging network pruning, exploring alternative temporal modeling strategies to permit larger batch sizes, and undertaking broader validation across multi-center clinical datasets. These directions are intended to address scalability and generalizability for widespread clinical adoption (Joarder et al., 19 Nov 2025).