
MUSSE-Net: Unsupervised Strain Estimation

Updated 26 November 2025
  • The paper introduces MUSSE-Net’s multi-stage architecture, combining a USSE-Net backbone with a residual refinement stage to achieve precise, unsupervised strain estimation in ultrasound elastography.
  • It leverages advanced attention-based feature fusion (CACFF and TCA) and sequential ConvLSTM decoding to enhance temporal consistency and suppress decorrelation noise.
  • Experimental evaluations demonstrate significantly improved SNR, CNR, and sharper lesion boundaries on both simulated and in vivo datasets compared to baseline methods.

MUSSE-Net (Multi-Stage Residual-Aware Unsupervised Strain Estimation Network) is an end-to-end unsupervised deep learning framework specifically developed for robust, consistent strain estimation in quasi-static ultrasound strain elastography (USE). USE is a noninvasive imaging technique for assessing tissue mechanical properties, but its clinical utility remains limited by decorrelation noise, absence of ground truth, and instability of strain estimates across deformation levels. MUSSE-Net addresses these challenges via a multi-stage architecture that leverages advanced attention-based feature fusion and sequential modeling, yielding state-of-the-art results in both simulation and in vivo datasets (Joarder et al., 19 Nov 2025).

1. Network Architecture

MUSSE-Net consists of two principal components: an initial USSE-Net backbone and a subsequent residual refinement stage. The USSE-Net module is a single-stage, multi-stream encoder–decoder design featuring attention mechanisms and sequential decoding to produce dense displacement fields and axial strain maps. The secondary residual stage further refines these estimates by predicting displacement residuals following warping of the post-compression frame.

USSE-Net Backbone

Each USSE-Net stage processes one pre-compression RF frame $I_{\mathrm{pre}}$ and a sequence of $T$ post-compression RF frames $\{I^t_{\mathrm{post}}\}_{t=1}^T$, with each input tensor of size $1 \times H \times W$. For each pair $(I_{\mathrm{pre}}, I^t_{\mathrm{post}})$, the network predicts a two-channel displacement field $D^t = (D^t_x, D^t_y)$, with $D^t_y$ representing the axial displacement. Axial strain maps $z^t$ are then computed via least-squares differentiation of $D^t_y$.

USSE-Net is structured around:

CACFF Multi-Stream Encoder

At each down-sampling level $l$, CACFF deploys three parallel branches:

  • Two unimodal streams with shared weights extract motion-specific features from $I_{\mathrm{pre}}$ and $I^t_{\mathrm{post}}$ through four residual down-sampling blocks (each block: two 3×3 convolutions + ReLU + skip).
  • A “mid” stream fuses $[I_{\mathrm{pre}}, I^t_{\mathrm{post}}]$ and the unimodal features, with CACFF blocks producing contextual, complementary cross-frame feature representations.
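The residual down-sampling block in the unimodal streams can be sketched as follows. This is a minimal single-channel NumPy illustration, not the paper's implementation: the real network uses multi-channel learned convolutions, and the 2×2 average-pool downsampling here is an assumption.

```python
import numpy as np

def conv3x3(x, w):
    """3x3 convolution with zero padding ('same' output size), single channel."""
    h, wd = x.shape
    xp = np.pad(x, 1)
    out = np.zeros_like(x, dtype=float)
    for i in range(h):
        for j in range(wd):
            out[i, j] = np.sum(xp[i:i + 3, j:j + 3] * w)
    return out

def residual_down_block(x, w1, w2):
    """Residual down-sampling block sketch: two 3x3 convs + ReLU + skip,
    then 2x downsampling (average pooling is an illustrative choice)."""
    y = np.maximum(conv3x3(x, w1), 0.0)        # conv + ReLU
    y = np.maximum(conv3x3(y, w2), 0.0) + x    # conv + ReLU + skip connection
    h, wd = y.shape
    return y.reshape(h // 2, 2, wd // 2, 2).mean(axis=(1, 3))
```

Stacking four such blocks halves the spatial resolution four times, producing the coarse feature maps consumed by the bottleneck.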

Tri-Cross Attention (TCA) Bottleneck

At the coarsest spatial scale, TCA fuses feature maps from the three streams ($f^{t,L}_{\mathrm{pre}}, f^{t,L}_{\mathrm{post}}, f^{t,L}_{\mathrm{mid}}$), projecting each to query, key, and value tensors. TCA computes global pairwise correlations between all pairs (e.g., $A_{\mathrm{pre}\to\mathrm{mid}} = \mathrm{softmax}(Q_{\mathrm{mid}} K_{\mathrm{pre}}^\mathsf{T})$), concatenates the outputs, projects back to the original dimension, and additively fuses the result into the mid-stream value, capturing long-range dependencies, reducing decorrelation noise, and suppressing lateral artifacts.
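The fusion step above can be sketched in NumPy. This is a single-head form over flattened spatial positions, which is an assumption; head count, normalization, and projection details may differ in the paper.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def tri_cross_attention(f_pre, f_post, f_mid, Wq, Wk, Wv, Wo):
    """Tri-Cross Attention sketch: the mid-stream query attends to the
    pre- and post-stream keys; the attended values are concatenated,
    projected back to the feature dimension, and additively fused into
    the mid-stream value. Features are (N, C) with N flattened spatial
    positions."""
    q_mid = f_mid @ Wq
    attended = []
    for f in (f_pre, f_post):
        k, v = f @ Wk, f @ Wv
        attn = softmax(q_mid @ k.T / np.sqrt(k.shape[1]))  # e.g. A_{pre->mid}
        attended.append(attn @ v)
    fused = np.concatenate(attended, axis=1) @ Wo          # project 2C -> C
    return f_mid @ Wv + fused                              # additive fusion
```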

CAF Sequential ConvLSTM Decoder

Decoding unfolds in reverse across the $L$ levels, with a ConvLSTM enforcing temporal consistency across post-compression frames. At each level $l$ and time $t$:

  1. The upsampled hidden state $\uparrow h^t_{l-1}$ is combined via attention over the encoder’s skip features.
  2. Attention weights $\alpha = \mathrm{softmax}(W_a[\uparrow h^t_{l-1};\,\mathrm{skips}]+b_a)$ form a fused skip $\widetilde f^t_l = \sum_i \alpha_i\,\mathrm{skips}_i$.
  3. The concatenated vector is processed through a 3×3 convolution + ReLU and passed, together with $h^{t-1}_l$, to the ConvLSTM, yielding the updated hidden state $h^t_l$.

A final 3×3 convolution maps $h^t_L \to D^t$. Displacements across all levels are aggregated and differentiated to obtain strain.
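The attention-weighted skip fusion in step 2 can be sketched as follows. The per-skip scalar scoring form used here is an assumption; the paper's attention may operate per-pixel or per-channel.

```python
import numpy as np

def fuse_skips(hidden_up, skips, Wa, ba):
    """Attention-weighted skip fusion sketch for the CAF decoder.

    A score is computed for each encoder skip feature from the upsampled
    hidden state concatenated with that skip; a softmax over skips yields
    weights alpha, and the fused skip is the alpha-weighted sum."""
    scores = np.array([
        float(np.concatenate([hidden_up.ravel(), s.ravel()]) @ Wa + ba)
        for s in skips
    ])
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()                 # softmax over skips
    return sum(a * s for a, s in zip(alpha, skips))
```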

Residual Refinement Stage

MUSSE-Net stacks two USSE-Net stages (optimal $M_{\mathrm{opt}} = 2$). At stage $m$, the original $I_{\mathrm{pre}}$ and a warped post-frame $I^{t,m-1}_{\mathrm{post}}$ are provided (with $I^{t,0}_{\mathrm{post}} = I^t_{\mathrm{post}}$ for $m=1$). The network predicts a residual displacement $D^{t,m}_{\mathrm{res}}$ and computes the refined displacement $D^{t,m} = D^{t,m-1} + D^{t,m}_{\mathrm{res}}$, updating the warped image accordingly. Warping is preceded by 4× upsampling and followed by downsampling for sub-pixel accuracy. Only the current stage is updated during training; earlier stages are frozen.
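The residual accumulation loop can be sketched as follows. The predictor and warp functions are placeholders for the USSE-Net stages and the sub-pixel warping described above; the 4× up/downsampling around warping is omitted from this sketch.

```python
import numpy as np

def multi_stage_refine(pre, post, stages, warp):
    """Multi-stage residual refinement sketch.

    `stages` is a list of M displacement predictors; stage m receives the
    original pre-frame and the post-frame warped with the accumulated
    displacement, predicts a residual D_res, and displacements are summed:
    D^{t,m} = D^{t,m-1} + D_res^{t,m}."""
    disp = np.zeros(pre.shape + (2,))           # accumulated displacement
    warped = post                               # I_post^{t,0} = I_post^t
    for predict in stages:
        residual = predict(pre, warped)         # stage-m residual field
        disp = disp + residual                  # refined displacement
        warped = warp(post, disp)               # re-warp the original post-frame
    return disp
```

Note that each stage re-warps the original post-frame with the full accumulated displacement, rather than warping an already-warped image, which avoids compounding interpolation error.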

2. Unsupervised Loss Functions

MUSSE-Net is trained entirely without ground-truth displacement, leveraging three unsupervised, physically motivated loss terms:

  • Photometric (Similarity) Loss: Based on local normalized cross-correlation (LNCC) between pre-compression and warped post-compression frames.
  • Displacement Smoothness Loss: Penalizes second-order spatial gradients to enforce tissue continuity.
  • Temporal Consistency Loss: Minimizes LNCC difference between consecutive predicted strain maps, stabilizing estimates across deformation.

The total loss is given by

$$L_{\mathrm{total}} = \alpha\,L_{\mathrm{sim}} + \beta\,L_{\mathrm{con}} + \gamma\,L_{\mathrm{smooth}}$$

with empirically chosen weights $\alpha=1.0$, $\beta=0.2$, $\gamma=0.3$. These losses are applied identically in both stages, with each residual stage trained independently.
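The loss terms can be sketched in NumPy as follows. The non-overlapping box windows in the LNCC and the exact sign conventions of the similarity and consistency terms are assumptions; the paper's implementation details may differ.

```python
import numpy as np

def lncc(a, b, win=7, eps=1e-8):
    """Local normalized cross-correlation over non-overlapping box windows
    (window shape and stride are illustrative choices)."""
    h = (a.shape[0] // win) * win
    w = (a.shape[1] // win) * win
    A = a[:h, :w].reshape(h // win, win, w // win, win).astype(float)
    B = b[:h, :w].reshape(h // win, win, w // win, win).astype(float)
    A = A - A.mean(axis=(1, 3), keepdims=True)          # zero-mean per window
    B = B - B.mean(axis=(1, 3), keepdims=True)
    num = (A * B).sum(axis=(1, 3))
    den = np.sqrt((A ** 2).sum(axis=(1, 3)) * (B ** 2).sum(axis=(1, 3))) + eps
    return float((num / den).mean())

def smoothness_loss(d):
    """Second-order spatial gradient penalty on a displacement component."""
    dxx = d[2:, :] - 2 * d[1:-1, :] + d[:-2, :]
    dyy = d[:, 2:] - 2 * d[:, 1:-1] + d[:, :-2]
    return float((dxx ** 2).mean() + (dyy ** 2).mean())

def total_loss(l_sim, l_con, l_smooth, alpha=1.0, beta=0.2, gamma=0.3):
    """L_total = alpha*L_sim + beta*L_con + gamma*L_smooth."""
    return alpha * l_sim + beta * l_con + gamma * l_smooth
```

A linear displacement ramp (constant strain) incurs zero smoothness penalty, which is exactly the tissue-continuity prior the second-order term encodes.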

3. Strain Calculation

Axial strain is computed from the axial displacement field $D_y(x, y)$ using a least-squares strain estimator (LSQSE). The estimator solves

$$z(x, y) = \arg\min_z \int \left( D_y(x, y) - \int_0^x z(s, y)\,ds \right)^2 dx$$

Discretized, this yields $z = (M^\mathsf{T}M)^{-1}M^\mathsf{T}D_y$, where $M$ is a finite-difference integration matrix. The network implements this as a convolutional least-squares layer for efficient strain map generation.
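A common sliding-window realization of the LSQSE fits a line to the axial displacement within each window and takes the slope as the local strain; the normal-equation solution $(M^\mathsf{T}M)^{-1}M^\mathsf{T}$ is precomputed once. The window length here is an illustrative choice, not the paper's value.

```python
import numpy as np

def lsq_strain(d_axial, window=5):
    """Least-squares strain estimator (LSQSE) sketch.

    Fits displacement = slope*x + intercept within a sliding axial window
    on each A-line; the slope is the local axial strain."""
    h, w = d_axial.shape
    half = window // 2
    x = np.arange(window, dtype=float)
    M = np.stack([x, np.ones(window)], axis=1)   # design matrix [x, 1]
    pinv = np.linalg.inv(M.T @ M) @ M.T          # (M^T M)^{-1} M^T, 2 x window
    strain = np.zeros_like(d_axial, dtype=float)
    for i in range(half, h - half):
        seg = d_axial[i - half:i + half + 1, :]  # window x W block
        strain[i, :] = (pinv @ seg)[0]           # slope row = strain
    return strain
```

Because the per-window fit is a fixed linear map, it is equivalent to convolving each A-line with the slope row of `pinv`, which is how a network can implement it as a convolutional layer.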

4. Experimental Protocols and Quantitative Evaluation

Datasets

  • Field II simulation: 23 phantoms with inclusions (18–23 kPa) and backgrounds (40–60 kPa) across 10 strain levels (0.5–4.5 %) and 10 scatterer realizations; 19 phantoms for training, 4 for validation and testing; $T=9$ post-frames per reference.
  • Public in vivo dataset: 310 sequences of 19–127 frames each; 6 consecutive frames used for sequential decoding; 20 sequences held out for testing.
  • Private BUET dataset: 23 subjects for training, 5 for testing, acquired with a 10 MHz probe (40 MHz sampling); includes a tissue-mimicking phantom; $T=5$ post-frames per reference.

Networks were trained with the Adam optimizer (learning rate $1 \times 10^{-3}$), stepwise learning-rate scheduling, and batch size 1, for 150 epochs (stage 1) plus 100 epochs (residual stage).

Metrics

| Metric | Formula / Interpretation |
| --- | --- |
| Target SNR | $\mathrm{SNR}_t = \bar s_t/\sigma_t$ (mean/std of strain in lesion) |
| Background SNR | $\mathrm{SNR}_{bg} = \bar s_b/\sigma_b$ (mean/std of strain in background) |
| Contrast-to-Noise Ratio (CNR) | $\sqrt{2(\bar s_b-\bar s_t)^2 / (\sigma_b^2+\sigma_t^2)}$ |
| Elastographic SNR | $\mathrm{SNR}_e = \bar s/\sigma$ (global mean/std of the strain map) |
| NRMSE | $100\times \sqrt{\tfrac{1}{N}\sum_i(w^{\mathrm{GT}}_i - w^\theta_i)^2} \,/\, \sqrt{\tfrac{1}{N}\sum_i(w^{\mathrm{GT}}_i)^2}$ |
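The table's formulas translate directly into NumPy; each function takes strain values from the relevant region (lesion, background, or whole map):

```python
import numpy as np

def snr(region):
    """SNR of a strain region: mean divided by standard deviation."""
    return float(region.mean() / region.std())

def cnr(target, background):
    """Contrast-to-noise ratio between lesion and background strain."""
    return float(np.sqrt(2 * (background.mean() - target.mean()) ** 2
                         / (background.var() + target.var())))

def nrmse(gt, pred):
    """Normalized RMSE in percent between ground-truth and predicted fields."""
    return float(100 * np.sqrt(np.mean((gt - pred) ** 2))
                 / np.sqrt(np.mean(gt ** 2)))
```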

Main Results

| Model | $\mathrm{SNR}_t$ | $\mathrm{SNR}_{bg}$ | CNR | NRMSE | $\mathrm{SNR}_e$ |
| --- | --- | --- | --- | --- | --- |
| USENet (baseline) | 13.66 ± 1.75 | 48.15 ± 7.27 | 20.98 ± 3.62 | 29.35 ± 0.77 % | 4.43 ± 0.35 |
| ReUSENet (ConvLSTM) | 14.64 ± 2.23 | 68.43 ± 18.36 | 30.33 ± 8.06 | 2.03 ± 0.04 % | 7.50 ± 0.58 |
| USSE-Net | 16.69 ± 3.52 | 102.25 ± 33.36 | 43.11 ± 14.15 | 1.12 ± 0.12 % | 9.16 ± 0.80 |
| MUSSE-Net | 24.54 ± 3.66 | 132.76 ± 45.63 | 59.81 ± 20.38 | 1.31 ± 0.06 % | 9.73 ± 1.08 |

For the public in vivo test set, mean $\mathrm{SNR}_e$ values were 0.81 (USENet), 0.96 (ReUSENet), 0.97 (USSE-Net), and 0.99 (MUSSE-Net). On the private BUET dataset and phantom, MUSSE-Net demonstrated consistently sharper lesion boundaries, higher contrast, and suppressed decorrelation noise relative to previous methods. Stability analyses confirm that MUSSE-Net maintains metrics such as $\mathrm{SNR}_t$, $\mathrm{SNR}_{bg}$, and CNR even at high strain (4.5 %), where earlier models degrade.

5. Clinical Significance and Interpretability

The integration of CACFF (for contextual fusion), TCA (for cross-modal attention), and CAF–ConvLSTM (for temporal coherence) enables MUSSE-Net to generate axial strain maps closely reflecting true tissue mechanics. Lesion areas appear with well-defined, reproducible boundaries and contrast, while background noise and artifacts are significantly diminished. The temporal consistency loss ensures that strain estimates remain stable under varying deformations, a requirement for clinical USE applications such as breast lesion assessment and liver fibrosis staging.

6. Limitations and Future Directions

Despite its superior quantitative and qualitative performance, MUSSE-Net exhibits increased training and inference costs due to its multi-stage construction and use of ConvLSTM cells, and currently requires batch size 1 for training. Future research aims include developing lightweight variants, leveraging network pruning, exploring alternative temporal modeling strategies to permit larger batch sizes, and undertaking broader validation across multi-center clinical datasets. These directions are intended to address scalability and generalizability for widespread clinical adoption (Joarder et al., 19 Nov 2025).
