
USSE-Net: Unsupervised Ultrasound Elastography

Updated 26 November 2025
  • USSE-Net is an unsupervised deep learning architecture that robustly estimates displacement fields and axial strain maps from pre- and post-deformation RF ultrasound sequences.
  • It employs a multi-stream encoder–decoder with CACFF, TCA, and CAF modules, leveraging feature fusion and attention mechanisms for spatial-temporal coherence.
  • Integrated into the MUSSE-Net framework, USSE-Net achieves state-of-the-art improvements in SNR, CNR, and artifact suppression in both simulation and in vivo evaluations.

USSE-Net is an unsupervised end-to-end deep learning architecture for consistent ultrasound strain elastography, designed to robustly estimate displacement fields and axial strain maps from pre- and post-deformation radio-frequency (RF) ultrasound sequences. At its core, USSE-Net features a multi-stream encoder–decoder structure incorporating novel feature fusion and attention mechanisms to overcome limitations such as tissue decorrelation, lack of ground truth, and temporal instability. The network forms the backbone of the multi-stage residual-aware MUSSE-Net framework, which further refines strain outputs and suppresses noise through sequential residual correction (Joarder et al., 19 Nov 2025).

1. Architecture Overview

USSE-Net receives as input a reference (pre-deformation) RF frame $I_{\text{pre}}\in\mathbb{R}^{1\times H\times W}$ and a temporal sequence of $T$ post-deformation RF frames $\{I_{\text{post}}^t\}_{t=1}^T$. For each pair $(I_{\text{pre}}, I_{\text{post}}^t)$, it predicts a dense 2D displacement field $D^t=(D_x^t, D_y^t)\in\mathbb{R}^{2\times H\times W}$, where $D_x^t$ and $D_y^t$ correspond to lateral and axial displacements, respectively. The axial strain map $z^t$ is subsequently extracted by applying a Least-Squares Strain Estimator (LSQSE) to the predicted axial displacement.
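The least-squares strain estimation step can be sketched as a sliding linear fit of the axial displacement along the axial direction. This is a minimal NumPy illustration, not the paper's implementation; the window length is an assumed parameter:

```python
import numpy as np

def lsq_strain(d_axial, window=15):
    """Least-squares strain: per-pixel slope of a linear fit to the
    axial displacement within a sliding axial window."""
    H, W = d_axial.shape
    half = window // 2
    x = np.arange(window) - half      # centered local axial coordinates
    denom = np.sum(x * x)             # sum of squared deviations
    strain = np.zeros_like(d_axial, dtype=float)
    for i in range(half, H - half):
        seg = d_axial[i - half:i + half + 1, :]  # (window, W)
        # slope of the least-squares line, vectorized over columns
        strain[i, :] = x @ seg / denom
    return strain
```

For a displacement that grows linearly with depth, this recovers a constant strain equal to the slope, which is the behaviour the estimator is designed for.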

The model consists of three principal modules:

  • a Context-Aware Complementary Feature Fusion (CACFF) encoder,
  • a Tri-Cross Attention (TCA) bottleneck, and
  • a Cross-Attentive Fusion (CAF) sequential decoder with ConvLSTM.

These components are orchestrated to maximize spatial-temporal feature complementarity, global attention, and output consistency.

2. Context-Aware Complementary Feature Fusion (CACFF) Encoder

The CACFF encoder deviates from conventional single- or dual-stream encoders by employing three parallel branches at each downsampling level $\ell$, with spatial resolution halved and the number of channels doubled per stage ($C_1=16$, $C_2=32$, $C_3=64$, $C_4=128$):

  • Pre-frame branch: Processes $I_{\text{pre}}$ to yield residual features $f^{t,\ell}_{\text{pre}}$.
  • Post-frame branch: Processes $I_{\text{post}}^t$ with shared weights, producing $f^{t,\ell}_{\text{post}}$.
  • Mid-branch: Uses CACFF blocks to fuse the raw concatenated input $[I_{\text{pre}}, I_{\text{post}}^t]$ and both feature streams, yielding $f^{t,\ell}_{\text{mid}}$.

Each CACFF block generates shallow convolutional embeddings of the raw input pair and produces fused mid-stream features via a $3\times3$ convolution over concatenated pre-frame, post-frame, and input embeddings, followed by dimension reduction. Residual connections in all streams facilitate both self-refinement and cross-stream contextual exchange. The output of the fourth block in all branches ($f_{\text{pre}}^{t,4}$, $f_{\text{post}}^{t,4}$, $f_{\text{mid}}^{t,4}$) is relayed to the bottleneck for global integration.
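As a rough sketch of the fusion step described above, the following NumPy code concatenates pre-frame, post-frame, and input embeddings along the channel axis and reduces them with a $3\times3$ convolution. The naive convolution and the weight tensor `w_fuse` are illustrative stand-ins, not the trained network:

```python
import numpy as np

def conv3x3(x, w):
    """Naive 3x3 'same' convolution: x is (C_in, H, W), w is (C_out, C_in, 3, 3)."""
    C_in, H, W = x.shape
    C_out = w.shape[0]
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((C_out, H, W))
    for o in range(C_out):
        for c in range(C_in):
            for di in range(3):
                for dj in range(3):
                    out[o] += w[o, c, di, dj] * xp[c, di:di + H, dj:dj + W]
    return out

def cacff_fuse(f_pre, f_post, emb_in, w_fuse):
    """Fuse pre-, post-, and raw-input embeddings into mid-stream features:
    concatenate along channels, then reduce with a 3x3 convolution."""
    cat = np.concatenate([f_pre, f_post, emb_in], axis=0)  # (3C, H, W)
    return conv3x3(cat, w_fuse)                            # back to (C, H, W)
```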

3. Tri-Cross Attention (TCA) Bottleneck

Conventional correlation layers in the literature apply local patch-based attention. In contrast, the TCA bottleneck computes three global pairwise attention matrices among the three encoder stream outputs at the final downsampling level ($\ell=4$), each reshaped to $\mathbb{R}^{C_4\times N}$ (with $N=(H/16)(W/16)$):

  • Pre–Post Attention: $A_{\text{pre}\to\text{post}} = \mathrm{softmax}(F_{\text{pre}}^\top F_{\text{post}})$,
  • Pre–Mid Attention: $A_{\text{pre}\to\text{mid}} = \mathrm{softmax}(F_{\text{pre}}^\top F_{\text{mid}})$,
  • Post–Mid Attention: $A_{\text{post}\to\text{mid}} = \mathrm{softmax}(F_{\text{post}}^\top F_{\text{mid}})$.

The corresponding attended features are concatenated and processed through a $1\times1$ convolution to restore the channel count; the result re-weights the mid-stream bottleneck output via pointwise multiplication after a softmax or sigmoid activation. This produces a globally attentive bottleneck feature volume with rich contextualized features for decoding.
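Under the stated formulation, the three attention matrices and the gated re-weighting of the mid-stream can be sketched as follows. Which operand plays the query versus key role, and the `W_reduce` projection, are assumptions made for illustration:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def tri_cross_attention(F_pre, F_post, F_mid):
    """Three global pairwise attention maps among (C, N) stream features."""
    A_pre_post = softmax(F_pre.T @ F_post)  # (N, N)
    A_pre_mid = softmax(F_pre.T @ F_mid)
    A_post_mid = softmax(F_post.T @ F_mid)
    # attend each target stream by the corresponding matrix, then concatenate
    return np.concatenate([F_post @ A_pre_post.T,
                           F_mid @ A_pre_mid.T,
                           F_mid @ A_post_mid.T], axis=0)  # (3C, N)

def tca_bottleneck(F_pre, F_post, F_mid, W_reduce):
    att = tri_cross_attention(F_pre, F_post, F_mid)  # (3C, N)
    g = 1.0 / (1.0 + np.exp(-(W_reduce @ att)))      # 1x1 conv + sigmoid gate
    return F_mid * g                                 # re-weighted mid-stream
```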

4. Cross-Attentive Fusion (CAF) Sequential Decoder with ConvLSTM

The decoder mirrors the four-level encoding, upsampling the bottleneck feature volume to full resolution while leveraging temporal memory via ConvLSTM across the sequence of $T$ frames. At each upsampling level:

  • The previous hidden state is bilinearly upsampled.
  • The encoder skip connection relevant to that level (from pre, post, or mid branch) is fused via CACFF/TCA.
  • An attention mask is computed over the skip feature, forming a cross-attended skip through elementwise multiplication.
  • The concatenated upsampled hidden state and attended skip are passed through a $3\times3$ convolution and entered into a ConvLSTM cell, updating the temporal state.

Decoder outputs include level-wise displacement residuals ($\Delta D^t_{\mathrm{lvl},\ell}$), each upsampled and summed to produce the final displacement field $D^t$. The LSQSE is then used to derive the axial strain map from the axial displacement component.
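The residual summation at the end of decoding can be illustrated as below. Nearest-neighbour upsampling stands in for the (unspecified) interpolation, and any rescaling of displacement magnitudes with resolution is omitted:

```python
import numpy as np

def upsample_nn(d, factor):
    """Nearest-neighbour upsampling of a (2, h, w) displacement residual."""
    return d.repeat(factor, axis=1).repeat(factor, axis=2)

def compose_displacement(residuals, full_hw):
    """Sum level-wise residuals, each upsampled to full resolution,
    to form the final displacement field D^t."""
    H, W = full_hw
    D = np.zeros((2, H, W))
    for d in residuals:
        f = H // d.shape[1]          # integer upsampling factor for this level
        D += upsample_nn(d, f)
    return D
```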

5. Unsupervised Loss Functions and Consistency Enforcement

Training is fully unsupervised and leverages a composite objective integrating three loss terms:

  • Similarity Loss ($L_{\text{sim}}$): Encourages the post-frame, warped by the predicted displacement, to resemble the pre-frame, using the mean local normalized cross-correlation (LNCC).
  • Smoothness Loss ($L_{\text{smooth}}$): Penalizes second-order spatial gradients in the displacement field to enforce biomechanical plausibility.
  • Consistency Loss ($L_{\text{con}}$): Enforces temporal coherence by maximizing cross-correlation of successive strain maps.

The aggregate loss is $L_{\text{total}} = \alpha L_{\text{sim}} + \beta L_{\text{con}} + \gamma L_{\text{smooth}}$, with empirically chosen coefficients $\alpha=1.0$, $\beta=0.2$, $\gamma=0.3$.
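A minimal NumPy sketch of the three loss terms, with non-overlapping windows standing in for the sliding local windows of LNCC:

```python
import numpy as np

def lncc(a, b, win=9):
    """Mean local normalized cross-correlation over non-overlapping windows
    (a simplification of sliding-window LNCC)."""
    H, W = a.shape
    vals = []
    for i in range(0, H - win + 1, win):
        for j in range(0, W - win + 1, win):
            pa = a[i:i + win, j:j + win] - a[i:i + win, j:j + win].mean()
            pb = b[i:i + win, j:j + win] - b[i:i + win, j:j + win].mean()
            denom = np.sqrt((pa * pa).sum() * (pb * pb).sum()) + 1e-8
            vals.append((pa * pb).sum() / denom)
    return float(np.mean(vals))

def smoothness(D):
    """Second-order penalty on each displacement component."""
    loss = 0.0
    for d in D:  # d: (H, W)
        d2y = d[2:, :] - 2 * d[1:-1, :] + d[:-2, :]
        d2x = d[:, 2:] - 2 * d[:, 1:-1] + d[:, :-2]
        loss += (d2y ** 2).mean() + (d2x ** 2).mean()
    return loss

def total_loss(pre, warped_post, D, strain_t, strain_prev,
               alpha=1.0, beta=0.2, gamma=0.3):
    L_sim = 1.0 - lncc(pre, warped_post)       # similarity term (1 - LNCC)
    L_con = 1.0 - lncc(strain_t, strain_prev)  # temporal consistency term
    return alpha * L_sim + beta * L_con + gamma * smoothness(D)
```

Expressing the correlation terms as $1 - \text{LNCC}$ turns maximization of correlation into loss minimization; this sign convention is an assumption where the paper only states what each term encourages.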

6. Multi-Stage Residual Refinement via MUSSE-Net

USSE-Net’s outputs are further refined in the full MUSSE-Net framework, which stacks two USSE-Net stages to iteratively estimate and add residual displacement corrections. Each subsequent stage processes post-frames warped by prior cumulative displacements, predicting residual fields added to previous estimates. Subsequent USSE-Net stages are trained with earlier weights frozen to ensure stable incremental refinement. In practice, two stages suffice for optimal accuracy; further correction yields negligible improvement.
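The two-stage residual refinement loop can be sketched as follows, with a nearest-neighbour backward warp standing in for the framework's warping step and `stage` as a placeholder for a trained USSE-Net:

```python
import numpy as np

def warp(img, D):
    """Backward warp with nearest-neighbour sampling (bilinear in practice).
    D[0] is lateral (x), D[1] is axial (y) displacement."""
    H, W = img.shape
    yy, xx = np.mgrid[0:H, 0:W]
    ys = np.clip(np.rint(yy + D[1]).astype(int), 0, H - 1)
    xs = np.clip(np.rint(xx + D[0]).astype(int), 0, W - 1)
    return img[ys, xs]

def musse_refine(pre, post, stages):
    """Stack of USSE-Net stages: each predicts a residual field on the
    post-frame warped by the cumulative displacement so far."""
    D = np.zeros((2,) + pre.shape)
    for stage in stages:  # each stage: (pre, warped_post) -> residual field
        warped = warp(post, D)
        D = D + stage(pre, warped)
    return D
```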

7. Implementation, Training, and Performance Evaluation

USSE-Net and MUSSE-Net are implemented in PyTorch 2.5.1 and trained on NVIDIA V100 GPUs (32 GB) with batch size 1 and ConvLSTM-enabled sequential decoding. Training uses the Adam optimizer (learning rate $10^{-3}$) with a plateau scheduler, for 150 epochs in the base stage and 100 epochs in the residual stage.
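The stated training configuration can be sketched in PyTorch as below; the `Conv2d` is a hypothetical placeholder for the actual network, which is not reproduced here:

```python
import torch

# Placeholder standing in for a USSE-Net stage (not the real model).
model = torch.nn.Conv2d(2, 2, kernel_size=3, padding=1)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)          # lr = 10^-3
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer)  # plateau scheduler

for epoch in range(150):  # 150 epochs (base stage); 100 for the residual stage
    # ... forward pass over RF sequences, compute L_total, optimizer.step() ...
    # scheduler.step(validation_loss)
    pass
```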

Empirical evaluation—spanning Field II simulation, public in vivo, and private clinical breast datasets from the Bangladesh University of Engineering and Technology (BUET) Medical Center—demonstrates state-of-the-art unsupervised strain elastography performance:

| Metric            | USSE-Net (Sim.) | MUSSE-Net (Sim.) |
|-------------------|-----------------|------------------|
| Target SNR        | 16.69 ± 3.52    | 24.54 ± 3.66     |
| Background SNR    | 102.25 ± 33.36  | 132.76 ± 45.63   |
| CNR               | 43.11 ± 14.15   | 59.81 ± 20.38    |
| Elastographic SNR | 9.16 ± 0.80     | 9.73 ± 1.08      |
| NRMSE (%)         | 1.12 ± 0.12     | 1.31 ± 0.06      |

On public in vivo data, elastographic SNR improved to 0.97 (USSE-Net) and 0.99 (MUSSE-Net), with qualitative gains in lesion detectability and notable suppression of decorrelation artifacts. These results suggest that the integrated CACFF, TCA, and CAF modules, together with multi-stage residual refinement, address key challenges in ultrasound strain elastography, supporting reliable, high-contrast, and temporally stable strain mapping without supervised targets (Joarder et al., 19 Nov 2025).
