
USSE-Net: Unsupervised Ultrasound Elastography

Updated 26 November 2025
  • USSE-Net is an unsupervised deep learning architecture that robustly estimates displacement fields and axial strain maps from pre- and post-deformation RF ultrasound sequences.
  • It employs a multi-stream encoder–decoder with CACFF, TCA, and CAF modules, leveraging feature fusion and attention mechanisms for spatial-temporal coherence.
  • Integrated into the MUSSE-Net framework, USSE-Net achieves state-of-the-art improvements in SNR, CNR, and artifact suppression in both simulation and in vivo evaluations.

USSE-Net is an unsupervised end-to-end deep learning architecture for consistent ultrasound strain elastography, designed to robustly estimate displacement fields and axial strain maps from pre- and post-deformation radio-frequency (RF) ultrasound sequences. At its core, USSE-Net features a multi-stream encoder–decoder structure incorporating novel feature fusion and attention mechanisms to overcome limitations such as tissue decorrelation, lack of ground truth, and temporal instability. The network forms the backbone of the multi-stage residual-aware MUSSE-Net framework, which further refines strain outputs and suppresses noise through sequential residual correction (Joarder et al., 19 Nov 2025).

1. Architecture Overview

USSE-Net receives as input a reference (pre-deformation) RF frame $I_{\text{pre}}\in\mathbb{R}^{1\times H\times W}$ and a temporal sequence of $T$ post-deformation RF frames $\{I_{\text{post}}^t\}_{t=1}^T$. For each pair $(I_{\text{pre}}, I_{\text{post}}^t)$, it predicts a dense 2D displacement field $D^t=(D_x^t, D_y^t)\in\mathbb{R}^{2\times H\times W}$, where $D_x^t$ and $D_y^t$ correspond to lateral and axial displacements, respectively. The axial strain map $z^t$ is subsequently extracted by applying a Least-Squares Strain Estimator (LSQSE) to the predicted axial displacement.
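The least-squares strain estimation step can be sketched as a sliding linear fit of the axial displacement along the axial direction. This is a minimal NumPy illustration, not the paper's implementation; the window length is an assumed parameter:

```python
import numpy as np

def lsq_strain(d_axial, window=15):
    """Least-squares strain: per-pixel slope of a linear fit to the
    axial displacement within a sliding axial window."""
    H, W = d_axial.shape
    half = window // 2
    x = np.arange(window) - half      # centered local axial coordinates
    denom = np.sum(x * x)             # sum of squared deviations
    strain = np.zeros_like(d_axial, dtype=float)
    for i in range(half, H - half):
        seg = d_axial[i - half:i + half + 1, :]  # (window, W)
        # slope of the least-squares line, vectorized over columns
        strain[i, :] = x @ seg / denom
    return strain
```

For a displacement that grows linearly with depth, this recovers a constant strain equal to the slope, which is the behaviour the estimator is designed for.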

The model consists of three principal modules:

  • a Context-Aware Complementary Feature Fusion (CACFF) encoder,
  • a Tri-Cross Attention (TCA) bottleneck, and
  • a Cross-Attentive Fusion (CAF) sequential decoder with ConvLSTM.

These components are orchestrated to maximize spatial-temporal feature complementarity, global attention, and output consistency.

2. Context-Aware Complementary Feature Fusion (CACFF) Encoder

The CACFF encoder deviates from conventional single- or dual-stream encoders by employing three parallel branches at each downsampling level $\ell$, with spatial resolution halved and the number of channels doubled per stage ($C_1=16$, $C_2=32$, $C_3=64$, $C_4=128$):

  • Pre-frame branch: Processes $I_{\text{pre}}$ to yield residual features $f^{t,\ell}_{\text{pre}}$.
  • Post-frame branch: Processes $I_{\text{post}}^t$ with shared weights, producing $f^{t,\ell}_{\text{post}}$.
  • Mid-branch: Uses CACFF blocks to fuse the raw concatenated input $[I_{\text{pre}}, I_{\text{post}}^t]$ and both feature streams, yielding $f^{t,\ell}_{\text{mid}}$.

Each CACFF block generates shallow convolutional embeddings of the raw input pair and produces fused mid-stream features via a $3\times3$ convolution over concatenated pre-frame, post-frame, and input embeddings, followed by dimension reduction. Residual connections in all streams facilitate both self-refinement and cross-stream contextual exchange. The output of the fourth block in all branches ($f_{\text{pre}}^{t,4}$, $f_{\text{post}}^{t,4}$, $f_{\text{mid}}^{t,4}$) is relayed to the bottleneck for global integration.
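As a rough sketch of the fusion step described above, the following NumPy code concatenates pre-frame, post-frame, and input embeddings along the channel axis and reduces them with a $3\times3$ convolution. The naive convolution and the weight tensor `w_fuse` are illustrative stand-ins, not the trained network:

```python
import numpy as np

def conv3x3(x, w):
    """Naive 3x3 'same' convolution: x is (C_in, H, W), w is (C_out, C_in, 3, 3)."""
    C_in, H, W = x.shape
    C_out = w.shape[0]
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((C_out, H, W))
    for o in range(C_out):
        for c in range(C_in):
            for di in range(3):
                for dj in range(3):
                    out[o] += w[o, c, di, dj] * xp[c, di:di + H, dj:dj + W]
    return out

def cacff_fuse(f_pre, f_post, emb_in, w_fuse):
    """Fuse pre-, post-, and raw-input embeddings into mid-stream features:
    concatenate along channels, then reduce with a 3x3 convolution."""
    cat = np.concatenate([f_pre, f_post, emb_in], axis=0)  # (3C, H, W)
    return conv3x3(cat, w_fuse)                            # back to (C, H, W)
```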

3. Tri-Cross Attention (TCA) Bottleneck

Conventional correlation layers in the literature apply local patch-based attention. In contrast, the TCA bottleneck computes three global pairwise attention matrices among the three encoder stream outputs at the final downsampling level ($\ell=4$), each reshaped to $\mathbb{R}^{C_4\times N}$ (with $N=(H/16)(W/16)$):

  • Pre–Post Attention: $A_{\text{pre}\to\text{post}} = \mathrm{softmax}(F_{\text{pre}}^\top F_{\text{post}})$,
  • Pre–Mid Attention: $A_{\text{pre}\to\text{mid}} = \mathrm{softmax}(F_{\text{pre}}^\top F_{\text{mid}})$,
  • Post–Mid Attention: $A_{\text{post}\to\text{mid}} = \mathrm{softmax}(F_{\text{post}}^\top F_{\text{mid}})$.

The corresponding attended features are concatenated and processed through a $1\times1$ convolution to restore the channel count; the result re-weights the mid-stream bottleneck output via pointwise multiplication after a softmax or sigmoid activation. This produces a globally attentive bottleneck feature volume with rich contextualized features for decoding.
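Under the stated formulation, the three attention matrices and the gated re-weighting of the mid-stream can be sketched as follows. Which operand plays the query versus key role, and the `W_reduce` projection, are assumptions made for illustration:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def tri_cross_attention(F_pre, F_post, F_mid):
    """Three global pairwise attention maps among (C, N) stream features."""
    A_pre_post = softmax(F_pre.T @ F_post)  # (N, N)
    A_pre_mid = softmax(F_pre.T @ F_mid)
    A_post_mid = softmax(F_post.T @ F_mid)
    # attend each target stream by the corresponding matrix, then concatenate
    return np.concatenate([F_post @ A_pre_post.T,
                           F_mid @ A_pre_mid.T,
                           F_mid @ A_post_mid.T], axis=0)  # (3C, N)

def tca_bottleneck(F_pre, F_post, F_mid, W_reduce):
    att = tri_cross_attention(F_pre, F_post, F_mid)  # (3C, N)
    g = 1.0 / (1.0 + np.exp(-(W_reduce @ att)))      # 1x1 conv + sigmoid gate
    return F_mid * g                                 # re-weighted mid-stream
```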

4. Cross-Attentive Fusion (CAF) Sequential Decoder with ConvLSTM

The decoder mirrors the four-level encoding, upsampling the bottleneck feature volume to full resolution while leveraging temporal memory via ConvLSTM across the sequence of $T$ frames. At each upsampling level:

  • The previous hidden state is bilinearly upsampled.
  • The encoder skip connection relevant to that level (from pre, post, or mid branch) is fused via CACFF/TCA.
  • An attention mask is computed over the skip feature, forming a cross-attended skip through elementwise multiplication.
  • The concatenated upsampled hidden state and attended skip are passed through a $3\times3$ convolution and entered into a ConvLSTM cell, updating the temporal state.

Decoder outputs include level-wise displacement residuals ($\Delta D^t_{\mathrm{lvl},\ell}$), each upsampled and summed to produce the final displacement field $D^t$. The LSQSE is then used to derive the axial strain map from the axial displacement component.
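The residual summation at the end of decoding can be illustrated as below. Nearest-neighbour upsampling stands in for the (unspecified) interpolation, and any rescaling of displacement magnitudes with resolution is omitted:

```python
import numpy as np

def upsample_nn(d, factor):
    """Nearest-neighbour upsampling of a (2, h, w) displacement residual."""
    return d.repeat(factor, axis=1).repeat(factor, axis=2)

def compose_displacement(residuals, full_hw):
    """Sum level-wise residuals, each upsampled to full resolution,
    to form the final displacement field D^t."""
    H, W = full_hw
    D = np.zeros((2, H, W))
    for d in residuals:
        f = H // d.shape[1]          # integer upsampling factor for this level
        D += upsample_nn(d, f)
    return D
```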

5. Unsupervised Loss Functions and Consistency Enforcement

Training is fully unsupervised and leverages a composite objective integrating three loss terms:

  • Similarity Loss ($L_{\text{sim}}$): Encourages the post-frame, warped by the predicted displacement, to resemble the pre-frame, using the mean local normalized cross-correlation (LNCC).
  • Smoothness Loss ($L_{\text{smooth}}$): Penalizes second-order spatial gradients in the displacement field to enforce biomechanical plausibility.
  • Consistency Loss ($L_{\text{con}}$): Enforces temporal coherence by maximizing cross-correlation of successive strain maps.

The aggregate loss is $L_{\text{total}} = \alpha L_{\text{sim}} + \beta L_{\text{con}} + \gamma L_{\text{smooth}}$, with empirically chosen coefficients $\alpha=1.0$, $\beta=0.2$, $\gamma=0.3$.
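A minimal NumPy sketch of the three loss terms, with non-overlapping windows standing in for the sliding local windows of LNCC:

```python
import numpy as np

def lncc(a, b, win=9):
    """Mean local normalized cross-correlation over non-overlapping windows
    (a simplification of sliding-window LNCC)."""
    H, W = a.shape
    vals = []
    for i in range(0, H - win + 1, win):
        for j in range(0, W - win + 1, win):
            pa = a[i:i + win, j:j + win] - a[i:i + win, j:j + win].mean()
            pb = b[i:i + win, j:j + win] - b[i:i + win, j:j + win].mean()
            denom = np.sqrt((pa * pa).sum() * (pb * pb).sum()) + 1e-8
            vals.append((pa * pb).sum() / denom)
    return float(np.mean(vals))

def smoothness(D):
    """Second-order penalty on each displacement component."""
    loss = 0.0
    for d in D:  # d: (H, W)
        d2y = d[2:, :] - 2 * d[1:-1, :] + d[:-2, :]
        d2x = d[:, 2:] - 2 * d[:, 1:-1] + d[:, :-2]
        loss += (d2y ** 2).mean() + (d2x ** 2).mean()
    return loss

def total_loss(pre, warped_post, D, strain_t, strain_prev,
               alpha=1.0, beta=0.2, gamma=0.3):
    L_sim = 1.0 - lncc(pre, warped_post)       # similarity term (1 - LNCC)
    L_con = 1.0 - lncc(strain_t, strain_prev)  # temporal consistency term
    return alpha * L_sim + beta * L_con + gamma * smoothness(D)
```

Expressing the correlation terms as $1 - \text{LNCC}$ turns maximization of correlation into loss minimization; this sign convention is an assumption where the paper only states what each term encourages.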

6. Multi-Stage Residual Refinement via MUSSE-Net

USSE-Net’s outputs are further refined in the full MUSSE-Net framework, which stacks two USSE-Net stages to iteratively estimate and add residual displacement corrections. Each subsequent stage processes post-frames warped by prior cumulative displacements, predicting residual fields added to previous estimates. Subsequent USSE-Net stages are trained with earlier weights frozen to ensure stable incremental refinement. In practice, two stages suffice for optimal accuracy; further correction yields negligible improvement.
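The two-stage residual refinement loop can be sketched as follows, with a nearest-neighbour backward warp standing in for the framework's warping step and `stage` as a placeholder for a trained USSE-Net:

```python
import numpy as np

def warp(img, D):
    """Backward warp with nearest-neighbour sampling (bilinear in practice).
    D[0] is lateral (x), D[1] is axial (y) displacement."""
    H, W = img.shape
    yy, xx = np.mgrid[0:H, 0:W]
    ys = np.clip(np.rint(yy + D[1]).astype(int), 0, H - 1)
    xs = np.clip(np.rint(xx + D[0]).astype(int), 0, W - 1)
    return img[ys, xs]

def musse_refine(pre, post, stages):
    """Stack of USSE-Net stages: each predicts a residual field on the
    post-frame warped by the cumulative displacement so far."""
    D = np.zeros((2,) + pre.shape)
    for stage in stages:  # each stage: (pre, warped_post) -> residual field
        warped = warp(post, D)
        D = D + stage(pre, warped)
    return D
```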

7. Implementation, Training, and Performance Evaluation

USSE-Net and MUSSE-Net are implemented in PyTorch 2.5.1 and trained on NVIDIA V100 GPUs (32 GB) with batch size 1 and ConvLSTM-enabled sequential decoding. Training uses the Adam optimizer (learning rate $10^{-3}$) with a plateau scheduler, for 150 epochs in the base stage and 100 epochs in the residual stage.
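The stated training configuration can be sketched in PyTorch as below; the `Conv2d` is a hypothetical placeholder for the actual network, which is not reproduced here:

```python
import torch

# Placeholder standing in for a USSE-Net stage (not the real model).
model = torch.nn.Conv2d(2, 2, kernel_size=3, padding=1)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)          # lr = 10^-3
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer)  # plateau scheduler

for epoch in range(150):  # 150 epochs (base stage); 100 for the residual stage
    # ... forward pass over RF sequences, compute L_total, optimizer.step() ...
    # scheduler.step(validation_loss)
    pass
```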

Empirical evaluation—spanning Field II simulation, public in vivo, and private clinical breast datasets from the Bangladesh University of Engineering and Technology (BUET) Medical Center—demonstrates state-of-the-art unsupervised strain elastography performance:

| Metric            | USSE-Net (Sim.) | MUSSE-Net (Sim.) |
|-------------------|-----------------|------------------|
| Target SNR        | 16.69 ± 3.52    | 24.54 ± 3.66     |
| Background SNR    | 102.25 ± 33.36  | 132.76 ± 45.63   |
| CNR               | 43.11 ± 14.15   | 59.81 ± 20.38    |
| Elastographic SNR | 9.16 ± 0.80     | 9.73 ± 1.08      |
| NRMSE (%)         | 1.12 ± 0.12     | 1.31 ± 0.06      |

On public in vivo data, elastographic SNR improved to 0.97 (USSE-Net) and 0.99 (MUSSE-Net), with qualitative gains in lesion detectability and notable suppression of decorrelation artifacts. These results suggest that the integrated CACFF, TCA, and CAF modules, together with multi-stage residual refinement, address key challenges in ultrasound strain elastography, supporting reliable, high-contrast, and temporally stable strain mapping without supervised targets (Joarder et al., 19 Nov 2025).
