MoCo-INR: Unsupervised Cardiac MRI Reconstruction
- The paper presents an unsupervised framework that decomposes dynamic cardiac MRI into a static canonical image and continuous deformation fields using implicit neural representations.
- It employs dual neural networks with hash-based coordinate encoding and a coarse-to-fine curriculum to optimize motion compensation from undersampled k-space data.
- Results demonstrate improved PSNR and SSIM at high acceleration factors, yielding fine-grained, artifact-suppressed reconstructions suitable for clinical wall-motion assessment.
MoCo-INR is an unsupervised motion-compensated reconstruction framework that fuses explicit motion modeling with implicit neural representations (INRs) for accelerated dynamic cardiac MRI. This approach leverages the decomposition of a dynamic image sequence into a single time-invariant canonical image and a set of continuous, time-varying deformation fields, with both components parameterized as coordinate-based neural networks. The MoCo-INR paradigm is characterized by continuous spatial–temporal modeling, unsupervised optimization directly from undersampled k-t space data, and the integration of recent neural encoding techniques. Together, these enable fine-grained, artifact-suppressed reconstructions even at high acceleration factors, with robust clinical performance for cardiac wall-motion assessment (Tian et al., 14 Nov 2025).
1. Theoretical Motivation and Rationale
MoCo-INR is motivated by the limitations of standalone motion-compensated (MoCo) and pure INR approaches to highly accelerated dynamic MRI reconstruction. Traditional MoCo methods require fully-sampled datasets or discrete image representations and tend to lose high-frequency anatomical detail due to aliasing during nonrigid warping. Pure INR-based dynamic MRI, while imposing a beneficial continuity prior, lacks explicit modeling of time-varying deformations, leading to slow convergence and suboptimal resolution in the presence of severe undersampling.
MoCo-INR addresses these deficiencies by decomposing the reconstruction task into learning (a) a static canonical appearance field and (b) a continuous, temporally-indexed deformation vector field (DVF). Explicit factorization reduces overparameterization and decouples texture from deformation, enabling unsupervised learning even when only highly undersampled, motion-corrupted data is available (Tian et al., 14 Nov 2025).
2. Mathematical Formulation
The MoCo-INR acquisition model for a spatiotemporal dynamic sequence is

$$y_{c,t} = M_t \,\mathcal{F}\, S_c\, x_t + n_{c,t},$$

where $x_t$ is the image at time frame $t$, $S_c$ are the coil sensitivity maps, $\mathcal{F}$ the Fourier transform, $M_t$ a binary undersampling mask, $y_{c,t}$ the acquired k-space data for coil $c$, and $n_{c,t}$ is Gaussian noise.
Motion compensation is achieved by reconstructing each frame as a nonrigid warp of the canonical image $x_{\mathrm{can}}$ via

$$x_t(\mathbf{r}) = x_{\mathrm{can}}\bigl(\mathbf{r} + \mathbf{d}(\mathbf{r}, t)\bigr),$$

where $\mathbf{d}(\mathbf{r}, t)$ is the deformation at spatial position $\mathbf{r}$ and time $t$.
Both the canonical field and the DVF are realized as coordinate-based networks:
- $x_{\mathrm{can}}(\mathbf{r}) = f_\theta(\mathbf{r})$ (canonical image, outputs real and imaginary channels)
- $\mathbf{d}(\mathbf{r}, t) = g_\phi(\mathbf{r}, t)$ (DVF, outputs displacement for each coordinate and time)
The full prediction for frame $t$ is

$$\hat{y}_{c,t} = M_t \,\mathcal{F}\, S_c\, \hat{x}_t, \qquad \hat{x}_t(\mathbf{r}) = f_\theta\bigl(\mathbf{r} + g_\phi(\mathbf{r}, t)\bigr).$$

The data consistency objective (L1-norm) is imposed in k-space for each frame and coil:

$$\mathcal{L}_{\mathrm{dc}} = \sum_{c,t} \bigl\| \hat{y}_{c,t} - y_{c,t} \bigr\|_1 .$$

Regularization on the DVFs is critical for stability. The loss includes terms for sparsity ($\mathcal{L}_{\mathrm{sparse}}$), spatial smoothness ($\mathcal{L}_{\mathrm{smooth}}$), and curvature ($\mathcal{L}_{\mathrm{curv}}$) of $\mathbf{d}$, giving the total loss

$$\mathcal{L} = \mathcal{L}_{\mathrm{dc}} + \lambda_1 \mathcal{L}_{\mathrm{sparse}} + \lambda_2 \mathcal{L}_{\mathrm{smooth}} + \lambda_3 \mathcal{L}_{\mathrm{curv}},$$

with scalar weights $\lambda_1, \lambda_2, \lambda_3$ set empirically in practice.
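A minimal PyTorch sketch of this forward model and composite loss follows, assuming a single 2D Cartesian frame, a complex-valued canonical network, and finite-difference surrogates for the DVF penalties; the function names, tensor layouts, and λ values are illustrative assumptions, not the authors' code.

```python
import torch

def forward_model(x_t, smaps, mask_t):
    """Predict multicoil k-space for one frame: y_hat = M_t F S_c x_t.
    x_t: (H, W) complex image; smaps: (C, H, W) complex; mask_t: (H, W) binary."""
    coil_imgs = smaps * x_t.unsqueeze(0)                       # (C, H, W) coil images
    kspace = torch.fft.fftshift(
        torch.fft.fft2(torch.fft.ifftshift(coil_imgs, dim=(-2, -1))),
        dim=(-2, -1))
    return mask_t * kspace                                     # apply undersampling mask

def moco_inr_loss(f_theta, g_phi, coords, t, smaps, mask_t, y_t,
                  lam=(1e-2, 1e-2, 1e-3)):                     # placeholder weights
    """L1 data consistency in k-space plus sparsity/smoothness/curvature DVF penalties."""
    d = g_phi(coords, t)                                       # (H, W, 2) displacement field
    x_hat = f_theta(coords + d)                                # canonical image at warped positions
    y_hat = forward_model(x_hat, smaps, mask_t)
    l_dc = (y_hat - y_t).abs().mean()                          # L1 data consistency

    # Finite-difference surrogates for the DVF regularizers.
    dx = d[1:, :, :] - d[:-1, :, :]
    dy = d[:, 1:, :] - d[:, :-1, :]
    l_sparse = d.abs().mean()
    l_smooth = dx.abs().mean() + dy.abs().mean()
    l_curv = (dx[1:] - dx[:-1]).abs().mean() + (dy[:, 1:] - dy[:, :-1]).abs().mean()
    return l_dc + lam[0] * l_sparse + lam[1] * l_smooth + lam[2] * l_curv
```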
3. Network Architecture and Encoding Strategy
MoCo-INR employs two parallel neural networks, each leveraging hash-based coordinate encoding ("InstantNGP"-style):
- The canonical image and the deformation field use separate multi-level hash grid encoders.
- Each encoder provides features at multiple spatial resolutions, which are concatenated to form the final representation (a simplified stand-in is sketched below).
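As a concrete illustration of the multi-resolution encoding, the following sketch uses dense feature grids at several resolutions instead of true hash tables, for brevity; the class name, level count, and feature widths are assumptions, and a production system would use an InstantNGP-style hash grid (e.g., tiny-cuda-nn).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiResGridEncoder(nn.Module):
    """Simplified stand-in for a multi-level hash encoder: one learnable dense
    feature grid per resolution level, bilinearly sampled and concatenated."""

    def __init__(self, num_levels=8, base_res=16, growth=1.5, level_dim=2):
        super().__init__()
        self.grids = nn.ParameterList()
        for level in range(num_levels):
            res = int(base_res * growth ** level)
            self.grids.append(nn.Parameter(1e-4 * torch.randn(1, level_dim, res, res)))

    def forward(self, coords):
        """coords: (H, W, 2) in [-1, 1]; returns (H, W, num_levels * level_dim)."""
        grid = coords.unsqueeze(0)                              # (1, H, W, 2) for grid_sample
        feats = [F.grid_sample(g, grid, mode='bilinear', align_corners=True)
                 for g in self.grids]                           # each (1, level_dim, H, W)
        return torch.cat(feats, dim=1).squeeze(0).permute(1, 2, 0)
```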
The encoded features are fed into compact 3-layer CNN decoders:
- Each decoder employs two 3×3 convolutional layers (64 filters, ReLU) and a final output convolution with 2 channels (real and imaginary parts for the canonical image $f_\theta$; the two displacement components for the DVF $g_\phi$).
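A minimal PyTorch sketch of such a decoder is shown below; the kernel size of the output convolution and the parameter name `in_channels` are assumptions.

```python
import torch.nn as nn

def make_decoder(in_channels):
    """Compact 3-layer CNN decoder: two 3x3 conv layers (64 filters, ReLU)
    followed by a 2-channel output convolution (1x1 kernel assumed)."""
    return nn.Sequential(
        nn.Conv2d(in_channels, 64, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.Conv2d(64, 64, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.Conv2d(64, 2, kernel_size=1),   # Re/Im for f_theta, or (dx, dy) for g_phi
    )
```

In this layout, the (H, W, C′) encoder output from the sketch above would be permuted to (1, C′, H, W) before being passed to the decoder.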
A "coarse-to-fine" curriculum is imposed:
- Early optimization unfreezes only the low-frequency (coarse) hash levels; high-frequency (fine) levels are progressively enabled. This stabilizes estimation of large-scale motion before fine detail is captured (one possible implementation is sketched below).
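One common way to realize such a curriculum is to zero out (or freeze) the features of the fine hash levels early in training and enable them on a schedule; the linear schedule and function names below are illustrative assumptions that complement the encoder sketch above.

```python
import torch

def active_levels(iteration, total_iters=1200, num_levels=8, warmup_frac=0.5):
    """Number of hash-grid levels enabled at a given iteration (assumed linear schedule)."""
    frac = min(iteration / (warmup_frac * total_iters), 1.0)
    return max(1, int(round(frac * num_levels)))

def mask_fine_levels(feats, n_active, level_dim):
    """Zero out features from hash levels beyond `n_active`.
    feats: (..., num_levels * level_dim) concatenated multi-resolution features."""
    masked = feats.clone()
    masked[..., n_active * level_dim:] = 0.0
    return masked
```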
4. Unsupervised Optimization and Training Protocol
- Coil sensitivity maps are estimated from a time-averaged reference via ESPIRiT.
- The entire system (appearance and DVF networks) is trained end-to-end on the acquired data, without ground-truth or external supervision.
- Training uses Adam for 1,200 iterations, with standard deep learning schedules:
- Hash-level curricula: initial iterations activate only coarse grid levels; as training progresses, finer levels are incrementally unfrozen.
- Learning rate and DVF-regularization weights: scalar hyperparameters set as reported in (Tian et al., 14 Nov 2025).
- Loss: L1 data-consistency combined with the weighted sparsity, smoothness, and curvature penalties on the DVF (see Section 2).
- Optimization exploits only the data from the subject under examination; no pretraining or transfer learning is required.
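Putting these pieces together, a per-scan optimization loop might look like the sketch below, reusing `moco_inr_loss` and `active_levels` from the earlier sketches; the learning rate, frame-sampling strategy, and the `set_active_levels` hook on the networks are assumptions, and the coil maps are presumed to be precomputed with ESPIRiT.

```python
import torch

# f_theta, g_phi (nn.Modules), coords, smaps, masks, and kspace are assumed to be
# defined as in the earlier sketches; kspace: (T, C, H, W) complex, masks: (T, H, W).
optimizer = torch.optim.Adam(
    list(f_theta.parameters()) + list(g_phi.parameters()), lr=1e-3)  # lr is a placeholder

num_iters = 1200
num_frames = kspace.shape[0]
for it in range(num_iters):
    # Coarse-to-fine curriculum: enable more hash levels as training progresses.
    # set_active_levels is a hypothetical hook that could internally apply mask_fine_levels.
    n_active = active_levels(it, total_iters=num_iters)
    f_theta.set_active_levels(n_active)
    g_phi.set_active_levels(n_active)

    t = torch.randint(num_frames, (1,)).item()        # visit cardiac frames stochastically
    loss = moco_inr_loss(f_theta, g_phi, coords, t, smaps, masks[t], kspace[t])

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```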
5. Performance, Ablation, and Comparative Analysis
Quantitative and qualitative results on the OCMR cine dataset, using retrospective Cartesian VISTA (AF = 12×, 20×) and golden-angle radial (AF ≈ 26×, 69×) undersampling:
| Sampling / AF | MoCo-INR PSNR (dB) | MoCo-INR SSIM | Next-Best Method | Next-Best PSNR (dB) | Next-Best SSIM |
|---|---|---|---|---|---|
| VISTA 12× | 42.25 | 0.971 | ST-INR + L₀S | 41.35 | 0.972 |
| VISTA 20× | 39.53 | 0.957 | SOTA | ~36.6 | ~0.937 |
| Golden-angle 26× | 40.33 | 0.960 | — | — | — |
| Golden-angle 69× | 37.75 | 0.940 | — | — | — |
Ablation studies demonstrate:
- Replacing the CNN decoder with an MLP decreases PSNR by ≈3 dB and SSIM by ≈0.04; high-frequency ringing is introduced.
- Training all hash levels from the start (removing the curriculum) results in ≈2 dB PSNR loss and spurious DVF estimation in static regions.
- Omitting DVF regularization decreases PSNR by ≈3 dB and SSIM by ≈0.04, and produces implausible motion fields.
In prospective free-breathing clinical evaluation (VISTA AF = 9×, 65 frames, ∼26 Hz), MoCo-INR achieves diagnostic image quality, accurate wall-motion depiction, and stronger myocardium–blood-pool boundary definition than competing unsupervised approaches. Runtimes for cine reconstruction are ≈1.3 min (VISTA retro) and ≈5.5 min (golden-angle), superior to or comparable with non-INR and hybrid INR baselines (Tian et al., 14 Nov 2025).
6. Motion-Compensated Decomposition and Separation of Appearance/Dynamics
MoCo-INR explicitly enforces a separation between anatomy and motion:
- The canonical appearance field $f_\theta$ encodes time-invariant tissue texture and contrast.
- The DVF $g_\phi$ encodes dynamic motion as a continuous spatiotemporal mapping.
This decomposition prevents entanglement of motion and appearance in the network's latent space, which is a documented failure mode in single-network INR or generative approaches under extreme undersampling. The explicit structure also supports interpretable motion tracking and recoverable deformation fields suitable for downstream analysis or reporting.
7. Broader Impact, Limitations, and Prospects
MoCo-INR demonstrates that explicit factorization, hierarchical encoding, and CNN-based decoders are synergistic for dynamic MRI from highly limited measurements. The framework is fully unsupervised and scan-adaptive, supporting a wide range of sampling patterns (Cartesian, radial), acceleration factors (up to 69×), and clinical conditions (including free breathing).
Major limitations are the requirement of preestimated coil sensitivities and the restriction to continuous but deterministic deformation fields—potentially limiting applicability to highly nonrigid, stochastic, or interleaved motion. Extensions to 3D, probabilistic motion models, or alternative encoding strategies are plausible directions for future research. The curriculum adopted for hash-level unfreezing is likely beneficial for other inverse problems involving deformation and appearance disentanglement in medical imaging (Tian et al., 14 Nov 2025).