MoCo-INR: Unsupervised Cardiac MRI Reconstruction
- The paper presents an unsupervised framework that decomposes dynamic cardiac MRI into a static canonical image and continuous deformation fields using implicit neural representations.
- It employs dual neural networks with hash-based coordinate encoding and a coarse-to-fine curriculum to optimize motion compensation from undersampled k-space data.
- Results demonstrate improved PSNR and SSIM at high acceleration factors, yielding fine-grained, artifact-suppressed reconstructions suitable for clinical wall-motion assessment.
MoCo-INR is an unsupervised motion-compensated reconstruction framework that fuses explicit motion modeling with implicit neural representations (INRs) for accelerated dynamic cardiac MRI. This approach leverages the decomposition of a dynamic image sequence into a single time-invariant canonical image and a set of continuous, time-varying deformation fields, with both components parameterized as coordinate-based neural networks. The MoCo-INR paradigm is characterized by continuous spatial–temporal modeling, unsupervised optimization directly from undersampled k-t space data, and the integration of recent neural encoding techniques. Together, these enable fine-grained, artifact-suppressed reconstructions even at high acceleration factors, with robust clinical performance for cardiac wall-motion assessment (Tian et al., 14 Nov 2025).
1. Theoretical Motivation and Rationale
MoCo-INR is motivated by the limitations of standalone motion-compensated (MoCo) and pure INR approaches to highly accelerated dynamic MRI reconstruction. Traditional MoCo methods require fully-sampled datasets or discrete image representations and tend to lose high-frequency anatomical detail due to aliasing during nonrigid warping. Pure INR-based dynamic MRI, while imposing a beneficial continuity prior, lacks explicit modeling of time-varying deformations, leading to slow convergence and suboptimal resolution in the presence of severe undersampling.
MoCo-INR addresses these deficiencies by decomposing the reconstruction task into learning (a) a static canonical appearance field and (b) a continuous, temporally-indexed deformation vector field (DVF). Explicit factorization reduces overparameterization and decouples texture from deformation, enabling unsupervised learning even when only highly undersampled, motion-corrupted data is available (Tian et al., 14 Nov 2025).
2. Mathematical Formulation
The MoCo-INR acquisition model for a spatiotemporal dynamic sequence is

$$y_{c,t} = M_t \,\mathcal{F}\, S_c\, x_t + n_{c,t},$$

where $x_t$ is the image at time frame $t$, $S_c$ are the coil sensitivity maps, $\mathcal{F}$ the Fourier transform, $M_t$ a binary undersampling mask, $y_{c,t}$ the acquired k-space data for coil $c$, and $n_{c,t}$ is Gaussian noise.
Motion compensation is achieved by reconstructing each frame as a nonrigid warp of the canonical image $x_{\mathrm{can}}$ via

$$x_t(\mathbf{r}) = x_{\mathrm{can}}\bigl(\mathbf{r} + \mathbf{d}(\mathbf{r}, t)\bigr),$$

where $\mathbf{d}(\mathbf{r}, t)$ is the deformation at spatial position $\mathbf{r}$ and time $t$.
Both the canonical field and the DVF are realized as coordinate-based networks:
- $x_{\mathrm{can}}(\mathbf{r}) = f_\theta(\mathbf{r})$ (canonical image, outputs real and imaginary channels)
- $\mathbf{d}(\mathbf{r}, t) = g_\phi(\mathbf{r}, t)$ (DVF, outputs displacement for each coordinate and time)
The full prediction for frame $t$ is

$$\hat{y}_{c,t} = M_t \,\mathcal{F}\, S_c\, \hat{x}_t, \qquad \hat{x}_t(\mathbf{r}) = f_\theta\bigl(\mathbf{r} + g_\phi(\mathbf{r}, t)\bigr).$$

The data consistency objective (L1-norm) is imposed in k-space for each frame and coil:

$$\mathcal{L}_{\mathrm{dc}} = \sum_{c,t} \bigl\| \hat{y}_{c,t} - y_{c,t} \bigr\|_1 .$$

Regularization on the DVFs is critical for stability. The loss includes terms for sparsity ($\mathcal{L}_{\mathrm{sparse}}$), spatial smoothness ($\mathcal{L}_{\mathrm{smooth}}$), and curvature ($\mathcal{L}_{\mathrm{curv}}$) of $\mathbf{d}$, giving the total loss

$$\mathcal{L} = \mathcal{L}_{\mathrm{dc}} + \lambda_1 \mathcal{L}_{\mathrm{sparse}} + \lambda_2 \mathcal{L}_{\mathrm{smooth}} + \lambda_3 \mathcal{L}_{\mathrm{curv}},$$

with scalar weights $\lambda_1, \lambda_2, \lambda_3$ set empirically in practice.
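A minimal PyTorch sketch of this forward model and composite loss follows, assuming a single 2D Cartesian frame, a complex-valued canonical network, and finite-difference surrogates for the DVF penalties; the function names, tensor layouts, and λ values are illustrative assumptions, not the authors' code.

```python
import torch

def forward_model(x_t, smaps, mask_t):
    """Predict multicoil k-space for one frame: y_hat = M_t F S_c x_t.
    x_t: (H, W) complex image; smaps: (C, H, W) complex; mask_t: (H, W) binary."""
    coil_imgs = smaps * x_t.unsqueeze(0)                       # (C, H, W) coil images
    kspace = torch.fft.fftshift(
        torch.fft.fft2(torch.fft.ifftshift(coil_imgs, dim=(-2, -1))),
        dim=(-2, -1))
    return mask_t * kspace                                     # apply undersampling mask

def moco_inr_loss(f_theta, g_phi, coords, t, smaps, mask_t, y_t,
                  lam=(1e-2, 1e-2, 1e-3)):                     # placeholder weights
    """L1 data consistency in k-space plus sparsity/smoothness/curvature DVF penalties."""
    d = g_phi(coords, t)                                       # (H, W, 2) displacement field
    x_hat = f_theta(coords + d)                                # canonical image at warped positions
    y_hat = forward_model(x_hat, smaps, mask_t)
    l_dc = (y_hat - y_t).abs().mean()                          # L1 data consistency

    # Finite-difference surrogates for the DVF regularizers.
    dx = d[1:, :, :] - d[:-1, :, :]
    dy = d[:, 1:, :] - d[:, :-1, :]
    l_sparse = d.abs().mean()
    l_smooth = dx.abs().mean() + dy.abs().mean()
    l_curv = (dx[1:] - dx[:-1]).abs().mean() + (dy[:, 1:] - dy[:, :-1]).abs().mean()
    return l_dc + lam[0] * l_sparse + lam[1] * l_smooth + lam[2] * l_curv
```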
3. Network Architecture and Encoding Strategy
MoCo-INR employs two parallel neural networks, each leveraging hash-based coordinate encoding ("InstantNGP"-style):
- The canonical image and the deformation field use separate multi-level hash grid encoders.
- Each encoder provides features at multiple spatial resolutions, which are concatenated to form the final representation (a simplified stand-in is sketched below).
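As a concrete illustration of the multi-resolution encoding, the following sketch uses dense feature grids at several resolutions instead of true hash tables, for brevity; the class name, level count, and feature widths are assumptions, and a production system would use an InstantNGP-style hash grid (e.g., tiny-cuda-nn).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiResGridEncoder(nn.Module):
    """Simplified stand-in for a multi-level hash encoder: one learnable dense
    feature grid per resolution level, bilinearly sampled and concatenated."""

    def __init__(self, num_levels=8, base_res=16, growth=1.5, level_dim=2):
        super().__init__()
        self.grids = nn.ParameterList()
        for level in range(num_levels):
            res = int(base_res * growth ** level)
            self.grids.append(nn.Parameter(1e-4 * torch.randn(1, level_dim, res, res)))

    def forward(self, coords):
        """coords: (H, W, 2) in [-1, 1]; returns (H, W, num_levels * level_dim)."""
        grid = coords.unsqueeze(0)                              # (1, H, W, 2) for grid_sample
        feats = [F.grid_sample(g, grid, mode='bilinear', align_corners=True)
                 for g in self.grids]                           # each (1, level_dim, H, W)
        return torch.cat(feats, dim=1).squeeze(0).permute(1, 2, 0)
```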
The encoded features are fed into compact 3-layer CNN decoders:
- Each decoder employs two 3×3 convolutional layers (64 filters, ReLU) and a final output convolution with 2 channels (real and imaginary parts for the canonical image $f_\theta$; the two displacement components for the DVF $g_\phi$).
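A minimal PyTorch sketch of such a decoder is shown below; the kernel size of the output convolution and the parameter name `in_channels` are assumptions.

```python
import torch.nn as nn

def make_decoder(in_channels):
    """Compact 3-layer CNN decoder: two 3x3 conv layers (64 filters, ReLU)
    followed by a 2-channel output convolution (1x1 kernel assumed)."""
    return nn.Sequential(
        nn.Conv2d(in_channels, 64, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.Conv2d(64, 64, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.Conv2d(64, 2, kernel_size=1),   # Re/Im for f_theta, or (dx, dy) for g_phi
    )
```

In this layout, the (H, W, C′) encoder output from the sketch above would be permuted to (1, C′, H, W) before being passed to the decoder.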
A "coarse-to-fine" curriculum is imposed:
- Early optimization unfreezes only the low-frequency (coarse) hash levels; high-frequency (fine) levels are progressively enabled. This stabilizes estimation of large-scale motion before fine detail is captured (one possible implementation is sketched below).
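One common way to realize such a curriculum is to zero out (or freeze) the features of the fine hash levels early in training and enable them on a schedule; the linear schedule and function names below are illustrative assumptions that complement the encoder sketch above.

```python
import torch

def active_levels(iteration, total_iters=1200, num_levels=8, warmup_frac=0.5):
    """Number of hash-grid levels enabled at a given iteration (assumed linear schedule)."""
    frac = min(iteration / (warmup_frac * total_iters), 1.0)
    return max(1, int(round(frac * num_levels)))

def mask_fine_levels(feats, n_active, level_dim):
    """Zero out features from hash levels beyond `n_active`.
    feats: (..., num_levels * level_dim) concatenated multi-resolution features."""
    masked = feats.clone()
    masked[..., n_active * level_dim:] = 0.0
    return masked
```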
4. Unsupervised Optimization and Training Protocol
- Coil sensitivity maps are estimated from a time-averaged reference via ESPIRiT.
- The entire system (appearance and DVF networks) is trained end-to-end on the acquired data, without ground-truth or external supervision.
- Training uses Adam for 1,200 iterations, with standard deep learning schedules:
- Hash-level curricula: initial iterations activate only coarse grid levels; as training progresses, finer levels are incrementally unfrozen.
- Learning rate and DVF-regularization weights: scalar hyperparameters set as reported in (Tian et al., 14 Nov 2025).
- Loss: L1 data-consistency combined with the weighted sparsity, smoothness, and curvature penalties on the DVF (see Section 2).
- Optimization exploits only the data from the subject under examination; no pretraining or transfer learning is required.
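Putting these pieces together, a per-scan optimization loop might look like the sketch below, reusing `moco_inr_loss` and `active_levels` from the earlier sketches; the learning rate, frame-sampling strategy, and the `set_active_levels` hook on the networks are assumptions, and the coil maps are presumed to be precomputed with ESPIRiT.

```python
import torch

# f_theta, g_phi (nn.Modules), coords, smaps, masks, and kspace are assumed to be
# defined as in the earlier sketches; kspace: (T, C, H, W) complex, masks: (T, H, W).
optimizer = torch.optim.Adam(
    list(f_theta.parameters()) + list(g_phi.parameters()), lr=1e-3)  # lr is a placeholder

num_iters = 1200
num_frames = kspace.shape[0]
for it in range(num_iters):
    # Coarse-to-fine curriculum: enable more hash levels as training progresses.
    # set_active_levels is a hypothetical hook that could internally apply mask_fine_levels.
    n_active = active_levels(it, total_iters=num_iters)
    f_theta.set_active_levels(n_active)
    g_phi.set_active_levels(n_active)

    t = torch.randint(num_frames, (1,)).item()        # visit cardiac frames stochastically
    loss = moco_inr_loss(f_theta, g_phi, coords, t, smaps, masks[t], kspace[t])

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```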
5. Performance, Ablation, and Comparative Analysis
Quantitative and qualitative results on the OCMR cine dataset, using retrospective Cartesian VISTA (AF = 12×, 20×) and golden-angle radial (AF ≈ 26×, 69×) undersampling:
| Sampling / AF | MoCo-INR PSNR (dB) | MoCo-INR SSIM | Next-Best Method | Next-Best PSNR (dB) | Next-Best SSIM |
|---|---|---|---|---|---|
| VISTA 12× | 42.25 | 0.971 | ST-INR + L₀S | 41.35 | 0.972 |
| VISTA 20× | 39.53 | 0.957 | SOTA | ~36.6 | ~0.937 |
| Golden-angle 26× | 40.33 | 0.960 | — | — | — |
| Golden-angle 69× | 37.75 | 0.940 | — | — | — |
Ablation studies demonstrate:
- Replacing the CNN decoder with an MLP decreases PSNR by ≈3 dB and SSIM by ≈0.04; high-frequency ringing is introduced.
- Training all hash levels from the start (removing the curriculum) results in ≈2 dB PSNR loss and spurious DVF estimation in static regions.
- Omitting DVF regularization decreases PSNR by ≈3 dB and SSIM by ≈0.04, and produces implausible motion fields.
In prospective free-breathing clinical evaluation (VISTA AF = 9×, 65 frames, ∼26 Hz), MoCo-INR achieves diagnostic image quality, accurate wall-motion depiction, and stronger myocardium–blood-pool boundary definition than competing unsupervised approaches. Runtimes for cine reconstruction are ≈1.3 min (VISTA retro) and ≈5.5 min (golden-angle), superior to or comparable with non-INR and hybrid INR baselines (Tian et al., 14 Nov 2025).
6. Motion-Compensated Decomposition and Separation of Appearance/Dynamics
MoCo-INR explicitly enforces a separation between anatomy and motion:
- The canonical appearance field $f_\theta$ encodes time-invariant tissue texture and contrast.
- The DVF $g_\phi$ encodes dynamic motion as a continuous spatiotemporal mapping.
This decomposition prevents entanglement of motion and appearance in the network's latent space, which is a documented failure mode in single-network INR or generative approaches under extreme undersampling. The explicit structure also supports interpretable motion tracking and recoverable deformation fields suitable for downstream analysis or reporting.
7. Broader Impact, Limitations, and Prospects
MoCo-INR demonstrates that explicit factorization, hierarchical encoding, and CNN-based decoders are synergistic for dynamic MRI from highly limited measurements. The framework is fully unsupervised and scan-adaptive, supporting a wide range of sampling patterns (Cartesian, radial), acceleration factors (up to 69×), and clinical conditions (including free breathing).
Major limitations are the requirement of preestimated coil sensitivities and the restriction to continuous but deterministic deformation fields—potentially limiting applicability to highly nonrigid, stochastic, or interleaved motion. Extensions to 3D, probabilistic motion models, or alternative encoding strategies are plausible directions for future research. The curriculum adopted for hash-level unfreezing is likely beneficial for other inverse problems involving deformation and appearance disentanglement in medical imaging (Tian et al., 14 Nov 2025).