Fetal Heart View Classification in First Trimester
- First-trimester fetal heart view classification is the automated detection of standard cardiac planes (aorta, four-chamber, V sign, X sign, and others) in 11–14-week ultrasounds.
- The USF-MAE transformer, pretrained with ultrasound-specific self-supervision, and earlier CNN-based approaches improve diagnostic accuracy by operating on full frames rather than aggressively cropped ROIs.
- Challenges such as low-quality frames, temporal dynamics, inter-operator variability, and limited annotated data require innovative preprocessing and augmentation strategies for robust CHD screening.
First-trimester fetal heart view classification refers to the automated identification of standard diagnostic cardiac planes in ultrasound imaging of fetuses during the first trimester (gestational weeks 11–14). These views—such as the aorta, atrioventricular flows (four-chamber), V sign (three-vessel), X sign (right ventricular outflow tract), and non-diagnostic ("Other") frames—are critical for early detection of congenital heart disease (CHD), a leading cause of neonatal morbidity and mortality. Automated techniques must contend with small fetal cardiac structures, low signal-to-noise ratio, and considerable inter-operator variability, making this a technically challenging task.
1. Anatomical and Clinical Context
During the first trimester, the fetal heart is structurally complex and measures only a few millimeters across, resulting in uniquely demanding imaging conditions. Standard diagnostic planes include:
- Aorta: identification of the aortic outflow.
- Atrioventricular flows (Four-chamber view): displays both atria and ventricles.
- V sign (Three-vessel view): visualizes the pulmonary artery, aorta, and superior vena cava.
- X sign (Right ventricular outflow tract): detects the crossing pattern formed by great vessels.
- Other (Non-diagnostic): frames lacking a clear diagnostic view.
Precise recognition of these planes enables early, non-invasive screening for major cardiac anomalies. Automated classification aims to reduce operator dependence and the throughput limitations inherent in manual scanning and interpretation (Megahed et al., 30 Dec 2025).
2. Datasets and Reference Benchmarks
Public datasets have been constructed specifically for first-trimester fetal heart view classification. The most notable comprises 6,720 still frames extracted from routine 12–14-week cardiac sweeps, stratified to avoid patient-level leakage: 50% train, 25% validation, 25% test, with balance across all five view classes. Frames are preprocessed via fixed-border cropping to remove metadata overlays and normalized using ImageNet statistics to ensure compatibility with established network architectures. Unlike earlier pipelines, no aggressive region-of-interest (ROI) cropping or convex-hull isolation is applied: full frames (224×224 px) are retained to preserve essential global context, especially for “Other” frames (Megahed et al., 30 Dec 2025).
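A minimal sketch of such a pipeline is given below, assuming frame paths and patient identifiers as inputs; the border margin, image size, and split seed are illustrative stand-ins, not the benchmark's actual parameters.

```python
# Sketch of the patient-level stratified split and full-frame preprocessing
# described above. The border margin and random seed are illustrative
# assumptions, not values from the published pipeline.
import numpy as np
from PIL import Image
from sklearn.model_selection import GroupShuffleSplit

IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def preprocess_frame(path, border=32):
    """Fixed-border crop (drops metadata overlays), resize, normalize.
    Deliberately no ROI cropping, so global context is preserved."""
    img = Image.open(path).convert("RGB")
    w, h = img.size
    img = img.crop((border, border, w - border, h - border))
    img = img.resize((224, 224))
    x = np.asarray(img, dtype=np.float32) / 255.0
    return (x - IMAGENET_MEAN) / IMAGENET_STD

def patient_level_split(frames, patients, seed=0):
    """50/25/25 split grouped by patient, so no patient spans two subsets."""
    outer = GroupShuffleSplit(n_splits=1, train_size=0.5, random_state=seed)
    train_idx, rest_idx = next(outer.split(frames, groups=patients))
    rest_groups = [patients[i] for i in rest_idx]
    inner = GroupShuffleSplit(n_splits=1, train_size=0.5, random_state=seed)
    val_rel, test_rel = next(inner.split(rest_idx, groups=rest_groups))
    return train_idx, rest_idx[val_rel], rest_idx[test_rel]
```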
Evaluation protocols stress the need for both per-frame and per-patient analysis. Confusion matrices, precision, recall, F1 scores, and ROC/PR curves are routinely reported for multiclass view classification. Patient-level CHD detection is typically evaluated via ROC-AUC, with statistical comparison methods such as DeLong’s test and McNemar’s test (Tan et al., 2020).
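The per-frame portion of this protocol can be assembled from standard scikit-learn metrics; the sketch below assumes a five-class label vector `y_true` and a matrix of predicted class probabilities `y_prob` as placeholders.

```python
# Per-frame evaluation sketch: confusion matrix, per-class precision/recall/
# F1, and one-vs-rest ROC-AUC. `y_true` (N,) and `y_prob` (N, 5) stand in
# for the test labels and predicted class probabilities.
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.preprocessing import label_binarize

def evaluate_frames(y_true, y_prob, class_names):
    y_pred = np.argmax(y_prob, axis=1)
    print(confusion_matrix(y_true, y_pred))
    print(classification_report(y_true, y_pred, target_names=class_names))
    # One-vs-rest AUC per class, as reported for the five-view task
    y_bin = label_binarize(y_true, classes=list(range(len(class_names))))
    for k, name in enumerate(class_names):
        print(f"{name}: AUC = {roc_auc_score(y_bin[:, k], y_prob[:, k]):.3f}")
```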
3. Methodologies for Classification
Algorithms can be grouped by foundational approach:
A. Ultrasound-Specific Self-Supervised Transformers
The USF-MAE (Ultrasound-Specific Foundation Model—Masked Autoencoder) represents the state-of-the-art. It leverages masked-autoencoding pretraining on over 370,000 unlabelled frames from >40 anatomical regions, learning domain-optimized representations. Pretraining involves masking 25% of 16×16 pixel patches and reconstructing them from the 75% visible subset using a Vision Transformer (ViT) encoder and a transformer decoder, with a mean-squared error loss computed over the masked patches:

$$\mathcal{L}_{\mathrm{MAE}} = \frac{1}{|\mathcal{M}|} \sum_{i \in \mathcal{M}} \left\| \hat{x}_i - x_i \right\|_2^2,$$

where $\mathcal{M}$ is the set of masked patch indices and $\hat{x}_i$, $x_i$ denote the reconstructed and original pixels of patch $i$.
Only the pretrained ViT encoder is retained and subsequently fine-tuned for the fetal heart view classification task (5-way softmax). This model eliminates the need for handcrafted ROI segmentation and demonstrates robust handling of full-frame sonography (Megahed et al., 30 Dec 2025).
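A compact, hedged sketch of this MAE-style pretraining follows: mask a fraction of 16×16 patches, encode only the visible ones with a ViT-style encoder, and reconstruct the masked pixels under MSE. The tiny dimensions here are illustrative stand-ins for the USF-MAE backbones, not the published configuration.

```python
# Minimal masked-autoencoder sketch; dimensions are illustrative only.
import torch
import torch.nn as nn

class TinyMAE(nn.Module):
    def __init__(self, img=224, patch=16, dim=256, mask_ratio=0.25):
        super().__init__()
        self.n = (img // patch) ** 2  # number of patches
        self.p, self.mask_ratio = patch, mask_ratio
        self.embed = nn.Linear(patch * patch * 3, dim)
        self.pos = nn.Parameter(torch.zeros(1, self.n, dim))
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        enc = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers=4)
        dec = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerEncoder(dec, num_layers=2)
        self.head = nn.Linear(dim, patch * patch * 3)

    def patchify(self, x):  # (B, 3, H, W) -> (B, N, p*p*3)
        B = x.shape[0]
        x = x.unfold(2, self.p, self.p).unfold(3, self.p, self.p)
        return x.permute(0, 2, 3, 1, 4, 5).reshape(B, self.n, -1)

    def forward(self, imgs):
        patches = self.patchify(imgs)
        B, N, D = patches.shape
        n_mask = int(self.mask_ratio * N)
        perm = torch.rand(B, N, device=imgs.device).argsort(dim=1)
        masked, visible = perm[:, :n_mask], perm[:, n_mask:]
        tokens = self.embed(patches) + self.pos
        dim = tokens.size(-1)
        vis = torch.gather(tokens, 1, visible[..., None].expand(-1, -1, dim))
        enc = self.encoder(vis)  # encoder sees visible patches only
        # Re-insert encoded patches among mask tokens, then decode everything
        full = self.mask_token.expand(B, N, dim).clone()
        full = full.scatter(1, visible[..., None].expand(-1, -1, dim), enc)
        rec = self.head(self.decoder(full + self.pos))
        idx = masked[..., None].expand(-1, -1, D)
        # MSE on the masked patches only
        return ((torch.gather(rec, 1, idx) - torch.gather(patches, 1, idx)) ** 2).mean()

# loss = TinyMAE()(torch.randn(2, 3, 224, 224)); after pretraining, only the
# encoder would be kept and fine-tuned with a 5-way classification head.
```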
B. Convolutional Neural Networks with Auxiliary View Tasks
Earlier methods, such as those based on VGG-style backbones, split feature extraction and classification into shared and task-specific branches. A multitask loss of the weighted form

$$\mathcal{L} = \lambda\,\mathcal{L}_{\mathrm{CHD}} + (1 - \lambda)\,\mathcal{L}_{\mathrm{view}}$$

jointly optimizes binary CHD detection and three-way view classification. The weighting $\lambda$ is dynamically scaled so that the primary (CHD) task is prioritized on difficult examples (Tan et al., 2020). Robustness is further enhanced by adversarial-style perturbations along view-loss gradients, combined with frame-level confidence filtering.
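A hedged sketch of such a shared-backbone multitask head follows; the head sizes and the dynamic weighting rule are illustrative assumptions, not the published formulation from Tan et al.

```python
# Shared-backbone multitask head sketch: a binary CHD logit and three-way
# view logits over shared features, combined with a dynamic weight. The
# weighting rule below is an illustrative assumption.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskHead(nn.Module):
    def __init__(self, feat_dim=512):
        super().__init__()
        self.chd = nn.Linear(feat_dim, 1)   # binary CHD detection branch
        self.view = nn.Linear(feat_dim, 3)  # three-way view branch

    def forward(self, feats):
        return self.chd(feats).squeeze(-1), self.view(feats)

def multitask_loss(chd_logit, view_logits, chd_y, view_y):
    l_chd = F.binary_cross_entropy_with_logits(chd_logit, chd_y.float())
    l_view = F.cross_entropy(view_logits, view_y)
    # Illustrative dynamic weighting: up-weight the CHD term when it is
    # large (a "difficult example"), mirroring the prioritization above.
    lam = torch.sigmoid(l_chd.detach())
    return lam * l_chd + (1.0 - lam) * l_view
```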
C. Traditional Shape-Based Approaches
Zernike moment-based segmentation exploits rotation-invariant orthogonal basis functions to extract geometric descriptors from color Doppler ultrasound. Each frame’s region of interest (ROI) is described by a vector of geometric features and the magnitudes of the first 25 Zernike moments, enabling nearest-neighbor classification into V sign, X sign, or "parallel" red flow views (Stoean et al., 2019). This approach is limited by small hand-picked training sets and lacks temporal context, but remains interpretable and computationally lightweight.
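A minimal sketch of this pipeline is shown below, using mahotas (whose `zernike_moments` at `degree=8` returns exactly 25 magnitudes) with a scikit-learn nearest-neighbor classifier; the ROI extraction and radius choice are simplified assumptions.

```python
# Shape-based pipeline sketch: Otsu-threshold the ROI, describe it with the
# first 25 Zernike moment magnitudes, classify with a nearest-neighbor rule.
import mahotas
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def zernike_descriptor(gray_roi):
    """uint8 grayscale ROI -> 25-dim rotation-invariant Zernike descriptor."""
    mask = gray_roi > mahotas.otsu(gray_roi.astype(np.uint8))
    radius = max(mask.shape) // 2  # simplified radius choice
    return mahotas.features.zernike_moments(mask.astype(np.float64), radius, degree=8)

# Nearest-neighbor classification into V sign / X sign / "parallel" flow:
# knn = KNeighborsClassifier(n_neighbors=1)
# knn.fit(X_train, y_train)               # X_train: (N, 25) descriptors
# knn.predict([zernike_descriptor(roi)])  # roi: a 2-D grayscale ROI array
```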
D. Spatiotemporal Deep Networks
Temporal HeartNet combines VGG-16 backbone feature extraction with per-patch sliding-window heads for classification, localization, and orientation. Bi-directional LSTMs model temporal continuity, conditioning segment-level predictions on video context. Localization is optimized using either circular anchors or a direct IoU loss of the standard form

$$\mathcal{L}_{\mathrm{loc}} = 1 - \mathrm{IoU}\bigl(\hat{B}, B\bigr),$$

where $\hat{B}$ and $B$ denote the predicted and ground-truth heart regions. The multi-task objective combines classification, localization, and orientation terms:

$$\mathcal{L} = \mathcal{L}_{\mathrm{cls}} + \lambda_{\mathrm{loc}}\,\mathcal{L}_{\mathrm{loc}} + \lambda_{\mathrm{ori}}\,\mathcal{L}_{\mathrm{ori}}.$$

Temporal modeling reduces error rates by leveraging the persistence of anatomical structures across frames (Huang et al., 2017).
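A reduced sketch of the CNN-plus-BiLSTM pattern follows; it keeps only the view-classification head, treats the hidden sizes as assumptions, and omits the localization and orientation branches of the original.

```python
# CNN + bidirectional-LSTM sketch: per-frame VGG-16 features are conditioned
# on video context before per-frame view classification. Hidden sizes are
# illustrative assumptions.
import torch
import torch.nn as nn
from torchvision.models import vgg16

class TemporalViewNet(nn.Module):
    def __init__(self, n_views=5, hidden=256):
        super().__init__()
        self.backbone = vgg16(weights=None).features  # -> (B*T, 512, 7, 7)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.lstm = nn.LSTM(512, hidden, batch_first=True, bidirectional=True)
        self.cls = nn.Linear(2 * hidden, n_views)

    def forward(self, clip):  # clip: (B, T, 3, 224, 224)
        B, T = clip.shape[:2]
        feats = self.pool(self.backbone(clip.flatten(0, 1))).flatten(1)
        seq, _ = self.lstm(feats.view(B, T, -1))  # context from both directions
        return self.cls(seq)  # per-frame view logits: (B, T, n_views)
```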
4. Quantitative Performance and Comparative Results
Recent transformer-based methods, particularly the USF-MAE, establish new performance benchmarks on the first-trimester fetal heart view classification task. On a stratified independent test set, USF-MAE achieved:
| Model | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|
| USF-MAE | 90.57% | 91.15% | 90.57% | 90.71% |
| ResNet-18 | 88.54% | 89.41% | 88.54% | 88.73% |
| ResNet-50 | 86.95% | — | — | 87.26% |
| ViT-B/16 | 87.16% | 87.16% | 87.16% | 87.16% |
USF-MAE improved accuracy by 2.03 percentage points and F1-score by 1.98 points relative to ResNet-18, the strongest supervised baseline. One-vs-rest AUCs range from 0.954 to 0.999, and average precision for the smallest class (X sign) remains strong at 0.779, indicating resilience to class imbalance. Confusion-matrix patterns show that most residual errors occur between diagnostically adjacent or anatomically ambiguous views (Megahed et al., 30 Dec 2025).
Traditional Zernike moment systems, though evaluated qualitatively, show ≈90% recall for the “V” sign and ≈85% recall for the “X” sign, with specificity for “other” ≈80–85% (Stoean et al., 2019). Temporal HeartNet’s region-level temporal modeling reduces overall error in view classification to 16.1%, corresponding to ≈83.9% accuracy and equaling or surpassing human inter-rater error (Huang et al., 2017).
5. Data Preprocessing and Augmentation Strategies
Data curation and preprocessing are adapted for the unique characteristics of first-trimester heart imaging. Notably:
- Preprocessing: Fixed-border cropping to eliminate metadata overlays; intensity normalization to enable compatibility with both CNNs and transformers; no ROI-based cropping to preserve global anatomical context.
- Augmentation: In USF-MAE, spatial (rotation 0–90°, flips, random resized cropping) but not intensity-based augmentations are applied, preventing artificial distortion of ultrasound textures (Megahed et al., 30 Dec 2025). For CNN multitask approaches, smaller rotations (±15°), horizontal flips, minor cropping, and mild Gaussian intensity jitter are selected to maintain geometry of tiny structures (Tan et al., 2020). In traditional approaches, Doppler channel separation, Otsu thresholding, and geometric shape normalization are employed (Stoean et al., 2019).
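Both policies can be approximated with torchvision transforms, as in the sketch below; the exact probabilities, crop scales, and jitter magnitude are illustrative assumptions rather than the published settings.

```python
# Sketch of the two augmentation policies described above; parameter values
# beyond the stated rotation ranges are illustrative assumptions.
import torch
from torchvision import transforms

# USF-MAE-style: spatial-only augmentation, no intensity distortion
usf_mae_train = transforms.Compose([
    transforms.RandomRotation(degrees=(0, 90)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

# CNN multitask-style: gentler geometry plus mild Gaussian intensity jitter
cnn_train = transforms.Compose([
    transforms.RandomRotation(degrees=15),
    transforms.RandomHorizontalFlip(),
    transforms.RandomResizedCrop(224, scale=(0.9, 1.0)),
    transforms.ToTensor(),
    transforms.Lambda(lambda x: (x + 0.01 * torch.randn_like(x)).clamp(0, 1)),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])
```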
6. Current Limitations and Prospective Improvements
Contemporary classification approaches still face several limitations:
- Temporal Dynamics: Frame-based models (including USF-MAE as implemented) do not exploit the temporal coherence present in cine sweeps, in contrast to LSTM-augmented architectures (Megahed et al., 30 Dec 2025; Huang et al., 2017). Incorporating temporal information may address inter-frame ambiguity and further boost accuracy.
- Domain Robustness: While datasets are drawn from varied institutions and equipment, cross-vendor generalizability has not been explicitly quantified.
- Low-Quality Frame Handling: Errors preferentially occur in noisy, non-diagnostic, or borderline frames, especially under class imbalance. However, full-frame models demonstrate improved discrimination of such cases compared to methods relying on aggressive ROI cropping.
- Dataset Scale and Annotation: Label scarcity remains a limitation, especially for supervised approaches and rare anatomical variants. Self-supervised pretraining and multitask learning alleviate this but do not obviate the need for representative labelled data.
Enhancements under investigation include adaptive cropping, uncertainty-aware model selection for real-time deployment, domain adaptation for unseen hardware, and the integration of spatiotemporal dynamics to move beyond per-frame inference (Megahed et al., 30 Dec 2025; Huang et al., 2017).
7. Connections to Broader Research and Clinical Impact
Ultrasound-specific self-supervised learning, as operationalized in USF-MAE, demonstrates substantial transfer gains not just for heart view classification but also for related congenital anomaly screening (e.g., cystic hygroma, renal and brain anomalies) (Megahed et al., 30 Dec 2025). Multitask learning and robust inference-filtering strategies have also been leveraged to stabilize learning and decision-making in low-signal settings (Tan et al., 2020). Shape-moment methods offer interpretable feature spaces and low computational cost but are generally superseded in accuracy by deep learning architectures as annotated corpora grow.
The precise, reliable classification of first-trimester heart views at scale is anticipated to augment sonographer workflow and expand early CHD screening access by reducing reliance on high-level sonographic expertise (Megahed et al., 30 Dec 2025). A plausible implication is that these advances will facilitate real-time, full-frame fetal cardiac assessment and enable more equitable population-level early congenital heart disease detection.