RegDeepLab: Dual-Task IVF Embryo Grading
- The paper presents a dual-branch framework that integrates semantic segmentation (DeepLabV3+) and multi-scale regression with a modified ResNet-50, achieving state-of-the-art performance (Dice=0.729, MAE=0.046).
- It employs a novel two-stage decoupled training paradigm that mitigates gradient conflict and negative transfer, ensuring both detailed pixel-level segmentation and robust grading.
- Feature Injection bridges the segmentation and regression branches by transferring latent vectors, enhancing clinical explainability and quantitative grading precision.
RegDeepLab is a dual-branch multi-task learning framework for interpretable embryo fragmentation grading in in vitro fertilization (IVF) decision support. It addresses limitations of prior fully automated regression and segmentation methods by integrating state-of-the-art semantic segmentation (DeepLabV3+) and a multi-scale regression head into a unified architecture that preserves both clinical explainability and quantitative grading precision. Its novel two-stage decoupled training regimen resolves the gradient conflict and negative transfer frequently encountered in multi-task settings, achieving state-of-the-art segmentation accuracy (Dice=0.729) while providing low mean absolute error (MAE=0.046) in grading (Lee, 23 Nov 2025).
1. Network Architecture
RegDeepLab is implemented atop a modified ResNet-50 backbone utilizing dilated convolutions to yield an output stride of 16. The shared feature extractor outputs are processed by two task-specific heads:
- Segmentation Branch: Leverages DeepLabV3+ with attention gating mechanisms. Two distinctive feature maps are utilized:
- $F_{\text{deep}}$ represents the deepest backbone features, processed by Atrous Spatial Pyramid Pooling (ASPP) for multi-scale context.
- $F_{\text{low}}$ supplies low-level boundaries after an attention gate suppresses cytoplasmic noise.
- The decoded feature map is upsampled and concatenated with a global regression vector ($z_{\text{reg}}$) broadcast spatially (a process termed “Feature Injection”). This produces a fused feature of 2304 channels, which is passed through a final convolution and a sigmoid to yield the mask $\hat{M}$.
- Regression Branch: Concatenates intermediate and deep features ($F_{\text{mid}}$ and $F_{\text{deep}}$), followed by global average pooling and a multi-layer perceptron (MLP) to predict the continuous fragmentation ratio $\hat{y}$. The latent vector extracted prior to the output MLP ($z_{\text{reg}}$) is used for Feature Injection into the segmentation decoder.
The overall feature flow is defined as:

$$\hat{y} = \mathrm{MLP}\big(\mathrm{GAP}([F_{\text{mid}};\, F_{\text{deep}}])\big), \qquad \hat{M} = \sigma\Big(\mathrm{Conv}\big(\big[\,\mathrm{Up}\big(\mathrm{Dec}\big(\mathrm{ASPP}(F_{\text{deep}}),\, \mathrm{AG}(F_{\text{low}})\big)\big);\ \mathrm{Broadcast}(z_{\text{reg}})\,\big]\big)\Big)$$
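A minimal PyTorch sketch of the Feature Injection fusion, assuming a 2048-channel decoded map and a 256-dimensional regression latent so that the concatenation yields the 2304 channels stated above; the class and argument names are illustrative, not from the paper:

```python
import torch
import torch.nn as nn

class FeatureInjection(nn.Module):
    """Fuse the global regression latent into the segmentation decoder (sketch).

    Assumes a 2048-channel decoded feature map and a 256-dim latent z_reg,
    giving the 2304-channel fused tensor described in the text.
    """
    def __init__(self, dec_channels: int = 2048, z_dim: int = 256):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(dec_channels + z_dim, 1, kernel_size=1),  # final conv
            nn.Sigmoid(),                                       # mask in [0, 1]
        )

    def forward(self, dec_feat: torch.Tensor, z_reg: torch.Tensor) -> torch.Tensor:
        b, _, h, w = dec_feat.shape
        # Broadcast the global regression vector over the spatial grid.
        z_map = z_reg.view(b, -1, 1, 1).expand(-1, -1, h, w)
        fused = torch.cat([dec_feat, z_map], dim=1)  # (B, 2304, H, W)
        return self.head(fused)                      # predicted mask M-hat
```

For the output-stride-16 backbone, torchvision's `resnet50(replace_stride_with_dilation=[False, False, True])` substitutes dilation for the last stage's stride, matching the modification described above.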
2. Two-Stage Decoupled Training Paradigm
Standard multi-task learning in this domain is impaired by "Gradient Conflict" (where segmentation seeks high-frequency boundary detail and regression optimizes toward low-frequency holistic abstraction) and "Negative Transfer" in the backbone. RegDeepLab mitigates these issues by a temporally separated two-stage procedure:
- Stage 1: Visual Expert Pre-training
- The backbone and segmentation head are optimized jointly using the segmentation and area-consistency losses ($\mathcal{L}_{\text{seg}} + \mathcal{L}_{\text{cons}}$), with the regression head disabled ($\lambda_{\text{reg}} = 0$).
- This yields state-of-the-art segmentation (Dice=0.729), ensuring precise pixel-level mask quality.
- Stage 2: Regression-Guided Finetuning
- The trained backbone is frozen; only the regression head and the last segmentation convolutional layer (after Feature Injection) are updated.
- The objective combines the precise regression loss with range constraints and (optionally) the consistency term ($\mathcal{L}_{\text{reg}}$, $\mathcal{L}_{\text{cons}}$). Disabling regression-to-backbone gradients preserves segmentation integrity (see the parameter-freezing sketch after this list).
- This results in robust grading (MAE=0.049) while not sacrificing segmentation quality (Dice=0.729) (Lee, 23 Nov 2025).
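A sketch of the stage-wise parameter freezing, assuming the model exposes `backbone`, `seg_head`, `reg_head`, and a post-injection `final_conv` as submodules (names are hypothetical):

```python
import torch

def configure_stage(model: torch.nn.Module, stage: int, lr: float = 1e-4):
    """Set up the optimizer for one stage of the decoupled schedule (sketch)."""
    if stage == 1:
        # Stage 1: jointly train backbone + segmentation head; the regression
        # head is disabled (its loss weight lambda_reg is 0).
        for p in model.parameters():
            p.requires_grad = True
        for p in model.reg_head.parameters():
            p.requires_grad = False
    else:
        # Stage 2: freeze the visual expert; update only the regression head
        # and the last segmentation conv placed after Feature Injection.
        for p in model.parameters():
            p.requires_grad = False
        for p in model.reg_head.parameters():
            p.requires_grad = True
        for p in model.final_conv.parameters():
            p.requires_grad = True
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.Adam(trainable, lr=lr)
```

Because Stage 2 never propagates regression gradients into the backbone, the Stage 1 segmentation quality is preserved by construction rather than by loss balancing.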
3. Loss Functions
The composite multi-task loss is formalized as:

$$\mathcal{L} = \mathcal{L}_{\text{seg}} + \lambda_{\text{reg}}\,\mathcal{L}_{\text{reg}} + \lambda_{\text{cons}}\,\mathcal{L}_{\text{cons}}$$
- Segmentation Loss combines pixel-wise binary cross-entropy, Dice coefficient, and Focal loss, weighted for class imbalance:

$$\mathcal{L}_{\text{seg}} = w_1\,\mathcal{L}_{\text{BCE}} + w_2\,\mathcal{L}_{\text{Dice}} + w_3\,\mathcal{L}_{\text{Focal}}$$

Where, for $N$ pixels with ground truth $y_i$ and prediction $p_i$ (and $p_{t,i}$ the predicted probability of the true class):

$$\mathcal{L}_{\text{BCE}} = -\frac{1}{N}\sum_{i=1}^{N}\big[y_i \log p_i + (1-y_i)\log(1-p_i)\big]$$

$$\mathcal{L}_{\text{Dice}} = 1 - \frac{2\sum_i y_i p_i + \epsilon}{\sum_i y_i + \sum_i p_i + \epsilon}$$

$$\mathcal{L}_{\text{Focal}} = -\frac{1}{N}\sum_{i=1}^{N}(1 - p_{t,i})^{\gamma}\log p_{t,i}$$
- Regression Loss is the sum of a precise regression loss on fully labeled samples and a range-based loss for weakly labeled, grade-only data:

$$\mathcal{L}_{\text{reg}} = \mathcal{L}_{\text{precise}} + \mathcal{L}_{\text{range}}, \qquad \mathcal{L}_{\text{range}} = \max(0,\, y_{\min} - \hat{y}) + \max(0,\, \hat{y} - y_{\max})$$

where $[y_{\min}, y_{\max}]$ is the fragmentation-ratio interval implied by the assigned grade.
- Consistency Loss enforces agreement between the predicted mask area and the target ratio:

$$\mathcal{L}_{\text{cons}} = \Big|\frac{1}{N}\sum_{i=1}^{N}\hat{M}_i - y\Big|$$
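A sketch of the three loss families as reconstructed above; the Focal exponent $\gamma$, the smoothing term $\epsilon$, and the per-term weights are placeholders not specified in the summary:

```python
import torch
import torch.nn.functional as F

def seg_loss(p, y, gamma: float = 2.0, eps: float = 1e-6):
    """BCE + Dice + Focal on predicted mask p and binary target y (sketch)."""
    bce = F.binary_cross_entropy(p, y)
    dice = 1 - (2 * (p * y).sum() + eps) / (p.sum() + y.sum() + eps)
    pt = torch.where(y > 0.5, p, 1 - p)             # prob. of the true class
    focal = (-((1 - pt) ** gamma) * pt.clamp_min(eps).log()).mean()
    return bce + dice + focal                       # per-term weights omitted

def range_loss(y_hat, lo, hi):
    """Hinge penalty keeping y_hat inside the grade interval [lo, hi] (sketch)."""
    return (F.relu(lo - y_hat) + F.relu(y_hat - hi)).mean()

def consistency_loss(mask, y):
    """Match the mean predicted mask area to the target fragmentation ratio."""
    area = mask.mean(dim=(1, 2, 3))                 # foreground fraction per image
    return (area - y).abs().mean()
```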
4. Empirical Performance and Ablation
The dataset consists of 318 fully-annotated (pixel+grade) and 1549 grading-only images. Performance is measured by Dice coefficient for segmentation and MAE for grading. The results, as presented in (Lee, 23 Nov 2025), are:
| Experiment | Dice | MAE |
|---|---|---|
| Stage 1 only ($\lambda_{\text{reg}} = 0$) | 0.729 | — |
| Pure regression single-task | — | 0.051 |
| End-to-end MTL w/ Feature Injection | 0.716 | 0.046 |
| End-to-end MTL w/o Injection | 0.678 | 0.053 |
| Two-Stage Decoupled (frozen backbone) | 0.729 | 0.049 |
Ablation studies indicate that “Feature Injection” substantially improves joint performance by enabling mutual information transfer: removing it degrades Dice from 0.716 to 0.678 and increases MAE from 0.046 to 0.053, demonstrating that naive loss summation is insufficient to resolve gradient conflict. End-to-end MTL achieves the lowest grading error (MAE=0.046) but at the cost of segmentation boundary fidelity, while the decoupled strategy retains peak segmentation accuracy (Dice=0.729) at a minor cost in MAE (0.049). Adopting the Range Loss for semi-supervised inclusion of weakly labeled data further improves MAE by approximately 0.002–0.003.
5. Clinical Interpretability and Deployment
RegDeepLab yields dual outputs for each embryo image: a pixel-level fragmentation mask and a continuous fragmentation ratio. The segmentation mask supports visual verification by embryologists, confirming that the model focuses on cytoplasmic fragments rather than artifactual noise. This offers an interpretable bridge between automated grading and clinical practice.
Deployment leverages a dual-module system:
- Module A (Visual Assistant): Uses the Stage 1 model to maximize mask fidelity for expert review (Dice=0.729).
- Module B (Automated Quantification): Employs the full MTL model for lowest error in fragmentation grading (MAE=0.046).
This configuration reduces inter-observer variability, expedites embryo selection, and retains clinical explainability by providing both visual and quantitative outputs.
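A minimal sketch of the dual-module routing, assuming `stage1_model` returns a mask and `mtl_model` returns a (mask, ratio) pair; both names are hypothetical:

```python
import torch

@torch.no_grad()
def grade_embryo(image: torch.Tensor, stage1_model, mtl_model):
    """Route one image batch through both deployment modules (sketch)."""
    stage1_model.eval()
    mtl_model.eval()
    mask = stage1_model(image)     # Module A: high-fidelity mask for expert review
    _, ratio = mtl_model(image)    # Module B: lowest-MAE fragmentation ratio
    return mask, ratio
```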
6. Limitations and Prospective Directions
The RegDeepLab framework currently operates on static, single time-point images of cleavage-stage embryos. Future extensions to time-lapse or multi-modal data may capture dynamic fragmentation phenomena and further improve predictive reliability, suggesting the approach may generalize to broader imaging-based clinical grading tasks that require interpretable multi-task learning (Lee, 23 Nov 2025).