
CAR: Conditional Joint Annotation Regularization

Updated 26 December 2025
  • The paper introduces CAR, a method that leverages incomplete annotations to achieve robust segmentation in chest X-rays with competitive DSC scores.
  • CAR employs latent-space regularization via an autoencoder, enforcing anatomical consistency by integrating segmentation, distribution, and reconstruction losses.
  • The framework integrates with a 2D U-Net backbone and shows improved scalability and accuracy on synthetic and real medical imaging datasets.

Conditional Joint Annotation Regularization (CAR) is a learning strategy designed to enable robust multi-organ segmentation in medical imaging scenarios where ground-truth annotations are sparse, partially available, or noisy. Originally introduced within the AnyCXR framework for chest X-ray (CXR) analysis, CAR addresses the challenge of leveraging large-scale synthetic datasets exhibiting incomplete or imperfect labels, enforcing anatomical consistency through latent space regularization. The approach enables end-to-end training without discarding partially labeled samples, maintaining accuracy and anatomical integrity even under diverse image acquisition conditions (Zifei et al., 19 Dec 2025).

1. Problem Formulation and Motivation

CAR addresses the segmentation of $C$ anatomical classes in chest X-rays $X \in \mathbb{R}^{H \times W}$, where model predictions $\hat{Y} = f_\phi(X)$ are $C$-channel probability maps. The key challenge is that for each image, only a subset of classes is annotated. This partial annotation is tracked by a binary availability mask $M \in \{0,1\}^{B \times C}$ (for batch size $B$), such that $M^{(b,c)} = 1$ if class $c$ is labeled in sample $b$. Labels $Y_{gt}$ may arise from automated CT-based algorithms and are often incomplete or imprecise. CAR is designed to exploit all available labels (even if partial) by operating on tuples $\{X, \hat{Y}, M, Y_{gt}\}$ without assuming label completeness.

This formulation is instrumental for large-scale synthetic pipelines (e.g., generation of 126,000 DRRs from 2,982 CT volumes), in which the cost or feasibility of exhaustive manual annotation is prohibitive. CAR thus enables scalable anatomy-aware learning from imperfectly labeled synthetic data (Zifei et al., 19 Dec 2025).
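
To make the partial-annotation setup concrete, the following is a minimal PyTorch sketch of packing one image and its available masks into the tuple $(X, Y_{gt}, M)$. The class indices, shapes, and the `labels` dictionary layout are illustrative assumptions, not the AnyCXR data format.

```python
# Hypothetical sketch of packing one partially labeled DRR into (X, Y_gt, M).
# Class indices, shapes, and the `labels` dict layout are illustrative, not
# the AnyCXR data format.
import torch

NUM_CLASSES = 54   # anatomical classes C
H = W = 512        # image resolution

def build_sample(image: torch.Tensor, labels: dict[int, torch.Tensor]):
    """Return (X, Y_gt, M) where M^{(c)} = 1 only for annotated classes."""
    y_gt = torch.zeros(NUM_CLASSES, H, W)   # unlabeled channels stay zero
    m = torch.zeros(NUM_CLASSES)            # availability mask
    for cls_idx, mask in labels.items():    # only annotated classes appear
        y_gt[cls_idx] = mask
        m[cls_idx] = 1.0
    return image, y_gt, m

# Example: a DRR where only two of the 54 classes carry masks.
x = torch.rand(1, H, W)
x, y_gt, m = build_sample(x, {0: torch.ones(H, W), 7: torch.ones(H, W)})
```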

2. CAR Objective: Latent Consistency and Supervision with Incomplete Labels

The CAR objective function integrates several loss terms, each serving a particular facet of the learning problem:

  • Segmentation Loss ($L_{Seg}$): Computes the mean Dice similarity coefficient (DSC) over only the available labels, as indicated by $M$:

$$L_{Seg} = \frac{1}{\sum_{b,c} M^{(b,c)}} \sum_{b=1}^{B} \sum_{c=1}^{C} M^{(b,c)} \cdot \operatorname{Dice}\bigl(\hat{Y}^{(b,c)}, Y_{gt}^{(b,c)}\bigr)$$

  • Reliable Target Construction: For each class/sample pair, CAR constructs a target $Y^{(b,c)}$ defined as:

$$Y^{(b,c)} = \begin{cases} Y_{gt}^{(b,c)} & \text{if } M^{(b,c)} = 1 \\ \hat{Y}^{(b,c)}\,\big|_{\text{no-grad}} & \text{otherwise} \end{cases}$$

This formulation ensures that gradients are not propagated through unobserved labels, avoiding spurious learning signals.

  • Latent-space Distribution Loss ($L_{Dist}$): An encoder $G_{\theta_1}$ maps $[X;P]$ (where $P$ is either a probability map or label mask) to a latent embedding $z \in \mathbb{R}^d$. The loss enforces that the predicted segmentation and the reliable target produce similar latent codes via cosine distance:

$$L_{Dist} = \frac{1}{B} \sum_{b=1}^{B} \left[ 1 - \frac{z_{\hat{Y}}^{(b)} \cdot z_{Y}^{(b)}}{\|z_{\hat{Y}}^{(b)}\| \, \|z_{Y}^{(b)}\|} \right]$$

  • Reconstruction Loss ($L_{Recon}$): A decoder $G_{\theta_2}$ attempts to reconstruct the probability maps from their latent embeddings, using MSE with gradients detached from the targets:

$$L_{Recon} = \operatorname{MSE}\bigl(G_{\theta_2}(z_{\hat{Y}}),\ \hat{Y}\big|_{\text{no-grad}}\bigr) + \operatorname{MSE}\bigl(G_{\theta_2}(z_{Y}),\ Y\big|_{\text{no-grad}}\bigr)$$

  • CAR-Regularized Objective: The overall objective integrates these components, with recommended hyperparameters $\lambda_{Dist} = 4$, $\lambda_{Recon} = 2$:

$$L_{Total} = L_{Seg} + \lambda_{Dist} L_{Dist} + \lambda_{Recon} L_{Recon}$$

The CAR objective enforces not only pixelwise agreement on available labels, but also consistency in latent anatomical space, regularizing the model towards plausible multi-organ configurations even in the absence of ground-truth supervision.
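
To illustrate how these terms interact, below is a minimal PyTorch sketch of the CAR objective under the stated hyperparameters ($\lambda_{Dist} = 4$, $\lambda_{Recon} = 2$). The `encoder` and `decoder` modules and the soft-Dice formulation are assumptions; in particular, the segmentation term is written as $1 - \text{Dice}$ so that lower is better, whereas the text above reports the Dice term itself.

```python
# Minimal sketch of the CAR objective (assumptions noted in the text above).
import torch
import torch.nn.functional as F

def masked_dice(y_pred, y_gt, m, eps=1e-6):
    """Mean soft Dice over the labeled (b, c) pairs selected by M."""
    inter = (y_pred * y_gt).sum(dim=(-2, -1))
    union = y_pred.sum(dim=(-2, -1)) + y_gt.sum(dim=(-2, -1))
    dice = (2 * inter + eps) / (union + eps)          # shape (B, C)
    return (m * dice).sum() / m.sum().clamp(min=1.0)

def car_losses(x, y_pred, y_gt, m, encoder, decoder,
               lam_dist=4.0, lam_recon=2.0):
    """L_Total = L_Seg + lambda_Dist * L_Dist + lambda_Recon * L_Recon."""
    # Reliable target: ground truth where available, detached prediction otherwise.
    m_pix = m[..., None, None]                        # broadcast M to (B, C, 1, 1)
    y_rel = m_pix * y_gt + (1 - m_pix) * y_pred.detach()

    # Segmentation term on labeled channels (written as 1 - Dice so lower is better).
    l_seg = 1.0 - masked_dice(y_pred, y_gt, m)

    # Latent distribution loss: cosine distance between the embeddings of the
    # prediction and the reliable target, both conditioned on the image X.
    # `encoder` is assumed to concatenate [X; P] channel-wise internally.
    z_pred, z_rel = encoder(x, y_pred), encoder(x, y_rel)
    l_dist = (1.0 - F.cosine_similarity(z_pred.flatten(1),
                                        z_rel.flatten(1), dim=1)).mean()

    # Reconstruction loss with gradients detached from the reconstruction targets.
    l_recon = (F.mse_loss(decoder(z_pred), y_pred.detach()) +
               F.mse_loss(decoder(z_rel), y_rel.detach()))

    return l_seg + lam_dist * l_dist + lam_recon * l_recon
```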

3. Model Architecture and Integration

CAR is architecturally modular and can be integrated atop any segmentation backbone. For AnyCXR, the key design elements are as follows:

  • Segmentation Backbone ($f_\phi$): 2D U-Net with ResNet-50 encoder pretrained on ImageNet, with the first convolution modified for single-channel CXRs.
  • CAR Autoencoder ($G$):
    • Encoder ($G_{\theta_1}$): Channel-wise concatenates $X$ and the probability map $P$, applies two convolutional layers (Conv3×3 → BN → ReLU), max pooling, and outputs $z$.
    • Decoder ($G_{\theta_2}$): Two transposed convolutions (4×4 kernel, stride 2, padding 1), BN and ReLU activations, followed by a 1×1 convolution and sigmoid activation to produce a reconstructed probability map.

This architectural configuration ensures that the latent embedding $z$ encodes both the image appearance and multi-organ probability structure. Conditioning via concatenation $[X;P]$ is central to enabling this joint representation.
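
A hedged sketch of such an autoencoder is given below. Channel widths, latent size, and downsampling depth are assumptions chosen so that encoder and decoder shapes round-trip (the encoder here pools after each conv block to match the decoder's two stride-2 upsamplings); the paper's exact dimensions are not reproduced.

```python
# Sketch of the CAR autoencoder described above; widths and pooling depth are
# assumptions chosen so that encoder and decoder shapes round-trip.
import torch
import torch.nn as nn

class CAREncoder(nn.Module):
    """G_theta1: concatenate [X; P], two Conv3x3-BN-ReLU blocks, pooling."""
    def __init__(self, num_classes=54, width=64):
        super().__init__()
        in_ch = 1 + num_classes                       # single-channel CXR + C-channel P
        self.blocks = nn.Sequential(
            nn.Conv2d(in_ch, width, 3, padding=1), nn.BatchNorm2d(width), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                          # pooling per block (assumption, see text)
            nn.Conv2d(width, width, 3, padding=1), nn.BatchNorm2d(width), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
        )

    def forward(self, x, p):
        return self.blocks(torch.cat([x, p], dim=1))  # conditional latent embedding z

class CARDecoder(nn.Module):
    """G_theta2: two 4x4 stride-2 transposed convs, then 1x1 conv + sigmoid."""
    def __init__(self, num_classes=54, width=64):
        super().__init__()
        self.up = nn.Sequential(
            nn.ConvTranspose2d(width, width, 4, stride=2, padding=1),
            nn.BatchNorm2d(width), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(width, width, 4, stride=2, padding=1),
            nn.BatchNorm2d(width), nn.ReLU(inplace=True),
            nn.Conv2d(width, num_classes, 1), nn.Sigmoid(),
        )

    def forward(self, z):
        return self.up(z)                             # reconstructed probability map

# Shape check: the reconstructed probability map matches the input P.
enc, dec = CAREncoder(), CARDecoder()
x, p = torch.rand(2, 1, 256, 256), torch.rand(2, 54, 256, 256)
assert dec(enc(x, p)).shape == p.shape
```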

4. Training Regimen and Implementation Considerations

The CAR-enhanced training procedure utilizes large-scale, partially labeled synthetic data and applies several practical optimizations:

  • Data Pipeline: 2,982 CT volumes are used to generate approximately 126,000 DRRs across nine standard CXR views. All segmentation masks align with their DRRs via deterministic multi-stage domain randomization (MSDR) processes, ensuring consistent spatial correspondences.
  • Optimization: Adam optimizer, initial learning rate $1 \times 10^{-4}$, halved at epochs 50, 80, and 128, with effective batch size 24 (using gradient accumulation ×4), and global gradient norm clipping to 1.0. Mixed-precision training is employed for computational stability.
  • Two-stage Training: Initial 200-epoch warm-up with only the segmentation loss $L_{Seg}$, followed by 200 epochs of joint training using the full $L_{Total}$. Five-fold cross-validation is used with a fixed random seed (42). Importantly, there are no samples with complete annotations: each DRR typically contains only a subset of the 54 classes, and CAR exploits all available supervision via the mask $M$.
  • Stability and Regularization: Warm-up, gradient norm clipping, and mixed-precision are critical for robust convergence, given the scale and partial nature of the annotation.
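
The schedule above can be sketched as follows, assuming the backbone `f_phi`, the `encoder`/`decoder` modules, and the `car_losses`/`masked_dice` helpers from the earlier sketches, plus a dataloader `loader` yielding $(X, Y_{gt}, M)$ batches. Epoch counts, milestones, and the accumulation factor follow the text; everything else is illustrative.

```python
# Hedged sketch of the two-stage training schedule (assumptions noted above).
import torch
from torch.cuda.amp import autocast, GradScaler

WARMUP_EPOCHS, JOINT_EPOCHS, ACCUM = 200, 200, 4
params = (list(f_phi.parameters()) + list(encoder.parameters())
          + list(decoder.parameters()))

optimizer = torch.optim.Adam(params, lr=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[50, 80, 128], gamma=0.5)
scaler = GradScaler()

for epoch in range(WARMUP_EPOCHS + JOINT_EPOCHS):
    joint = epoch >= WARMUP_EPOCHS                    # enable CAR terms after warm-up
    for step, (x, y_gt, m) in enumerate(loader):
        with autocast():                              # mixed-precision forward pass
            y_pred = f_phi(x)
            if joint:
                loss = car_losses(x, y_pred, y_gt, m, encoder, decoder)
            else:
                loss = 1.0 - masked_dice(y_pred, y_gt, m)
        scaler.scale(loss / ACCUM).backward()         # accumulate over ACCUM micro-batches
        if (step + 1) % ACCUM == 0:
            scaler.unscale_(optimizer)
            torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)  # global norm clipping
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()
    scheduler.step()
```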

The table below summarizes primary architectural and training elements:

| Component | Detail | Purpose |
|---|---|---|
| Backbone | 2D U-Net (ResNet-50 encoder) | Segmentation prediction |
| CAR encoder ($G_{\theta_1}$) | Concat $[X;P]$, Conv3×3 ×2, BN, ReLU, maxpool | Latent space embedding |
| CAR decoder ($G_{\theta_2}$) | Transposed conv ×2, BN, ReLU, 1×1 conv + sigmoid | Probability map reconstruction |
| Training | 2-stage (warm-up then joint), Adam, mixed precision, grad norm 1.0 | Stability and regularization |

5. Experimental Results and Empirical Contributions

Performance was evaluated both on synthetic and real CXR test sets, with strong results for zero-shot generalization using only synthetic training data:

  • On synthetic DRR test sets (TotalSegmentator): In PA view: mean DSC = 92.5%, Hausdorff Distance (HD) ≈ 10.7 px; in LA view: mean DSC = 88.8%, HD ≈ 20.4 px.
  • Real CXR datasets (ChestX-ray14, CheXpert, MIMIC, Shenzhen TB): In PA bone group, DSC ≈ 97.8%; LA view (lungs), DSC ≈ 98.2%.
  • Ablation studies: CAR consistently adds +0.5–1.5% DSC over prior strong baselines (MSDR + full augmentation), with greatest gains in lateral ribs (+1.51%) and lungs (+0.57%).
  • Hausdorff metrics: Improvements parallel DSC gains, particularly in organ overlap/ambiguity regions.

The effect of CAR compared to other approaches is summarized:

| Setting | PA macro-DSC | LA macro-DSC |
|---|---|---|
| Plain UNet | 91.6% | 85.98% |
| + Post-hoc augmentation | 93.8% | 89.16% |
| + MSDR | 94.0% | 87.60% |
| + Full augmentation | 94.5% | 91.35% |
| + CAR | ≈95.0% | ≈92.5% |

CAR offers improved topological and shape consistency, especially where image-level losses alone underperform.

6. Implications, Limitations, and Future Directions

CAR’s conditional latent-space regularization enables segmentation backbones to leverage all available supervision, rather than discarding partially labeled samples. This is particularly advantageous for annotation scalability and robustness in cross-domain or multi-angle imaging.

Notable strengths include:

  • Utilization of all available annotations—each image contributes supervisory signal, harnessed via the mask MM.
  • Enhancement of anatomical plausibility, topological coherence, and boundary delineation beyond what standard pixel-wise losses achieve.
  • Architectural compatibility as a plug-in module atop existing backbones.

Limitations and areas for extension:

  • Synthetic label boundaries may deviate from clinician-annotated conventions (e.g., due to voxel-level jaggedness), potentially necessitating minimal fine-tuning on expert masks for full clinical adoption.
  • Physical realism in DRRs is currently limited: scatter, beam hardening, and detector blur are omitted, with possible improvement via integration of Monte Carlo simulations.
  • CAR is orthogonal to segmentation backbone design—emerging architectures (e.g., transformer-based models) or advanced losses (e.g., diffusion-based) may yield further gains.

A plausible implication is that future modifications incorporating richer synthetic realism and advanced backbones may further strengthen CAR’s utility, especially in settings requiring maximum clinical fidelity or in broader multi-organ, multi-modality segmentation tasks.

7. Context and Significance within the Field

Conditional Joint Annotation Regularization represents a substantial methodological advance in anatomy-aware image analysis, particularly for medical imaging tasks characterized by partial or imperfect supervision. Within the AnyCXR pipeline, CAR enabled the generalizable segmentation of 54 anatomical structures across a wide range of acquisition geometries, facilitating downstream clinical measurements such as cardiothoracic ratio estimation, spine curvature assessment, and disease classification.

By facilitating end-to-end learning from large, imperfectly labeled synthetic datasets, and promoting anatomical consistency in the latent embedding space, CAR offers a scalable pathway for robust, clinically relevant CXR segmentation and supports annotation-efficient development of next-generation medical imaging models (Zifei et al., 19 Dec 2025).
