
CAR: Conditional Joint Annotation Regularization

Updated 26 December 2025
  • The paper introduces CAR, a method that leverages incomplete annotations to achieve robust segmentation in chest X-rays with competitive DSC scores.
  • CAR employs latent-space regularization via an autoencoder, enforcing anatomical consistency by integrating segmentation, distribution, and reconstruction losses.
  • The framework integrates with a 2D U-Net backbone and shows improved scalability and accuracy on synthetic and real medical imaging datasets.

Conditional Joint Annotation Regularization (CAR) is a learning strategy designed to enable robust multi-organ segmentation in medical imaging scenarios where ground-truth annotations are sparse, partially available, or noisy. Originally introduced within the AnyCXR framework for chest X-ray (CXR) analysis, CAR addresses the challenge of leveraging large-scale synthetic datasets exhibiting incomplete or imperfect labels, enforcing anatomical consistency through latent space regularization. The approach enables end-to-end training without discarding partially labeled samples, maintaining accuracy and anatomical integrity even under diverse image acquisition conditions (Zifei et al., 19 Dec 2025).

1. Problem Formulation and Motivation

CAR addresses the segmentation of $C$ anatomical classes in chest X-rays $X \in \mathbb{R}^{H \times W}$, where model predictions $\hat{Y} = f_\phi(X)$ are $C$-channel probability maps. The key challenge is that for each image, only a subset of classes is annotated. This partial annotation is tracked by a binary availability mask $M \in \{0,1\}^{B \times C}$ (for batch size $B$), such that $M^{(b,c)} = 1$ if class $c$ is labeled in sample $b$. Labels $Y_{gt}$ may arise from automated CT-based algorithms and are often incomplete or imprecise. CAR is designed to exploit all available labels (even if partial) by operating on tuples $\{X, \hat{Y}, M, Y_{gt}\}$ without assuming label completeness.

This formulation is instrumental for large-scale synthetic pipelines (e.g., generation of 126,000 DRRs from 2,982 CT volumes), in which the cost or feasibility of exhaustive manual annotation is prohibitive. CAR thus enables scalable anatomy-aware learning from imperfectly labeled synthetic data (Zifei et al., 19 Dec 2025).
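
To make the partial-annotation setup concrete, the following is a minimal PyTorch sketch of packing one image and its available masks into the tuple $(X, Y_{gt}, M)$. The class indices, shapes, and the `labels` dictionary layout are illustrative assumptions, not the AnyCXR data format.

```python
# Hypothetical sketch of packing one partially labeled DRR into (X, Y_gt, M).
# Class indices, shapes, and the `labels` dict layout are illustrative, not
# the AnyCXR data format.
import torch

NUM_CLASSES = 54   # anatomical classes C
H = W = 512        # image resolution

def build_sample(image: torch.Tensor, labels: dict[int, torch.Tensor]):
    """Return (X, Y_gt, M) where M^{(c)} = 1 only for annotated classes."""
    y_gt = torch.zeros(NUM_CLASSES, H, W)   # unlabeled channels stay zero
    m = torch.zeros(NUM_CLASSES)            # availability mask
    for cls_idx, mask in labels.items():    # only annotated classes appear
        y_gt[cls_idx] = mask
        m[cls_idx] = 1.0
    return image, y_gt, m

# Example: a DRR where only two of the 54 classes carry masks.
x = torch.rand(1, H, W)
x, y_gt, m = build_sample(x, {0: torch.ones(H, W), 7: torch.ones(H, W)})
```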

2. CAR Objective: Latent Consistency and Supervision with Incomplete Labels

The CAR objective function integrates several loss terms, each serving a particular facet of the learning problem:

  • Segmentation Loss ($L_{Seg}$): Computes the mean Dice similarity coefficient (DSC) over only the available labels, as indicated by $M$:

$$L_{Seg} = \frac{1}{\sum_{b,c} M^{(b,c)}} \sum_{b=1}^{B} \sum_{c=1}^{C} M^{(b,c)} \cdot \operatorname{Dice}\bigl(\hat{Y}^{(b,c)}, Y_{gt}^{(b,c)}\bigr)$$

  • Reliable Target Construction: For each class/sample pair, CAR constructs a target $Y^{(b,c)}$ defined as:

$$Y^{(b,c)} = \begin{cases} Y_{gt}^{(b,c)} & \text{if } M^{(b,c)} = 1 \\ \hat{Y}^{(b,c)}\,\big|_{\text{no-grad}} & \text{otherwise} \end{cases}$$

This formulation ensures that gradients are not propagated through unobserved labels, avoiding spurious learning signals.

  • Latent-space Distribution Loss ($L_{Dist}$): An encoder $G_{\theta_1}$ maps $[X;P]$ (where $P$ is either a probability map or label mask) to a latent embedding $z \in \mathbb{R}^d$. The loss enforces that the predicted segmentation and the reliable target produce similar latent codes via cosine distance:

$$L_{Dist} = \frac{1}{B} \sum_{b=1}^{B} \left[ 1 - \frac{z_{\hat{Y}}^{(b)} \cdot z_{Y}^{(b)}}{\|z_{\hat{Y}}^{(b)}\| \, \|z_{Y}^{(b)}\|} \right]$$

  • Reconstruction Loss ($L_{Recon}$): A decoder $G_{\theta_2}$ attempts to reconstruct the probability maps from their latent embeddings, using MSE with gradients detached from the targets:

$$L_{Recon} = \operatorname{MSE}\bigl(G_{\theta_2}(z_{\hat{Y}}),\ \hat{Y}\big|_{\text{no-grad}}\bigr) + \operatorname{MSE}\bigl(G_{\theta_2}(z_{Y}),\ Y\big|_{\text{no-grad}}\bigr)$$

  • CAR-Regularized Objective: The overall objective integrates these components, with recommended hyperparameters $\lambda_{Dist} = 4$, $\lambda_{Recon} = 2$:

$$L_{Total} = L_{Seg} + \lambda_{Dist} L_{Dist} + \lambda_{Recon} L_{Recon}$$

The CAR objective enforces not only pixelwise agreement on available labels, but also consistency in latent anatomical space, regularizing the model towards plausible multi-organ configurations even in the absence of ground-truth supervision.
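
To illustrate how these terms interact, below is a minimal PyTorch sketch of the CAR objective under the stated hyperparameters ($\lambda_{Dist} = 4$, $\lambda_{Recon} = 2$). The `encoder` and `decoder` modules and the soft-Dice formulation are assumptions; in particular, the segmentation term is written as $1 - \text{Dice}$ so that lower is better, whereas the text above reports the Dice term itself.

```python
# Minimal sketch of the CAR objective (assumptions noted in the text above).
import torch
import torch.nn.functional as F

def masked_dice(y_pred, y_gt, m, eps=1e-6):
    """Mean soft Dice over the labeled (b, c) pairs selected by M."""
    inter = (y_pred * y_gt).sum(dim=(-2, -1))
    union = y_pred.sum(dim=(-2, -1)) + y_gt.sum(dim=(-2, -1))
    dice = (2 * inter + eps) / (union + eps)          # shape (B, C)
    return (m * dice).sum() / m.sum().clamp(min=1.0)

def car_losses(x, y_pred, y_gt, m, encoder, decoder,
               lam_dist=4.0, lam_recon=2.0):
    """L_Total = L_Seg + lambda_Dist * L_Dist + lambda_Recon * L_Recon."""
    # Reliable target: ground truth where available, detached prediction otherwise.
    m_pix = m[..., None, None]                        # broadcast M to (B, C, 1, 1)
    y_rel = m_pix * y_gt + (1 - m_pix) * y_pred.detach()

    # Segmentation term on labeled channels (written as 1 - Dice so lower is better).
    l_seg = 1.0 - masked_dice(y_pred, y_gt, m)

    # Latent distribution loss: cosine distance between the embeddings of the
    # prediction and the reliable target, both conditioned on the image X.
    # `encoder` is assumed to concatenate [X; P] channel-wise internally.
    z_pred, z_rel = encoder(x, y_pred), encoder(x, y_rel)
    l_dist = (1.0 - F.cosine_similarity(z_pred.flatten(1),
                                        z_rel.flatten(1), dim=1)).mean()

    # Reconstruction loss with gradients detached from the reconstruction targets.
    l_recon = (F.mse_loss(decoder(z_pred), y_pred.detach()) +
               F.mse_loss(decoder(z_rel), y_rel.detach()))

    return l_seg + lam_dist * l_dist + lam_recon * l_recon
```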

3. Model Architecture and Integration

CAR is architecturally modular and can be integrated atop any segmentation backbone. For AnyCXR, the key design elements are as follows:

  • Segmentation Backbone ($f_\phi$): 2D U-Net with ResNet-50 encoder pretrained on ImageNet, with the first convolution modified for single-channel CXRs.
  • CAR Autoencoder ($G$):
    • Encoder ($G_{\theta_1}$): Channel-wise concatenates $X$ and the probability map $P$, applies two convolutional layers (Conv3×3 → BN → ReLU), max pooling, and outputs $z$.
    • Decoder ($G_{\theta_2}$): Two transposed convolutions (4×4 kernel, stride 2, padding 1), BN and ReLU activations, followed by a 1×1 convolution and sigmoid activation to produce a reconstructed probability map.

This architectural configuration ensures that the latent embedding $z$ encodes both the image appearance and multi-organ probability structure. Conditioning via concatenation $[X;P]$ is central to enabling this joint representation.
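
A hedged sketch of such an autoencoder is given below. Channel widths, latent size, and downsampling depth are assumptions chosen so that encoder and decoder shapes round-trip (the encoder here pools after each conv block to match the decoder's two stride-2 upsamplings); the paper's exact dimensions are not reproduced.

```python
# Sketch of the CAR autoencoder described above; widths and pooling depth are
# assumptions chosen so that encoder and decoder shapes round-trip.
import torch
import torch.nn as nn

class CAREncoder(nn.Module):
    """G_theta1: concatenate [X; P], two Conv3x3-BN-ReLU blocks, pooling."""
    def __init__(self, num_classes=54, width=64):
        super().__init__()
        in_ch = 1 + num_classes                       # single-channel CXR + C-channel P
        self.blocks = nn.Sequential(
            nn.Conv2d(in_ch, width, 3, padding=1), nn.BatchNorm2d(width), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                          # pooling per block (assumption, see text)
            nn.Conv2d(width, width, 3, padding=1), nn.BatchNorm2d(width), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
        )

    def forward(self, x, p):
        return self.blocks(torch.cat([x, p], dim=1))  # conditional latent embedding z

class CARDecoder(nn.Module):
    """G_theta2: two 4x4 stride-2 transposed convs, then 1x1 conv + sigmoid."""
    def __init__(self, num_classes=54, width=64):
        super().__init__()
        self.up = nn.Sequential(
            nn.ConvTranspose2d(width, width, 4, stride=2, padding=1),
            nn.BatchNorm2d(width), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(width, width, 4, stride=2, padding=1),
            nn.BatchNorm2d(width), nn.ReLU(inplace=True),
            nn.Conv2d(width, num_classes, 1), nn.Sigmoid(),
        )

    def forward(self, z):
        return self.up(z)                             # reconstructed probability map

# Shape check: the reconstructed probability map matches the input P.
enc, dec = CAREncoder(), CARDecoder()
x, p = torch.rand(2, 1, 256, 256), torch.rand(2, 54, 256, 256)
assert dec(enc(x, p)).shape == p.shape
```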

4. Training Regimen and Implementation Considerations

The CAR-enhanced training procedure utilizes large-scale, partially labeled synthetic data and applies several practical optimizations:

  • Data Pipeline: 2,982 CT volumes are used to generate approximately 126,000 DRRs across nine standard CXR views. All segmentation masks align with their DRRs via deterministic multi-stage domain randomization (MSDR) processes, ensuring consistent spatial correspondences.
  • Optimization: Adam optimizer, initial learning rate $1 \times 10^{-4}$, halved at epochs 50, 80, and 128, with effective batch size 24 (using gradient accumulation ×4), and global gradient norm clipping to 1.0. Mixed-precision training is employed for computational stability.
  • Two-stage Training: Initial 200-epoch warm-up with only the segmentation loss $L_{Seg}$, followed by 200 epochs of joint training using the full $L_{Total}$. Five-fold cross-validation is used with a fixed random seed (42). Importantly, there are no samples with complete annotations: each DRR typically contains only a subset of the 54 classes, and CAR exploits all available supervision via the mask $M$.
  • Stability and Regularization: Warm-up, gradient norm clipping, and mixed-precision are critical for robust convergence, given the scale and partial nature of the annotation.
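
The schedule above can be sketched as follows, assuming the backbone `f_phi`, the `encoder`/`decoder` modules, and the `car_losses`/`masked_dice` helpers from the earlier sketches, plus a dataloader `loader` yielding $(X, Y_{gt}, M)$ batches. Epoch counts, milestones, and the accumulation factor follow the text; everything else is illustrative.

```python
# Hedged sketch of the two-stage training schedule (assumptions noted above).
import torch
from torch.cuda.amp import autocast, GradScaler

WARMUP_EPOCHS, JOINT_EPOCHS, ACCUM = 200, 200, 4
params = (list(f_phi.parameters()) + list(encoder.parameters())
          + list(decoder.parameters()))

optimizer = torch.optim.Adam(params, lr=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[50, 80, 128], gamma=0.5)
scaler = GradScaler()

for epoch in range(WARMUP_EPOCHS + JOINT_EPOCHS):
    joint = epoch >= WARMUP_EPOCHS                    # enable CAR terms after warm-up
    for step, (x, y_gt, m) in enumerate(loader):
        with autocast():                              # mixed-precision forward pass
            y_pred = f_phi(x)
            if joint:
                loss = car_losses(x, y_pred, y_gt, m, encoder, decoder)
            else:
                loss = 1.0 - masked_dice(y_pred, y_gt, m)
        scaler.scale(loss / ACCUM).backward()         # accumulate over ACCUM micro-batches
        if (step + 1) % ACCUM == 0:
            scaler.unscale_(optimizer)
            torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)  # global norm clipping
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()
    scheduler.step()
```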

The table below summarizes primary architectural and training elements:

| Component | Detail | Purpose |
|---|---|---|
| Backbone | 2D U-Net (ResNet-50 encoder) | Segmentation prediction |
| CAR encoder ($G_{\theta_1}$) | Concat $[X;P]$, Conv3×3 ×2, BN, ReLU, maxpool | Latent space embedding |
| CAR decoder ($G_{\theta_2}$) | Transposed conv ×2, BN, ReLU, 1×1 conv + sigmoid | Probability map reconstruction |
| Training | 2-stage (warm-up then joint), Adam, mixed precision, grad norm 1.0 | Stability and regularization |

5. Experimental Results and Empirical Contributions

Performance was evaluated both on synthetic and real CXR test sets, with strong results for zero-shot generalization using only synthetic training data:

  • On synthetic DRR test sets (TotalSegmentator): In PA view: mean DSC = 92.5%, Hausdorff Distance (HD) ≈ 10.7 px; in LA view: mean DSC = 88.8%, HD ≈ 20.4 px.
  • Real CXR datasets (ChestX-ray14, CheXpert, MIMIC, Shenzhen TB): In PA bone group, DSC ≈ 97.8%; LA view (lungs), DSC ≈ 98.2%.
  • Ablation studies: CAR consistently adds +0.5–1.5% DSC over prior strong baselines (MSDR + full augmentation), with greatest gains in lateral ribs (+1.51%) and lungs (+0.57%).
  • Hausdorff metrics: Improvements parallel DSC gains, particularly in organ overlap/ambiguity regions.

The effect of CAR compared to other approaches is summarized:

| Setting | PA macro-DSC | LA macro-DSC |
|---|---|---|
| Plain UNet | 91.6% | 85.98% |
| + Post-hoc augmentation | 93.8% | 89.16% |
| + MSDR | 94.0% | 87.60% |
| + Full augmentation | 94.5% | 91.35% |
| + CAR | ≈95.0% | ≈92.5% |

CAR offers improved topological and shape consistency, especially where image-level losses alone underperform.

6. Implications, Limitations, and Future Directions

CAR’s conditional latent-space regularization enables segmentation backbones to leverage all available supervision, rather than discarding partially labeled samples. This is particularly advantageous for annotation scalability and robustness in cross-domain or multi-angle imaging.

Notable strengths include:

  • Utilization of all available annotations—each image contributes supervisory signal, harnessed via the mask MM.
  • Enhancement of anatomical plausibility, topological coherence, and boundary delineation beyond what standard pixel-wise losses achieve.
  • Architectural compatibility as a plug-in module atop existing backbones.

Limitations and areas for extension:

  • Synthetic label boundaries may deviate from clinician-annotated conventions (e.g., due to voxel-level jaggedness), potentially necessitating minimal fine-tuning on expert masks for full clinical adoption.
  • Physical realism in DRRs is currently limited: scatter, beam hardening, and detector blur are omitted, with possible improvement via integration of Monte Carlo simulations.
  • CAR is orthogonal to segmentation backbone design—emerging architectures (e.g., transformer-based models) or advanced losses (e.g., diffusion-based) may yield further gains.

A plausible implication is that future modifications incorporating richer synthetic realism and advanced backbones may further strengthen CAR’s utility, especially in settings requiring maximum clinical fidelity or in broader multi-organ, multi-modality segmentation tasks.

7. Context and Significance within the Field

Conditional Joint Annotation Regularization represents a substantial methodological advance in anatomy-aware image analysis, particularly for medical imaging tasks characterized by partial or imperfect supervision. Within the AnyCXR pipeline, CAR enabled the generalizable segmentation of 54 anatomical structures across a wide range of acquisition geometries, facilitating downstream clinical measurements such as cardiothoracic ratio estimation, spine curvature assessment, and disease classification.

By facilitating end-to-end learning from large, imperfectly labeled synthetic datasets, and promoting anatomical consistency in the latent embedding space, CAR offers a scalable pathway for robust, clinically relevant CXR segmentation and supports annotation-efficient development of next-generation medical imaging models (Zifei et al., 19 Dec 2025).
