Dense Image-to-Image Network (DI2I)
- DI2I is a convolutional neural network that combines DenseNet and U-Net features to accurately segment multiple organs from digitally reconstructed radiographs.
- It utilizes dense blocks and multi-scale skip connections to exploit feature reuse, achieving Dice scores above 89% on all four target organs in DRR segmentation.
- The model is integrated as a fixed module within a Task-Driven GAN, enabling effective unsupervised domain adaptation from synthetic DRRs to clinical X-ray images.
The Dense Image-to-Image Network (DI2I) is a convolutional neural network architecture designed for high-precision multi-organ segmentation in medical radiology, specifically in segmenting Digitally Reconstructed Radiographs (DRRs) synthesized from 3D CT scans. DI2I integrates principles from U-Net and DenseNet architectures to exploit multi-scale features with dense connectivity, enabling accurate pixel-wise parsing in challenging image domains with complex anatomical structures.
1. Architecture and Model Design
DI2I adopts a U-Net–type encoder–decoder backbone, augmented by DenseNet-style “dense blocks” that perform extensive feature reuse and deep supervision. The design follows the Tiramisu variant (Jegou et al. 2017) with architecture details as follows:
- Input: Single-channel DRR, 512×512 pixels.
- Initial Convolution: 3×3 convolution, batch normalization, ReLU, 48 output channels.
- Encoder Path: Four dense blocks (growth rate 16), each followed by a “transition down” (BN–ReLU–1×1 conv + 2×2 max pooling).
- Dense Block 1: 4 layers, 112 channels
- Dense Block 2: 5 layers, 192 channels
- Dense Block 3: 7 layers, 304 channels
- Dense Block 4: 10 layers, 464 channels
- Bottleneck: One dense block (12 layers, 656 channels)
- Decoder Path: Four “transition up” stages (strided transposed convolutions, stride 2), with skip connections from the encoder, each followed by a mirrored dense block.
- Final classifier: 1×1 convolution producing 5 output channels: 1 for background and 4 for organs (lung, heart, liver, bone).
All convolutions are followed by batch normalization and ReLU. Each dense block concatenates the output of every layer with its input (layer-wise feature reuse). Skip connections link encoder and decoder at each spatial resolution; a minimal sketch of these building blocks follows the layer table below.
| Layer | Output Size | Operation |
|---|---|---|
| Input | 512×512×1 | — |
| Conv_init | 512×512×48 | Conv3×3, BN, ReLU |
| Dense Block 1 (4) | 512×512×112 | {BN–ReLU–1×1 Conv–BN–ReLU–3×3 Conv}×4 |
| Transition Down 1 | 256×256×112 | BN–ReLU–1×1 Conv + MaxPool 2×2 |
| Dense Block 2 (5) | 256×256×192 | ... |
| ... | ... | ... |
| Bottleneck Block (12) | 32×32×656 | ... |
| ... | ... | ... |
| Classifier | 512×512×5 | Conv1×1 → logits for 5 classes |
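To make the dense connectivity and transition-down operations above concrete, here is a minimal PyTorch sketch of the first encoder stage. The growth rate (16) and layer count (4) follow the table; the 1×1 bottleneck width inside each dense layer (4× the growth rate) is a common DenseNet convention assumed here, and all class and variable names are illustrative rather than taken from the original implementation.

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    """BN-ReLU-1x1 Conv-BN-ReLU-3x3 Conv, emitting `growth_rate` new feature maps."""
    def __init__(self, in_channels, growth_rate=16, bottleneck=4):
        super().__init__()
        inter = bottleneck * growth_rate          # 1x1 bottleneck width (assumed convention)
        self.block = nn.Sequential(
            nn.BatchNorm2d(in_channels), nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, inter, kernel_size=1, bias=False),
            nn.BatchNorm2d(inter), nn.ReLU(inplace=True),
            nn.Conv2d(inter, growth_rate, kernel_size=3, padding=1, bias=False),
        )

    def forward(self, x):
        return self.block(x)

class DenseBlock(nn.Module):
    """Concatenates each layer's output onto the running feature stack (feature reuse)."""
    def __init__(self, in_channels, n_layers, growth_rate=16):
        super().__init__()
        self.layers = nn.ModuleList(
            DenseLayer(in_channels + i * growth_rate, growth_rate) for i in range(n_layers)
        )

    def forward(self, x):
        features = x
        for layer in self.layers:
            features = torch.cat([features, layer(features)], dim=1)  # dense connectivity
        return features

class TransitionDown(nn.Module):
    """BN-ReLU-1x1 Conv + 2x2 max pooling; halves resolution, keeps channel count."""
    def __init__(self, channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=1, bias=False),
            nn.MaxPool2d(2),
        )

    def forward(self, x):
        return self.block(x)

# First encoder stage: 1-channel DRR -> 48 channels -> Dense Block 1 (4 layers) -> 112 channels -> downsample
stem = nn.Conv2d(1, 48, kernel_size=3, padding=1)
db1 = DenseBlock(48, n_layers=4, growth_rate=16)
td1 = TransitionDown(48 + 4 * 16)

drr = torch.randn(1, 1, 512, 512)
out = td1(db1(stem(drr)))
print(out.shape)   # torch.Size([1, 112, 256, 256])
```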
2. Objective Function and Segmentation Loss
DI2I addresses multi-label organ segmentation as a set of four binary segmentation tasks (one per organ against background). For organ $i \in \{1,2,3,4\}$, let $z_0(x)$ and $z_i(x)$ be the predicted logits for background and organ $i$ at pixel $x$. The organ probability is given by the pairwise softmax over the two logits:

$$p_i(x) = \frac{\exp\!\big(z_i(x)\big)}{\exp\!\big(z_0(x)\big) + \exp\!\big(z_i(x)\big)}$$

Given the ground-truth binary mask $y_i$ and scalar organ weight $w_i$, the per-organ segmentation loss is the weighted binary cross-entropy

$$\ell_i = -\,w_i \sum_{x}\Big[\, y_i(x)\log p_i(x) + \big(1 - y_i(x)\big)\log\big(1 - p_i(x)\big)\Big]$$

The full training objective sums the four per-organ losses:

$$\mathcal{L}_{\text{seg}} = \sum_{i=1}^{4} \ell_i$$
No additional weight decay, total variation, or regularization terms are employed.
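The loss defined above can be expressed compactly in PyTorch. The following sketch assumes the 5-channel logit layout from Section 1 (channel 0: background, channels 1–4: organs) and illustrative per-organ weights; it is a minimal reconstruction, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def di2i_segmentation_loss(logits, masks, weights):
    """Weighted sum of four organ-vs-background binary cross-entropies.

    logits:  (B, 5, H, W)  - channel 0: background, channels 1-4: lung, heart, liver, bone
    masks:   (B, 4, H, W)  - binary ground-truth mask per organ (float)
    weights: (4,)          - scalar weight per organ
    """
    z_bg = logits[:, 0]
    total = logits.new_zeros(())
    for i in range(4):
        # Pairwise softmax of (background, organ i): exp(z_i) / (exp(z_0) + exp(z_i))
        p = torch.sigmoid(logits[:, i + 1] - z_bg)
        # Binary cross-entropy, averaged over pixels (a sum differs only by a constant factor)
        total = total + weights[i] * F.binary_cross_entropy(p, masks[:, i])
    return total

# Usage with dummy tensors
logits = torch.randn(2, 5, 512, 512)
masks = (torch.rand(2, 4, 512, 512) > 0.5).float()
loss = di2i_segmentation_loss(logits, masks, weights=torch.ones(4))
```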
3. Training Protocol on DRRs
The model is trained on DRRs synthesized from 815 CT volumes, each with manual multi-organ labels. The 3D organ masks are projected to 2D space to render pixel-aligned DRR segmentation maps.
- Input/Output: 512×512 DRR → 512×512 segmentation map.
- Optimizer: Adam
- Batch Size: 4
- Epochs: 100
- Data Augmentation: Random horizontal flips, in-plane rotations, and intensity jitter.
- Implementation: PyTorch, single NVIDIA GPU (12 GB VRAM).
This protocol leverages extensive data augmentation to compensate for anatomical and acquisition variability in real clinical scenarios.
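A condensed training-loop sketch under this protocol is shown below. The DI2I model class, the dataset object (assumed to apply the augmentations listed above), and the learning rate value are placeholders; the loss function is the sketch from Section 2.

```python
import torch
from torch.utils.data import DataLoader

model = DI2I().cuda()                           # assumed network assembled from the blocks in Section 1
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)      # learning rate is a placeholder
loader = DataLoader(drr_dataset, batch_size=4, shuffle=True)   # yields (DRR, per-organ masks), pre-augmented
organ_weights = torch.ones(4, device="cuda")                   # illustrative per-organ weights

for epoch in range(100):
    for drr, masks in loader:
        drr, masks = drr.cuda(), masks.cuda().float()
        logits = model(drr)                                    # (B, 5, 512, 512)
        loss = di2i_segmentation_loss(logits, masks, organ_weights)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```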
4. Quantitative and Qualitative Segmentation Performance
On five-fold cross-validation with held-out DRRs, DI2I exhibits strong organ segmentation performance:
| Organ | Dice (mean ± std %) |
|---|---|
| Lung | 94.17 ± 1.7 |
| Heart | 92.3 ± 5.6 |
| Liver | 89.4 ± 6.1 |
| Bone | 91.0 ± 2.0 |
Qualitatively, the model produces sharp boundaries, correctly excludes small vessels, and delineates overlapping anatomical structures. These results indicate the architecture’s capacity for precise multi-class parsing in high-noise, high-overlap medical images.
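For reference, the Dice score reported in the table can be computed per organ from binarized predictions. The following is a minimal sketch with placeholder tensors, thresholding the pairwise organ-versus-background probability from Section 2; variable names are illustrative.

```python
import torch

def dice_score(pred_mask, true_mask, eps=1e-6):
    """Dice = 2|A ∩ B| / (|A| + |B|) for binary masks."""
    pred, true = pred_mask.bool(), true_mask.bool()
    inter = (pred & true).sum().float()
    return (2.0 * inter + eps) / (pred.sum() + true.sum() + eps)

# Example: per-organ Dice from 5-channel logits (channel 0 = background)
logits = torch.randn(1, 5, 512, 512)                     # placeholder network output
masks = torch.rand(1, 4, 512, 512) > 0.5                 # placeholder ground-truth masks
probs = torch.sigmoid(logits[:, 1:] - logits[:, :1])     # organ probability vs. background
scores = [dice_score(probs[0, i] > 0.5, masks[0, i]) for i in range(4)]
```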
5. Integration into Task-Driven GAN (TD-GAN) for Unsupervised X-ray Adaptation
DI2I is deployed as a frozen pre-trained module within the Task Driven Generative Adversarial Network (TD-GAN) to enable zero-label domain adaptation from synthetic DRRs to real, unpaired X-ray images.
TD-GAN Structure:
- Core: CycleGAN-like image-to-image framework (Zhu et al. 2017)
- Generators: $G_1$ (DRR → X-ray), $G_2$ (X-ray → DRR)
- Discriminators: $D_1$ (real vs. translated X-rays), $D_2$ (real vs. translated DRRs)
Task-driven losses:
- Conditional adversarial loss: a DRR-domain discriminator distinguishes real DRRs from translated (fake) DRRs conditioned on their DI2I segmentation outputs.
- Cycle-segmentation consistency: enforces that the reconstructed DRR $G_2(G_1(x))$ is both visually and segmentation-wise consistent with the source DRR $x$.
The total TD-GAN loss is a weighted combination of the CycleGAN terms and the two task-driven terms:

$$\mathcal{L}_{\text{TD-GAN}} = \mathcal{L}_{\text{adv}} + \lambda\,\mathcal{L}_{\text{cyc}} + \mathcal{L}_{\text{seg-adv}} + \mathcal{L}_{\text{seg-cyc}}$$

with weighting chosen as in CycleGAN (e.g., cycle-consistency weight $\lambda = 10$).
During TD-GAN training, DI2I is frozen (no parameter updates), ensuring that the adapted images preserve the organ boundaries and structures learned from DRRs.
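The role of the frozen DI2I inside TD-GAN can be sketched as follows. The generator/discriminator objects, the checkpoint path, the least-squares adversarial form, and the cycle weight are assumptions in the style of CycleGAN rather than the original TD-GAN implementation; discriminator updates are omitted for brevity.

```python
import torch
import torch.nn.functional as F

# Freeze the pre-trained DI2I: no parameter updates during TD-GAN training.
di2i = DI2I()                                       # assumed dense segmentation network
di2i.load_state_dict(torch.load("di2i_drr.pth"))    # hypothetical checkpoint path
di2i.eval()
for p in di2i.parameters():
    p.requires_grad_(False)

def td_gan_generator_loss(drr, xray, G1, G2, D_xray, D_drr_cond, lam=10.0):
    """Generator-side objective. G1: DRR -> X-ray, G2: X-ray -> DRR.
    Least-squares adversarial terms and lam = 10 follow CycleGAN conventions."""
    fake_xray = G1(drr)
    fake_drr = G2(xray)
    rec_drr = G2(fake_xray)

    # Standard CycleGAN terms: fool the X-ray discriminator, reconstruct the source DRR.
    pred_fake_xray = D_xray(fake_xray)
    adv = F.mse_loss(pred_fake_xray, torch.ones_like(pred_fake_xray))
    cyc = F.l1_loss(rec_drr, drr)

    # Conditional adversarial term: D_drr_cond judges (image, DI2I mask) pairs (6 input channels),
    # so translated DRRs must also yield plausible segmentations.
    fake_pair = torch.cat([fake_drr, di2i(fake_drr)], dim=1)
    pred_fake_pair = D_drr_cond(fake_pair)
    seg_adv = F.mse_loss(pred_fake_pair, torch.ones_like(pred_fake_pair))

    # Cycle-segmentation consistency: the reconstructed DRR must segment like the source DRR
    # (the DRR's ground-truth masks could be used here instead, since DRRs are labeled).
    with torch.no_grad():
        ref_mask = di2i(drr)
    seg_cyc = F.l1_loss(di2i(rec_drr), ref_mask)

    return adv + seg_adv + lam * cyc + seg_cyc
```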
6. Impact on Downstream Unsupervised X-ray Image Segmentation
On a held-out set of 60 clinical topogram X-rays (only used for evaluation), DI2I-targeted TD-GAN achieves high segmentation performance:
| Setting | Mean Dice (%) |
|---|---|
| DI2I trained on DRRs, no adaptation | 30.8 |
| CycleGAN (image translation only) | 80.8 |
| TD-GAN with conditional adversarial loss only | 82.4 |
| TD-GAN with cycle-segmentation loss only | 84.4 |
| TD-GAN with both task-driven losses | 85.4 |
| Fully supervised (on labeled topograms) | 88.3 |
The TD-GAN framework narrows the gap to fully supervised training (85.4% vs. 88.3%) without requiring any X-ray labels, and qualitative assessment shows that it faithfully restores organ shapes and crisp anatomical boundaries in real X-ray images. The vanilla DI2I, lacking adaptation, fails to generalize (30.8% mean Dice), highlighting the necessity of explicit domain transfer mechanisms.
7. Context, Related Work, and Extensions
DI2I is a representative example of modern architectural advances in semantic segmentation—combining dense connectivity (DenseNet) and multi-scale skip connections (U-Net)—for robust medical image parsing. Its integration as a fixed task module within TD-GAN represents a distinctive strategy: leveraging model semantics to constrain generative domain adaptation and enforce anatomical correctness, rather than only relying on pixel or feature-level adversarial alignment.
Compared with conventional domain transfer pipelines (e.g., image translation followed by downstream segmentation), the task-driven approach achieves notably stronger transferability and anatomical fidelity. A plausible implication is that similar dense encoder-decoder architectures, when coupled with appropriate task-driven consistency objectives, can extend to other cross-modal segmentation and parsing tasks with scarce labels (Zhang et al., 2018).