A Utility-preserving De-identification Pipeline for Cross-hospital Radiology Data Sharing

Published 8 Apr 2026 in cs.CV | (2604.07128v1)

Abstract: Large-scale radiology data are critical for developing robust medical AI systems. However, sharing such data across hospitals remains heavily constrained by privacy concerns. Existing de-identification research in radiology mainly focuses on removing identifiable information to enable compliant data release. Yet whether de-identified radiology data can still preserve sufficient utility for large-scale vision-language model training and cross-hospital transfer remains underexplored. In this paper, we introduce a utility-preserving de-identification pipeline (UPDP) for cross-hospital radiology data sharing. Specifically, we compile a blacklist of privacy-sensitive terms and a whitelist of pathology-related terms. For radiology images, we use a generative filtering mechanism that synthesizes privacy-filtered and pathology-preserved counterparts of the original images. These synthetic image counterparts, together with ID-filtered reports, can then be securely shared across hospitals for downstream model development and evaluation. Experiments on public chest X-ray benchmarks demonstrate that our method effectively removes privacy-sensitive information while preserving diagnostically relevant pathology cues. Models trained on the de-identified data maintain competitive diagnostic accuracy compared with those trained on the original data, while exhibiting a marked decline in identity-related accuracy, confirming effective privacy protection. In the cross-hospital setting, we further show that de-identified data can be combined with local data to yield better performance.

Authors (10)

Summary

The paper proposes UPDP, a novel pipeline using blacklist/whitelist semantic filtering and controlled diffusion to securely share radiological image–report pairs.
It demonstrates significant identity suppression (reducing identity-classifier accuracy to 4.57%) while maintaining diagnostic fidelity on metrics such as BLEU, BERTScore, and RadGraph F1.
Iterative prompt optimization and multimodal alignment ensure that synthetic data closely mirrors real data, enabling robust vision-language model training.

Utility-Preserving De-identification for Cross-Hospital Radiology Data Transfer

Introduction and Motivation

Cross-institutional collaboration in medical AI relies heavily on large, diverse radiology datasets. However, stringent privacy regulations and institutional policies significantly inhibit the sharing of radiological images and associated reports (Figure 1). Identifiable information embedded in both imaging data and text, ranging from burned-in annotations to clinical notes, creates persistent privacy risks and impedes data transfer even when standard de-identification measures are applied. Existing research has predominantly focused on eliminating identifiers, while the preservation of diagnostic utility, particularly for vision-language model (VLM) training, is insufficiently addressed.

Figure 1: Limitations of real radiology data for cross-hospital model training due to privacy restrictions and institutional barriers.

The work proposes a Utility-Preserving De-identification Pipeline (UPDP) that systematically segregates privacy-sensitive tokens (using a blacklist) from clinically essential, pathology-related concepts (using a whitelist). For images, generative filtering produces privacy-filtered, pathology-preserved synthetic counterparts. This approach enables the secure sharing of image–report pairs across institutions with minimal compromise to downstream VLM development.

Pipeline Architecture and Methodology

UPDP introduces a controlled image–text generation scheme, formulating de-identification as a guided synthesis problem for both images and associated radiology reports. The pipeline consists of:

Semantic Filtering: Reports undergo lexical constraints via blacklist exclusion (explicit patient identifiers, dates, locations, demographic cues) and whitelist soft promotion (anatomical and pathology descriptors).
Multimodal Alignment: Vision and text representations are extracted using a pretrained VLM backbone, followed by content embedding optimization to align semantic features across modalities.
Diffusion-based Synthesis: Optimized, privacy-compliant prompts condition a diffusion model to yield chest X-rays that suppress identity attributes while retaining structure and pathology alignment.

Figure 2: Overview of the UPDP pipeline, leveraging multimodal feature extraction, content optimization, and controlled diffusion for de-identified image generation.
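
The blacklist side of the semantic filtering step can be illustrated with a minimal sketch. The patterns, the whitelist terms, the `[REMOVED]` placeholder, and the function names `remove_blacklisted` and `whitelist_hits` are all hypothetical choices made for illustration; the paper's actual term lists and filtering rules are not reproduced here.

```python
import re

# Hypothetical blacklist patterns (assumptions, not the paper's list).
BLACKLIST_PATTERNS = [
    r"\b\d{1,2}/\d{1,2}/\d{2,4}\b",        # dates such as 03/14/2021
    r"\b(?:mr|mrs|ms|dr)\.?\s+[a-z]+\b",    # titled names such as "Mr. Smith"
    r"\b\d{1,3}[- ]year[- ]old\b",          # age cues such as "63-year-old"
    r"\b(?:male|female|man|woman)\b",       # demographic cues
]
_BLACKLIST_RE = re.compile("|".join(BLACKLIST_PATTERNS), flags=re.IGNORECASE)

# Hypothetical whitelist of pathology descriptors (assumption).
WHITELIST_TERMS = {
    "pneumothorax", "effusion", "cardiomegaly", "consolidation", "atelectasis",
}


def remove_blacklisted(report: str, placeholder: str = "[REMOVED]") -> str:
    """Replace every blacklist match with a placeholder token."""
    return _BLACKLIST_RE.sub(placeholder, report)


def whitelist_hits(report: str) -> list[str]:
    """Return the whitelist terms that occur in the report, sorted."""
    words = set(re.findall(r"[a-z]+", report.lower()))
    return sorted(words & WHITELIST_TERMS)


report = (
    "Mr. Smith, 63-year-old male, seen 03/14/2021. "
    "Small left pleural effusion. No pneumothorax."
)
filtered = remove_blacklisted(report)
kept_terms = whitelist_hits(filtered)
```

In this sketch the blacklist acts as a hard constraint, while the whitelist only reports which pathology terms survive. How such terms might be softly promoted during token selection is a separate question that this example does not address.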

The algorithm employs iterative prompt tuning and constrained token selection to achieve competitive semantic fidelity. The continuous report embeddings are projected onto the discrete token space, and the resultant de-identified content guides the image generator. Content length and prompt initialization are shown—via ablation—to influence downstream fidelity.
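
One way to picture the projection from continuous embeddings onto a constrained discrete token space is a nearest-neighbor lookup with hard blacklist exclusion and a small additive bonus for whitelist tokens. This is a sketch under stated assumptions: the toy two-dimensional vocabulary, the cosine scoring rule, the `bonus` parameter, and the function name `project_to_tokens` are illustrative and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def project_to_tokens(embeddings, vocab, blacklist, whitelist, bonus=0.1):
    """Map each continuous embedding to its best-scoring allowed token.

    Blacklisted tokens are never selected (hard exclusion); whitelisted
    tokens receive an additive score bonus (soft promotion).
    """
    selected = []
    for emb in embeddings:
        best_token, best_score = None, -math.inf
        for token, vec in vocab.items():
            if token in blacklist:
                continue  # hard constraint: skip privacy-sensitive tokens
            score = cosine(emb, vec)
            if token in whitelist:
                score += bonus  # soft constraint: favor pathology terms
            if score > best_score:
                best_token, best_score = token, score
        selected.append(best_token)
    return selected


# Toy vocabulary with made-up 2-D embeddings (assumption for illustration).
vocab = {"effusion": [1.0, 0.0], "smith": [0.0, 1.0], "opacity": [0.7, 0.7]}
tokens = project_to_tokens(
    embeddings=[[0.1, 1.0], [0.9, 0.2]],
    vocab=vocab,
    blacklist={"smith"},
    whitelist={"effusion"},
)
```

The first embedding lies closest to the blacklisted token, so the projection falls back to the next-best allowed token instead. A real system would operate on the text encoder's embedding table rather than a hand-written dictionary.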

Experimental Results

Privacy Protection and Utility Preservation

De-identification effectiveness was quantitatively validated via identity classification tasks. A model trained to predict patient identity achieved accuracies of 97.21%, 13.69%, and 4.57% on original, report-only generated, and UPDP-generated images, respectively. This indicates substantial suppression of identity-specific features in UPDP output, which also outperforms the report-only baseline, whose inputs may still indirectly encode identifiers.
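
To make the size of these drops concrete, the short calculation below derives absolute and relative reductions from the three reported accuracies. Only the three percentages come from the text; the helper name `accuracy_drop` is illustrative. Chance-level accuracy depends on the number of identities in the task, which this summary does not state, so no comparison to chance is attempted.

```python
# Identity-classifier accuracies (%) as reported in the text.
identity_accuracy = {"original": 97.21, "report_only": 13.69, "updp": 4.57}


def accuracy_drop(baseline, value):
    """Return (absolute drop in points, relative drop in percent)."""
    absolute = baseline - value
    relative = 100.0 * absolute / baseline
    return absolute, relative


drops = {
    name: accuracy_drop(identity_accuracy["original"], acc)
    for name, acc in identity_accuracy.items()
    if name != "original"
}
for name, (abs_drop, rel_drop) in drops.items():
    print(f"{name}: -{abs_drop:.2f} points ({rel_drop:.1f}% relative)")
```

Relative to the original images, report-only generation lowers identity accuracy by 83.52 points (about 85.9%), and UPDP lowers it by 92.64 points (about 95.3%).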

Performance on standard report generation benchmarks (MIMIC-CXR, IU X-Ray) demonstrates that VLMs trained using UPDP-generated synthetic data closely match those trained on original data across BLEU, METEOR, ROUGE-L, BERTScore, and RadGraph F1. For example, on MIMIC-CXR, BLEU-1 improved from 11.34 (report-only synthetic) to 13.62 (UPDP synthetic) for transferred data. Augmenting local data with UPDP-processed transfer samples yields further improvement in all evaluated metrics, supporting the hypothesis that UPDP preserves the essential diagnostic signals needed for robust model training.
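
For readers unfamiliar with BLEU-1, the sketch below computes a sentence-level version: clipped unigram precision multiplied by a brevity penalty. It is a simplified illustration that assumes whitespace tokenization and a single reference; the reported scores (for example the 2.28-point gain from 11.34 to 13.62) are presumably corpus-level values on a 0 to 100 scale computed with a standard toolkit, which this code does not replicate.

```python
import math
from collections import Counter


def bleu1(candidate: str, reference: str) -> float:
    """Sentence-level BLEU-1: clipped unigram precision times brevity penalty."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand:
        return 0.0
    cand_counts = Counter(cand)
    ref_counts = Counter(ref)
    # Each candidate token is credited at most as often as it appears
    # in the reference (clipping).
    clipped = sum(min(count, ref_counts[tok]) for tok, count in cand_counts.items())
    precision = clipped / len(cand)
    # Brevity penalty: candidates shorter than the reference are penalized.
    if len(cand) > len(ref):
        penalty = 1.0
    else:
        penalty = math.exp(1.0 - len(ref) / len(cand))
    return penalty * precision


reference = "there is a small left pleural effusion"
score_full = bleu1(reference, reference)
score_short = bleu1("small left pleural effusion", reference)
score_miss = bleu1("no acute cardiopulmonary process", reference)
```

A generated report that repeats the reference exactly scores 1.0, a correct but shorter report is reduced by the brevity penalty, and a report with no overlapping words scores 0.0.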

Figure 3: Left: De-identification reduces identity-classifier accuracy. Right: UPDP maintains high BERTScore and RadGraph F1, evidencing semantic and clinical utility.

Optimization and Content Analysis

Iterative prompt optimization enhances image–text alignment and visual fidelity, with improvements in SSIM observed across iterations (Figure 4). Prompt length analysis reveals an inverted U-shaped relationship between CLIP alignment and token count (Figure 5), suggesting that moderate prompt sizes balance utility and generalization.

Figure 4: Structural similarity (SSIM) to real images increases with more content optimization iterations.
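
As a reminder of what SSIM measures, the sketch below evaluates the standard SSIM formula once over a whole image, given as a flat list of pixel intensities. This is a simplification: common implementations average SSIM over local sliding windows, and the paper's exact SSIM configuration is not stated here. The function name `global_ssim` and the toy pixel values are illustrative.

```python
def global_ssim(x, y, data_range=255.0):
    """Single-window SSIM over two equal-length pixel sequences."""
    if len(x) != len(y) or not x:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    # Stabilizing constants from the standard SSIM definition.
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


# Toy 2x2 "images" flattened to lists (made-up values for illustration).
real = [10, 50, 200, 120]
inverted = [255 - p for p in real]
ssim_same = global_ssim(real, real)
ssim_inverted = global_ssim(real, inverted)
```

Identical inputs give a score of 1, while structurally dissimilar inputs, such as an intensity-inverted copy, score lower. An increasing SSIM across optimization iterations therefore indicates that synthetic images are moving closer to the real ones in luminance, contrast, and structure.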

Figure 5: CLIP scores peak at moderate prompt lengths, indicating optimal trade-off between detail and overfitting.

Qualitative comparison further reveals that images generated from optimized UPDP content better preserve anatomical morphology, radiographic structure, and fine-grained pathological findings compared to those from raw, merely de-identified reports (Figure 6).

Figure 6: Optimized contents yield synthetic chest X-rays with higher anatomical and pathological fidelity.

Implications and Future Directions

This work demonstrates that synthetic, privacy-preserving radiology datasets generated with lexically constrained, content-optimized diffusion models can serve as high-utility substitutes or augmentations for real data in VLM training. The strong numerical results, namely the reduction of identity-classifier accuracy to 4.57% while preserving BLEU-1, BERTScore, and RadGraph F1 to within a few points of original data, substantiate the potential of generative privacy-preservation approaches for overcoming institutional data-sharing restrictions.

The findings suggest several trajectories for future research:

Hybrid Training Regimes: Combining a small seed of real images with large-scale UPDP-synthesized data delivers the strongest performance, likely by stabilizing training against artifacts introduced by generative modeling.
Expert and Privacy Evaluation: Automated metrics offer a partial view; integration with expert radiologist assessment and rigorous privacy auditing, such as membership inference and re-identification risk, remains essential.
Domain Generalization: Adapting UPDP to diverse imaging modalities beyond chest X-ray, leveraging stronger medical priors and more robust multimodal representations, will further broaden applicability.
Diffusion Model Advances: Further investigation into diffusion-based medical image synthesis may yield even better fidelity–privacy trade-offs as models and hardware improve.

Conclusion

UPDP provides a scalable, practical solution for cross-institutional radiology data sharing, removing sensitive information at both the lexical and visual level while preserving clinical and diagnostic value for downstream AI development. The pipeline's strong numerical performance and extensibility to hybrid training scenarios underline its potential for real-world adoption, particularly in data-restricted medical AI environments. This work advances the technical frontier of privacy-aware medical data generation and lays a robust foundation for broader, federated training of medical VLMs without compromising regulatory compliance or diagnostic performance (2604.07128).
