RDFace: A Benchmark Dataset for Rare Disease Facial Image Analysis under Extreme Data Scarcity and Phenotype-Aware Synthetic Generation

Published 3 Apr 2026 in cs.CV and cs.AI | (2604.03454v1)

Abstract: Rare diseases often manifest with distinctive facial phenotypes in children, offering valuable diagnostic cues for clinicians and AI-assisted screening systems. However, progress in this field is severely limited by the scarcity of curated, ethically sourced facial data and the high similarity among phenotypes across different conditions. To address these challenges, we introduce RDFace, a curated benchmark dataset comprising 456 pediatric facial images spanning 103 rare genetic conditions (average 4.4 samples per condition). Each ethically verified image is paired with standardized metadata. RDFace enables the development and evaluation of data-efficient AI models for rare disease diagnosis under real-world low-data constraints. We benchmark multiple pretrained vision backbones using cross-validation and explore synthetic augmentation with DreamBooth and FastGAN. Generated images are filtered via facial landmark similarity to maintain phenotype fidelity and merged with real data, improving diagnostic accuracy by up to 13.7% in ultra-low-data regimes. To assess semantic validity, phenotype descriptions generated by a vision-LLM from real and synthetic images achieve a report similarity score of 0.84. RDFace establishes a transparent, benchmark-ready dataset for equitable rare disease AI research and presents a scalable framework for evaluating both diagnostic performance and the integrity of synthetic medical imagery.


Authors (8)

Summary

The paper introduces RDFace, a medically curated benchmark dataset addressing extreme data scarcity in rare disease diagnosis.
It benchmarks six supervised classification models and few-shot learning protocols to quantify diagnostic accuracy in low-data scenarios.
It demonstrates that phenotype-aware synthetic data generation with DreamBooth significantly enhances AI performance while preserving clinical relevance.

RDFace: A Benchmark for Rare Disease Facial Analysis Under Data Scarcity

Motivation and Dataset Construction

RDFace addresses a persistent challenge in rare disease diagnosis: extreme data scarcity combined with high phenotype similarity across conditions. The dataset comprises 456 pediatric frontal facial images representing 103 rare genetic diseases (mean 4.4 images per class, min 1, max 7), each accompanied by standardized metadata (disease, gene, Orphanet code, etc.). Images were ethically sourced, and labels were cross-verified against Orphanet and checked through expert clinical review to ensure medical plausibility and correct labeling.

Emphasis was placed on children under age 12 to minimize confounding and maximize disease-relevant facial features. The images span 46 countries. Geographic diversity was a priority, but coverage is inevitably limited by the global rarity and regional concentration of certain syndromes.

The class-structured directory and standardized metadata facilitate reproducibility and ease benchmark adoption. Clinical geneticists supervised image curation, and an independent review by two clinical fellows confirmed the image-label associations.

Methodological Framework

RDFace enables systematic evaluation of AI models for rare disease diagnosis in ultra-low-shot, data-scarce settings through three core components:

Supervised classification: Six backbone architectures (ResNet-152, DenseNet-169, FaceNet, VGG-16, Swin Transformer, and CLIP ViT-B/32) were benchmarked using cross-validation with strict train/test splits. Singleton classes (those with a single image) are used for training only. Top-k accuracy was the primary metric.
Few-shot learning: Prototypical Networks were configured for N-way 1-shot classification in 5-way, 10-way, and 15-way settings, excluding singleton classes. Results are averaged over 5-fold cross-validation.
Synthetic data generation and augmentation: DreamBooth (diffusion-based, class-conditioned) and FastGAN (unconditional, class-agnostic) generated synthetic faces. Landmark-based similarity and clinical expert review evaluated phenotype fidelity; VLMs (Qwen2.5-VL, LLaVA-NEXT) generated phenotype reports for semantic validation.
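The rule that singleton classes are used for training only can be illustrated with a small fold builder. This is a minimal sketch under stated assumptions, not the authors' splitting code: the function name `make_folds`, the round-robin assignment, the default of five folds, and the seed handling are all hypothetical choices made for illustration.

```python
import random
from collections import defaultdict


def make_folds(labels, n_folds=5, seed=0):
    """Return a list of (train_indices, test_indices) pairs.

    Assumption for illustration: classes with a single image never enter a
    test fold; images of every other class are dealt round-robin across folds.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    test_folds = [[] for _ in range(n_folds)]
    for label in sorted(by_class):
        members = by_class[label]
        if len(members) == 1:
            continue  # singleton class: kept in training for every fold
        rng.shuffle(members)
        offset = rng.randrange(n_folds)  # avoid always starting at fold 0
        for j, idx in enumerate(members):
            test_folds[(offset + j) % n_folds].append(idx)

    all_idx = set(range(len(labels)))
    return [(sorted(all_idx - set(test)), sorted(test)) for test in test_folds]


if __name__ == "__main__":
    toy_labels = ["a"] * 3 + ["b"] + ["c"] * 5  # class "b" is a singleton
    for train_idx, test_idx in make_folds(toy_labels):
        print(len(train_idx), len(test_idx))
```

Each non-singleton image lands in exactly one test fold, so every image is evaluated at most once across the folds, while singleton images contribute to training in every fold.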

The pipeline includes preprocessing steps (super-resolution, colorization), automated quality filtering (RetinaFace and LPIPS metrics), and augmentation of both training and support sets for downstream tasks.

Baseline and Comparative Results

Supervised Classification

DenseNet-169 yielded the highest Top-1 accuracy (15.93%) when trained on real data only, followed by Swin Transformer (14.34%) and VGG-16 (11.68%). Accuracy rose with larger k (e.g., DenseNet Top-5: 33.63%, Top-30: 64.42%). CLIP and FaceNet underperformed, suggesting limited utility of language-supervised and face-embedding models for this domain.
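Top-k accuracy counts a prediction as correct when the true class is among the k highest-scoring classes. The following is a minimal sketch of that metric in plain Python; the function name and the toy scores are illustrative and are not taken from the paper.

```python
def top_k_accuracy(scores, targets, k):
    """Fraction of samples whose true class is among the k top-scoring classes.

    scores:  list of per-sample score lists, one score per class
    targets: list of integer class indices
    """
    hits = 0
    for row, target in zip(scores, targets):
        # Class indices ordered from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if target in ranked[:k]:
            hits += 1
    return hits / len(targets)


if __name__ == "__main__":
    toy_scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
    toy_targets = [1, 1, 0]
    for k in (1, 2, 3):
        print(k, top_k_accuracy(toy_scores, toy_targets, k))
```

With 103 classes and only a handful of images per class, reporting several values of k shows how often the correct diagnosis appears in a shortlist even when the single top prediction is wrong.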

Few-shot Learning

Few-shot episodic evaluation yielded higher accuracy than full supervised classification, although the two are not directly comparable because each episode involves far fewer candidate classes. DenseNet reached 26.20% in 5-way 1-shot, with ResNet providing more stable results as k increased. Performance decayed with more "ways," underscoring the difficulty of differentiating rare disease classes given high inter-class similarity and intra-class variance.
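The core of a Prototypical Network at inference time is simple: each class prototype is the mean of its support embeddings, and a query is assigned to the nearest prototype. The sketch below assumes embeddings have already been produced by some backbone and uses squared Euclidean distance, which is the standard choice for this method; the function names and toy vectors are hypothetical.

```python
def build_prototypes(support):
    """support: dict mapping class label -> list of embedding vectors.

    Returns a dict mapping each label to the mean of its support embeddings.
    """
    protos = {}
    for label, vectors in support.items():
        dim = len(vectors[0])
        protos[label] = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
    return protos


def classify(query, protos):
    """Return the label of the prototype closest to the query embedding."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(protos, key=lambda label: sq_dist(query, protos[label]))


if __name__ == "__main__":
    # Toy 2-way episode: class "A" has one support sample (1-shot), "B" has two.
    toy_support = {"A": [[0.0, 0.0]], "B": [[1.0, 1.0], [3.0, 1.0]]}
    protos = build_prototypes(toy_support)
    print(classify([0.2, 0.1], protos), classify([1.8, 0.9], protos))
```

In the 1-shot case the prototype is just the single support embedding, which is one reason accuracy is sensitive to intra-class variance.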

Synthetic Data Generation and Evaluation

DreamBooth vs FastGAN

DreamBooth’s class-conditioned generation (prompted with disease-specific text, fine-tuned per class) consistently produced synthetic faces with high clinical plausibility, as validated by facial landmark analysis (mean rank: 19.74) and expert review (up to 76% judged plausible, Cohen’s κ = 0.654). FastGAN images were markedly less clinically plausible (only 2-38% plausible, κ = 0.069), and failed to preserve phenotype fidelity.
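Cohen's kappa measures agreement between two raters after correcting for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the chance agreement implied by each rater's label frequencies. A minimal sketch with made-up ratings (1 = plausible, 0 = not plausible) follows; the ratings are illustrative and are not data from the paper.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length lists of categorical ratings."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters pick the same category at random.
    expected = sum(freq_a[c] * freq_b[c] for c in set(freq_a) | set(freq_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical category throughout
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    ratings_a = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
    ratings_b = [1, 1, 0, 0, 1, 0, 1, 0, 0, 1]
    print(round(cohens_kappa(ratings_a, ratings_b), 3))
```

Under the commonly used Landis and Koch reading, a value near 0.65 indicates substantial agreement, while a value near 0.07 is barely above chance.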

RetinaFace and LPIPS confirmed high visual quality for images from both models, but DreamBooth samples exhibited superior phenotype alignment. Filtering by landmark similarity effectively prioritized structurally coherent synthetic samples across diseases.
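The summary does not specify how landmark similarity is computed, so the following is only one plausible way to rank synthetic faces by structural closeness to a real reference. It assumes 2D landmarks are already extracted in a consistent order, normalizes away translation and scale, and uses mean point-to-point distance; all function names and the toy coordinates are hypothetical.

```python
import math


def normalize(landmarks):
    """Center (x, y) landmarks on their centroid and scale to unit RMS radius."""
    n = len(landmarks)
    cx = sum(x for x, _ in landmarks) / n
    cy = sum(y for _, y in landmarks) / n
    centered = [(x - cx, y - cy) for x, y in landmarks]
    scale = math.sqrt(sum(x * x + y * y for x, y in centered) / n) or 1.0
    return [(x / scale, y / scale) for x, y in centered]


def landmark_distance(a, b):
    """Mean Euclidean distance between corresponding normalized landmarks."""
    a, b = normalize(a), normalize(b)
    return sum(math.dist(p, q) for p, q in zip(a, b)) / len(a)


def keep_most_similar(reference, candidates, top_n):
    """candidates: dict name -> landmarks. Return the top_n closest names."""
    ranked = sorted(
        candidates, key=lambda name: landmark_distance(reference, candidates[name])
    )
    return ranked[:top_n]


if __name__ == "__main__":
    real = [(0, 0), (2, 0), (1, 2)]
    synthetic = {
        "scaled_copy": [(0, 0), (4, 0), (2, 4)],   # same shape, twice the size
        "distorted": [(0, 0), (2, 0), (1, 0.5)],   # flattened geometry
    }
    print(keep_most_similar(real, synthetic, top_n=1))
```

This sketch does not correct for head rotation; a fuller treatment would add a Procrustes-style alignment before measuring distance.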

Semantic Fidelity via VLMs

Vision-LLMs generated structured phenotype descriptions for both real and synthetic faces, achieving high semantic similarity scores (BioBERT overall: 0.84). Region-wise scores showed the strongest alignment for the nose and eyes, with minor discrepancies in the consistency of lip and mouth descriptions. Repeated stochastic sampling showed that the VLM-generated reports are robust, and a cross-model comparison with LLaVA-NEXT indicated that the evaluation is largely insensitive to the choice of VLM.
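A region-wise report comparison of this kind can be sketched as cosine similarity between text embeddings of matching report sections. The sketch below assumes the embeddings (for example, from a BioBERT-style encoder) are already computed and passed in as plain lists; the per-region averaging, function names, and toy vectors are assumptions for illustration, since the summary does not state how the overall score is aggregated.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def report_similarity(real, synthetic):
    """real, synthetic: dict mapping facial region -> embedding of its description.

    Returns (per_region_scores, unweighted_mean) over regions present in both.
    """
    shared = sorted(set(real) & set(synthetic))
    per_region = {region: cosine(real[region], synthetic[region]) for region in shared}
    overall = sum(per_region.values()) / len(per_region)
    return per_region, overall


if __name__ == "__main__":
    # Placeholder 2-D vectors standing in for real sentence embeddings.
    real_report = {"nose": [1.0, 0.0], "eyes": [1.0, 1.0]}
    synth_report = {"nose": [2.0, 0.0], "eyes": [1.0, 0.0]}
    scores, mean_score = report_similarity(real_report, synth_report)
    print(scores, round(mean_score, 4))
```

Scoring each region separately is what makes it possible to see that some regions (such as nose and eyes) align better than others (such as lips and mouth).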

Scaling Synthetic Augmentation

DreamBooth augmentation improved model accuracy nonlinearly, with gains plateauing as the number of synthetic samples increased (e.g., DenseNet Top-1 accuracy rose from 15.93% to 21.06% at Top-6000, a gain of 5.13 percentage points). FastGAN augmentation consistently degraded performance, reinforcing the necessity of phenotype-aware, class-conditioned generation. MixUp and CutMix, used as generic augmentation baselines, provided limited value (Top-1 accuracy ~16%).
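For context on the generic baselines, MixUp builds a training example as a convex combination of two inputs and their one-hot labels, with the mixing weight drawn from a Beta(alpha, alpha) distribution. The sketch below operates on flat lists rather than image tensors; the value alpha = 0.2 is a common default and is an assumption here, not a setting reported in the summary.

```python
import random


def mixup(x_a, y_a, x_b, y_b, alpha=0.2, rng=random):
    """Blend two samples and their one-hot labels with weight lam ~ Beta(alpha, alpha).

    Returns (mixed_input, mixed_label, lam).
    """
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y, lam


if __name__ == "__main__":
    rng = random.Random(0)
    # Two toy "images" (flattened) from two different classes.
    mixed_x, mixed_y, lam = mixup([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], rng=rng)
    print(mixed_x, mixed_y, lam)
```

Because MixUp interpolates pixels without regard to facial structure, it is one example of a phenotype-agnostic augmentation of the kind the results above suggest adds little in this setting.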

Impact on Few-shot Learning

DreamBooth augmentation enhanced few-shot learning performance across most backbones and settings, with DenseNet reaching 29.88% in 5-way 1-shot. Greater gains were observed with increased support set size (5-shot configurations).

Theoretical and Practical Implications

RDFace rigorously characterizes the technical limitations of current vision models for rare disease diagnosis under extreme data scarcity, demonstrating the inadequacy of generic augmentation and unconditional synthetic generation. The dataset enables reproducible benchmarking for both classification and few-shot tasks. Results indicate that phenotype-aligned synthetic data, coupled with landmark-based filtering, meaningfully improves diagnostic accuracy while retaining clinical relevance.

The semantic evaluation with VLMs offers a scalable framework for multimodal phenotype interpretation and assessment, with potential for future clinical integration and educational application.

Bias analysis confirmed broadly consistent performance across geographic regions, but limitations in demographic metadata remain a challenge for future dataset collection and cross-population generalizability.

Conclusion

RDFace establishes a transparent, medically curated benchmark for rare disease facial image analysis, supporting systematic evaluation under real-world low-data constraints. DreamBooth-based phenotype-aware synthetic augmentation is effective in both supervised and few-shot learning, improving diagnostic accuracy by up to 13.7% and preserving clinically relevant features. The integration of structural landmark metrics and semantic VLM descriptions provides robust multi-faceted validation of synthetic image integrity.

Future directions include expansion of curated, demographically diverse datasets, refinement of phenotype-aware generation, exploration of VLMs for clinical interpretability, and further theoretical analysis of signal-to-noise tradeoffs in ultra-low-shot medical imaging AI.

RDFace provides the community with foundational infrastructure to advance equitable, reproducible, and clinically relevant AI for rare disease diagnosis (2604.03454).
