DeepLense ML4SCI Benchmark
- DeepLense ML4SCI Benchmark is a simulated dataset and task suite for studying strong gravitational lensing, featuring controlled data splits and explicit performance metrics.
- It supports two key tasks—classifying dark matter substructure (no_sub, cdm, axion) and performing image super-resolution on low-SNR astrophysical images using standardized evaluation metrics.
- The benchmark leverages MAE pretraining on a ViT encoder to enhance both classification and super-resolution, with ablation studies demonstrating trade-offs between mask ratios and model performance.
The DeepLense ML4SCI benchmark is a simulated image dataset and task suite designed to facilitate the development and rigorous evaluation of machine learning models for strong gravitational lensing analysis. It provides a controlled setting for two key downstream tasks: classification of dark-matter substructure models and image super-resolution. Generated using the lenstronomy package, its well-defined data splits, explicit performance metrics, and extensible framework position it as a foundational resource for ML4SCI (machine learning for science), specifically for dark matter substructure inference and low-SNR astrophysical imaging (Prasha et al., 7 Dec 2025).
1. Dataset Composition and Simulation Protocol
The DeepLense ML4SCI benchmark encompasses two principal datasets targeting classification and super-resolution objectives. All images are strong lensing surface-brightness maps of galaxy–galaxy systems, produced with lenstronomy [Birrer & Amara 2018]. No additional information is reported regarding the lens mass profile, source properties, redshift distribution, noise model, or instrumental PSF assumptions.
- Classification Dataset (Dataset1):
- Three classes based on dark matter substructure:
- no_sub (smooth lens, no substructure): 29,449 images
- cdm (ΛCDM-like subhalos): 29,759 images
- axion (axion/wave-induced substructure): 29,896 images
- Image format: single-channel, 64×64, stored as .npy arrays.
- Splits are stratified per class: 90% train / 10% test.
- Super-Resolution Dataset (Dataset2):
- Pairs of 10,000 LR (16×16) and HR (64×64) images, derived solely from the no_sub class.
- 90% train / 10% test split.
- For training, each LR image is upsampled via nearest-neighbor interpolation to 64×64 before patch embedding.
- A patch size of $P \times P$ divides each upsampled 64×64 image into $(64/P)^2$ patches (the specific patch size is not recoverable from this text).
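The preprocessing described above can be sketched in a few lines of numpy. This is a minimal illustration, not the benchmark's actual code: the function names are hypothetical, but the factor-4 nearest-neighbor upsampling (16×16 → 64×64) and the 90/10 split follow the stated protocol.

```python
import numpy as np

def nearest_upsample(lr: np.ndarray, factor: int = 4) -> np.ndarray:
    """Nearest-neighbor upsampling of an (H, W) image by an integer factor."""
    return np.repeat(np.repeat(lr, factor, axis=0), factor, axis=1)

def train_test_split(n: int, test_frac: float = 0.10, seed: int = 0):
    """90/10 index split, mirroring the benchmark's per-class protocol."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_test = int(round(n * test_frac))
    return idx[n_test:], idx[:n_test]

# Toy stand-in for a 16x16 LR lensing image stored as a .npy array.
lr = np.arange(16 * 16, dtype=np.float32).reshape(16, 16)
hr_input = nearest_upsample(lr, factor=4)   # 64x64, then fed to patch embedding
train_idx, test_idx = train_test_split(10_000)
```

For Dataset2 this split leaves 9,000 LR/HR pairs for training and 1,000 for testing.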
2. Downstream Tasks
The benchmark delineates two supervised tasks with distinct objectives and loss functions:
- Dark-Matter Model Classification
- Objective: Assign each 64×64 image to one of three classes (no_sub, cdm, axion).
- Loss: Multi-class cross-entropy,
$$\mathcal{L}_{\mathrm{CE}} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{3} y_{i,c}\,\log \hat{p}_{i,c},$$
where $y_{i,c}$ is the binary class indicator and $\hat{p}_{i,c}$ the predicted probability.
- Evaluation: One-vs-rest ROC curves, macro-averaged AUC, overall accuracy, and macro-F1.
- Super-Resolution
- Objective: Map low-resolution (16×16) lensing images to ground-truth high-resolution (64×64) targets.
- Loss: Pixel-wise mean squared error (MSE),
$$\mathcal{L}_{\mathrm{SR}} = \frac{1}{N}\sum_{i=1}^{N}\big\| D_{\mathrm{SR}}\big(E(x_i^{\mathrm{LR}})\big) - x_i^{\mathrm{HR}} \big\|_2^2,$$
where $D_{\mathrm{SR}}$ is the task-specific decoder and $E$ the encoder.
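The two task objectives can be written compactly in numpy. This is an illustrative sketch only: `cross_entropy` and `mse_loss` are hypothetical helper names, and the paper's actual models are transformer-based rather than these toy arrays.

```python
import numpy as np

def cross_entropy(probs: np.ndarray, labels: np.ndarray) -> float:
    """Mean multi-class cross-entropy; probs is (N, 3), labels in {0, 1, 2}."""
    n = probs.shape[0]
    return float(-np.mean(np.log(probs[np.arange(n), labels] + 1e-12)))

def mse_loss(pred: np.ndarray, target: np.ndarray) -> float:
    """Pixel-wise MSE between predicted and ground-truth HR images."""
    return float(np.mean((pred - target) ** 2))

# Toy check: a confident, correct classifier has near-zero CE loss.
probs = np.array([[0.98, 0.01, 0.01], [0.02, 0.96, 0.02]])
labels = np.array([0, 1])
ce = cross_entropy(probs, labels)
mse = mse_loss(np.zeros((64, 64)), np.ones((64, 64)))
```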
3. Evaluation Metrics and Baseline Performance
Standardized metrics and explicit baseline results enable direct, reproducible comparisons:
- Classification Metrics:
- Accuracy: Fraction of correctly classified test images,
$$\mathrm{Acc} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{1}\!\left[\hat{y}_i = y_i\right].$$
- Macro-Averaged AUC: Unweighted mean of the three one-vs-rest AUCs.
- Macro-F1: Arithmetic mean of per-class F1 scores.
- ROC AUC: Area under the curve of true-positive rate versus false-positive rate, with
$$\mathrm{TPR} = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}, \qquad \mathrm{FPR} = \frac{\mathrm{FP}}{\mathrm{FP}+\mathrm{TN}}.$$
- Super-Resolution Metrics:
- MSE:
$$\mathrm{MSE} = \frac{1}{HW}\sum_{i,j}\big(x_{ij} - \hat{x}_{ij}\big)^2.$$
- PSNR (Peak Signal-to-Noise Ratio):
$$\mathrm{PSNR} = 10\log_{10}\!\left(\frac{\mathrm{MAX}^2}{\mathrm{MSE}}\right).$$
- SSIM (Structural Similarity):
$$\mathrm{SSIM}(x,\hat{x}) = \frac{(2\mu_x\mu_{\hat{x}} + C_1)(2\sigma_{x\hat{x}} + C_2)}{(\mu_x^2 + \mu_{\hat{x}}^2 + C_1)(\sigma_x^2 + \sigma_{\hat{x}}^2 + C_2)},$$
where $\mu$, $\sigma$ are local patch statistics and $C_1$, $C_2$ stabilizers.
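The metrics above can be computed in a few lines of numpy. This is a minimal sketch: `ssim_global` evaluates SSIM over a single global window for brevity, whereas standard SSIM (and presumably the benchmark's evaluation) averages over local sliding windows.

```python
import numpy as np

def accuracy(pred: np.ndarray, true: np.ndarray) -> float:
    """Fraction of correctly classified samples."""
    return float(np.mean(pred == true))

def macro_f1(pred: np.ndarray, true: np.ndarray, num_classes: int = 3) -> float:
    """Arithmetic mean of per-class F1 scores."""
    f1s = []
    for c in range(num_classes):
        tp = np.sum((pred == c) & (true == c))
        fp = np.sum((pred == c) & (true != c))
        fn = np.sum((pred != c) & (true == c))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(f1s))

def psnr(x: np.ndarray, y: np.ndarray, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB."""
    mse = np.mean((x - y) ** 2)
    return float(10.0 * np.log10(max_val ** 2 / mse))

def ssim_global(x: np.ndarray, y: np.ndarray, max_val: float = 1.0) -> float:
    """Single-window SSIM (standard SSIM averages over local patches)."""
    c1, c2 = (0.01 * max_val) ** 2, (0.03 * max_val) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return float((2 * mx * my + c1) * (2 * cov + c2)
                 / ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2)))

rng = np.random.default_rng(0)
img = rng.random((64, 64))
noisy = np.clip(img + 0.01 * rng.standard_normal((64, 64)), 0, 1)
```

With noise of standard deviation 0.01 on a unit-range image, the PSNR lands around 40 dB, well above the ~33 dB reported for 4× super-resolution in the tables below.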
Baselines and reference performance after 10 epochs of fine-tuning:
| Model/Setting | CLS AUC | CLS Acc | Macro-F1 | SR MSE | SR PSNR | SR SSIM |
|---|---|---|---|---|---|---|
| MAE, 75%, fine-tune | 0.9232 | 0.6722 | 0.6332 | 0.000522 | 33.05 | 0.9610 |
| MAE, 75%, frozen | 0.5365 | 0.3406 | 0.2711 | — | — | — |
| Scratch (ViT) | 0.9567 | 0.8246 | 0.8177 | 0.000523 | 33.01 | 0.9552 |
4. Masked Autoencoder (MAE) Pretraining and Ablation Studies
The cited study (Prasha et al., 7 Dec 2025) pretrains a Vision Transformer (ViT) encoder using the MAE strategy on DeepLense no_sub images. Details of the pretraining pipeline include:
- Pretraining Data: only the no_sub (smooth lens) subset, totaling 29,449 images.
- Patch size: $P \times P$, yielding $(64/P)^2$ patches per 64×64 image (specific value not recoverable from this text).
- Model: 6-block ViT encoder (3 attention heads, embedding dim 192) and 2-block transformer decoder.
- Masking: Mask ratios $M \in \{0.50, 0.75, 0.90\}$, default $M = 0.75$.
- MAE Loss:
$$\mathcal{L}_{\mathrm{MAE}} = \frac{1}{|\mathcal{M}|}\sum_{p \in \mathcal{M}} \big\| \hat{x}_p - x_p \big\|_2^2,$$
with $\mathcal{M}$ the set of masked patches.
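The masking and reconstruction objective can be sketched in numpy. This is illustrative only: the patch size `p=8` is an assumption (the actual value is not recoverable from this text), and a real MAE encodes only visible patches with a ViT rather than operating on raw pixels.

```python
import numpy as np

def patchify(img: np.ndarray, p: int) -> np.ndarray:
    """Split an (H, W) image into (num_patches, p*p) flattened patches."""
    h, w = img.shape
    patches = img.reshape(h // p, p, w // p, p).swapaxes(1, 2)
    return patches.reshape(-1, p * p)

def random_mask(num_patches: int, mask_ratio: float, rng) -> np.ndarray:
    """Boolean mask hiding round(mask_ratio * num_patches) random patches."""
    n_mask = int(round(mask_ratio * num_patches))
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.permutation(num_patches)[:n_mask]] = True
    return mask

def mae_loss(recon: np.ndarray, target: np.ndarray, mask: np.ndarray) -> float:
    """MSE averaged over masked patches only, as in the MAE objective."""
    return float(np.mean((recon[mask] - target[mask]) ** 2))

rng = np.random.default_rng(0)
img = rng.random((64, 64))
patches = patchify(img, p=8)          # 64 patches of 64 pixels (p=8 is illustrative)
mask = random_mask(len(patches), mask_ratio=0.75, rng=rng)
loss = mae_loss(np.zeros_like(patches), patches, mask)
```

At the default ratio of 0.75, 48 of the 64 patches in this sketch are hidden, and the loss is computed only on those.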
Mask ratio ablations demonstrate a trade-off: the highest mask ratio ($M = 0.90$) yields the best classification performance but the lowest reconstruction quality (PSNR), as tabulated below.
| Mask Ratio (M) | MAE Loss | CLS AUC | CLS Acc | SR PSNR | SR SSIM |
|---|---|---|---|---|---|
| 0.50 | 0.0026 | 0.9502 | 0.7936 | 34.01 | 0.9544 |
| 0.75 | 0.0029 | 0.9232 | 0.6722 | 33.05 | 0.9610 |
| 0.90 | 0.0045 | 0.9681 | 0.8865 | 32.65 | 0.9550 |
Optimal settings: For pure classification, a high mask ratio (90%) is recommended; for super-resolution, a lower ratio (50%) is preferred; for balanced performance, the intermediate default $M = 0.75$ is a reasonable compromise.
5. Integration of DeepLense ML4SCI in Joint Task Pipelines
The benchmark is instrumental in demonstrating that a single, MAE-pretrained encoder can be fine-tuned for both classification and super-resolution tasks without retraining from scratch. The use of a unified transformer backbone, with task-specific heads for classification and super-resolution, facilitates parameter sharing and efficient adaptation.
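A minimal numpy sketch of the shared-backbone idea follows. The embedding dimension (192) and class count (3) follow the configuration above; the one-layer "encoder", linear heads, token count, and 8×8-pixel patch output are stand-in assumptions, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
EMBED, NUM_TOKENS, NUM_CLASSES = 192, 64, 3  # 192 and 3 per the paper; 64 tokens illustrative

def shared_encoder(tokens: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Stand-in for the pretrained ViT encoder: one nonlinear map per token."""
    return np.tanh(tokens @ w)

def classification_head(feats: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Pools tokens, then projects to 3 class logits."""
    return feats.mean(axis=0) @ w

def sr_head(feats: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Projects each token back to pixel space (one 8x8 patch per token)."""
    return feats @ w

w_enc = rng.standard_normal((EMBED, EMBED)) * 0.02
w_cls = rng.standard_normal((EMBED, NUM_CLASSES)) * 0.02
w_sr = rng.standard_normal((EMBED, 64)) * 0.02   # 8x8 = 64 pixels per patch

tokens = rng.standard_normal((NUM_TOKENS, EMBED))
feats = shared_encoder(tokens, w_enc)            # one backbone, shared by both tasks
logits = classification_head(feats, w_cls)
sr_patches = sr_head(feats, w_sr)
```

The point of the design is visible even in this toy version: both heads consume the same `feats`, so only the small task-specific weights need retraining when switching tasks.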
At $M = 0.90$, MAE pretraining yields a classifier surpassing the scratch ViT baseline in macro-AUC (0.9681 vs 0.9567) and accuracy (88.65% vs 82.46%). For super-resolution at $M = 0.75$, the pretrained model achieves marginally higher PSNR (33.05 dB vs 33.01 dB) and SSIM (0.9610 vs 0.9552) than the scratch-trained baseline. This suggests that MAE pretraining on simulated, physics-rich datasets can enhance both generalization and efficiency for scientific vision tasks.
6. Limitations and Reporting Gaps
The published details omit specifics on the physical modeling and data realism, including the type of lens mass profile, precise subhalo parameterizations, source light profiles, redshift distributions, image noise characterization, and PSF modeling. A plausible implication is that the benchmark is tailored toward model comparison and ML-methodology benchmarking rather than exhaustive astrophysical forward modeling. Absence of augmentation or observational systematics may affect transferability to real data or broader scientific conclusions.
7. Significance and Future Directions
The DeepLense ML4SCI benchmark serves as an explicit testbed for machine learning strategies in gravitational lensing, with an orientation toward model-flexible, multi-task learning demonstrated through masked autoencoder pretraining. Its task design, data structure, and reporting of performance under ablated training regimes provide a reproducible benchmark for further research in both astrophysical inference and ML methodology.
Adoption of such benchmarks is likely to standardize comparative analysis and catalyze innovation in self-supervised, simulation-based pretraining for ML4SCI. However, future refinements should address the currently missing details of the data-generation process to further enhance astrophysical realism and downstream applicability (Prasha et al., 7 Dec 2025).