
FLARE 2023: CT & Solar Flare Benchmarking

Updated 26 April 2026
  • FLARE 2023 Challenge is a dual benchmarking initiative advancing universal abdomen CT segmentation and rare-event solar flare prediction through standardized large-scale datasets.
  • The competition emphasizes robust evaluation protocols, semi-supervised techniques, and efficient inference pipelines to enhance clinical and space weather applications.
  • Key performance metrics such as Dice Similarity Coefficient for organ segmentation and True Skill Statistic for solar flares drive methodological rigor and innovation.

The FLARE 2023 Challenge encompasses two international benchmarking efforts launched in 2023, each targeting a critical domain in scientific AI: universal abdominal organ and pan-cancer segmentation in CT images, and rare-event forecasting for solar flares. Both competitions significantly advanced their fields via standardized large-scale datasets, robust evaluation protocols, and open-source resources, while highlighting methodological rigor for rare event detection and clinical translation.

1. Background and Motivation

The name "FLARE 2023 Challenge" refers to two distinct benchmarking efforts:

  • Abdomen CT Organ and Pan-cancer Segmentation: Orchestrated as the "Fast, Low-resource, and Accurate Organ and Pan-cancer segmentation in Abdomen CT" (FLARE), this challenge addressed the longstanding clinical need for an AI tool capable of universal, multi-organ, and pan-cancer lesion segmentation from highly diverse abdominal CT scans. Prevailing benchmarks (e.g., LiTS, KiTS, BraTS) focused on single organs or tumor types, limiting translational generalizability. FLARE 2023 aimed to overcome this by enabling holistic and reproducible assessments across diverse cancers and centers (Ma et al., 2024).
  • Solar Flare Prediction: In heliophysics, accurate and early solar flare prediction is an extreme class-imbalance and temporal coherence problem. Earlier work exposed inflated validation metrics due to overlapping time windows and class rarity, necessitating a robust reevaluation of sampling, normalization, and validation strategies for rare-event forecasting (Ahmadzadeh et al., 2021). Novel large-scale multimodal models (e.g., JW-Flare) have since established new performance ceilings by fusing magnetogram imagery with physics-driven parameters (Shao et al., 12 Nov 2025).

Both domains unite around robust protocol design for fair, repeatable comparison under significant class imbalance and annotation uncertainty.

2. Dataset Construction and Labeling Protocols

Abdomen CT Pan-cancer Segmentation

The FLARE 2023 CT challenge dataset comprises 4,650 abdominal CTs from over 40 institutions, featuring both contrast and non-contrast phases (plain, arterial, portal-venous, delayed) and multiple vendor platforms (Siemens, GE, Philips, Toshiba). The cohort focuses on adult oncology patients, capturing broad variability in scanning protocols (thickness ~1–5 mm, in-plane resolution ~0.6–1.0 mm).

  • Labels: Thirteen abdominal organs were contoured per RTOG and Netter’s atlas protocols, following the FLARE 2022 methodology. All visible intra-abdominal solid tumors, both primary and metastatic, were annotated in the tuning and test sets by senior radiologists using ITK-SNAP, with MedSAM model assistance.
  • Split:
    • Development: 2,200 partially-labeled scans, 1,800 entirely unlabeled (for semi-supervised methods)
    • Tuning: 100 fully annotated scans (public leaderboard)
    • Test: 400 fully annotated, multi-national cases (blinded inference) (Ma et al., 2024)

Quality control included dual-reader review and outlier correction, ensuring consensus clinical-grade reference labels.

Solar Flare Forecasting

The SWAN-SF dataset consists of multivariate time series from 4,075 solar active regions observed over nine years by SDO/HMI, each region described by 24 physics-based magnetic parameters (SHARP metadata). Time-series slicing uses a 12-h observation and 24-h prediction window, sliding in 1-h steps, with labels determined by the strongest GOES-class flare in that window.

  • Label taxonomy: Five GOES classes—X, M, C, B, N ("quiet"/A-class)—are assigned. Datasets are partitioned for climatologically balanced representation of major flares, ensuring train/validation/test independence and mitigating overestimation due to temporal coherence (Ahmadzadeh et al., 2021).
  • For JW-Flare: Training used SDO/HMI magnetogram images and SHARP-derived parameters, balancing flare/no-flare via aggressive over/undersampling to create stratified subsets for each prediction threshold (≥C, ≥M5, ≥X), with test splits comprising ~18,000 samples per class (Shao et al., 12 Nov 2025).
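The slicing scheme described above (12-h observation, 24-h prediction, 1-h step, label taken from the strongest flare in the prediction window) can be sketched as follows. This is a minimal illustration, not the SWAN-SF tooling: the function names, the hourly integer time index, and the `(hour, class)` event format are assumptions made for the example.

```python
from typing import List, Tuple

# GOES classes ordered from weakest to strongest; "N" marks a flare-quiet window.
GOES_ORDER = ["N", "B", "C", "M", "X"]


def strongest_class(classes: List[str]) -> str:
    """Return the strongest GOES class in a list, or "N" if the list is empty."""
    return max(classes, key=GOES_ORDER.index, default="N")


def slice_windows(
    n_hours: int,
    flares: List[Tuple[int, str]],
    obs_len: int = 12,
    pred_len: int = 24,
    step: int = 1,
) -> List[Tuple[int, int, str]]:
    """Return (obs_start, obs_end, label) for each sliding window.

    `flares` holds hypothetical (hour_index, goes_class) events for one
    active region. The observation window is [obs_start, obs_end); the label
    is the strongest flare in [obs_end, obs_end + pred_len).
    """
    windows = []
    for start in range(0, n_hours - obs_len - pred_len + 1, step):
        obs_end = start + obs_len
        pred_end = obs_end + pred_len
        in_window = [c for (t, c) in flares if obs_end <= t < pred_end]
        windows.append((start, obs_end, strongest_class(in_window)))
    return windows
```

Because consecutive windows share 11 of their 12 observation hours, neighbouring slices are near-duplicates, which is the source of the temporal-coherence problem discussed in the evaluation section.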

3. Challenge Protocols and Evaluation Methodologies

Abdomen CT Segmentation

  • Task: Voxel-wise segmentation of 13 organs and pan-cancer lesions.
  • Submission: All teams submitted inference pipelines (Dockerized) executed under standardized hardware (Xeon CPU, Quadro RTX 5000 GPU, 32 GB RAM).
  • Data Format: NIfTI (.nii.gz) with canonical orientation, original Hounsfield Units preserved, and no enforced resampling.
  • Submission limits: Teams were limited to five daily submissions in development and a single final test submission.
  • Primary metrics:
    • Dice Similarity Coefficient (DSC):

    $$\mathrm{DSC}(P, G) = \frac{2|P \cap G|}{|P| + |G|}$$

    where $P$ is the prediction and $G$ is the ground truth.
    • Normalized Surface Dice (NSD): Measures boundary agreement within a specified tolerance $\tau$.
    • Efficiency: Inference time and GPU memory footprint ($\mathrm{AUC}_{\mathrm{GPU}}$).
    • Instance metrics (for lesions): Precision, Recall, F1-score, Panoptic Quality (PQ) (Ma et al., 2024).

  • Ranking: Final team ranking applied a "rank-then-aggregate" approach, averaging ranks for organs, lesions, and efficiency across the hidden test set.
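As a concrete reading of the DSC definition, the sketch below computes it for two flattened binary masks. It is an illustration only, not the challenge's evaluation code; the function name and the convention that two empty masks score 1.0 are assumptions.

```python
from typing import Sequence


def dice(pred: Sequence[int], truth: Sequence[int]) -> float:
    """Dice Similarity Coefficient for two flattened binary masks."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    # |P ∩ G|: voxels marked foreground in both masks.
    intersection = sum(1 for p, g in zip(pred, truth) if p and g)
    # |P| + |G|: foreground voxel counts of each mask.
    total = sum(1 for p in pred if p) + sum(1 for g in truth if g)
    # Assumed convention: two empty masks agree perfectly.
    return 1.0 if total == 0 else 2.0 * intersection / total
```

For example, `dice([1, 1, 0, 0], [1, 0, 1, 0])` has one overlapping voxel out of two per mask, so the score is 2·1/(2+2) = 0.5.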

Solar Flare Forecasting

  • Key challenges: Extreme class imbalance (e.g., quiet samples outnumber X-class flares by roughly 800:1), temporal coherence (overlapping time slices), and non-i.i.d. feature distributions.

  • Evaluation metrics:

    • True Skill Statistic (TSS):

    $$\mathrm{TSS} = \frac{TP}{TP+FN} - \frac{FP}{FP+TN}$$

    Range: $[-1, 1]$. Unbiased by class prevalence.
    • Heidke Skill Score (HSS2): Emphasizes balance in false positives/negatives.
    • Others: TPR, FPR, precision, recall, F1, and accuracy, under their standard definitions.

  • Partition-based validation ("multifold") and splitting by AR-ID are enforced to prevent temporal leakage and metric inflation, as random splits ("unifold") produce unreasonably high TSS due to near-duplicate slices in train/test (Ahmadzadeh et al., 2021).
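The two skill scores can be computed directly from confusion-matrix counts. The sketch below uses the TSS formula given above and the commonly used form of the Heidke Skill Score; the function names are illustrative, and the HSS2 expression is the standard definition rather than something quoted from the cited papers.

```python
def tss(tp: int, fn: int, fp: int, tn: int) -> float:
    """True Skill Statistic: true-positive rate minus false-positive rate."""
    return tp / (tp + fn) - fp / (fp + tn)


def hss2(tp: int, fn: int, fp: int, tn: int) -> float:
    """Heidke Skill Score: improvement over the agreement expected by chance."""
    numerator = 2 * (tp * tn - fn * fp)
    denominator = (tp + fn) * (fn + tn) + (tp + fp) * (fp + tn)
    return numerator / denominator
```

With 8 hits, 2 misses, 100 false alarms, and 900 correct rejections, TSS is 0.8 − 0.1 = 0.7. Multiplying both negative-class counts by ten leaves TSS unchanged, whereas HSS2 drops, which illustrates why TSS is preferred when class prevalence differs between datasets.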

4. Baseline Methods and Top-performing Approaches

Abdomen CT Organ and Pan-cancer Segmentation

  • Baseline: 3D nnU-Net.

  • Top teams (see (Ma et al., 2024)):

    • Team 1 (aladdin5): Cascaded nnU-Net architecture with ROI extraction, high-res segmentation, multi-resolution cropping, pseudo-labeling, GPU-accelerated preprocessing, and strong post-processing.
    • Team 2 (citi): Partially supervised nnU-Net, with masked channels for unlabeled classes, pseudo-label selection, and CutMix augmentations.
    • Team 3 (blackbean): Self-training paradigm, with iterative label fusion from large-to-small models.
    • Team 4 (hmi306): Two-stage hybrid PHTrans (Conv/Transformer) plus mean-teacher semi-supervised training.
    • Team 5 (hanglok): Cascaded SegFormer + UNETR networks, leveraging partial convolutions and self-training.

All high-performing pipelines invested in compound loss functions (Dice + Cross-Entropy or focal), heavy augmentation (patch-based, intensity, geometric), uncertainty-driven sampling, and connected-component post-processing. Pseudo-labeling from FLARE 2022 and adversarial augmentation further bolstered performance.
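A compound Dice + cross-entropy loss of the kind mentioned above can be sketched for the binary case in plain Python. This is a simplified illustration: the teams' actual implementations are multi-class, operate on 3D patches, and are not reproduced here. The function name, the weights, and the smoothing constant `eps` are assumptions.

```python
import math
from typing import Sequence


def dice_ce_loss(
    probs: Sequence[float],
    targets: Sequence[int],
    ce_weight: float = 1.0,
    dice_weight: float = 1.0,
    eps: float = 1e-6,
) -> float:
    """Weighted sum of binary cross-entropy and soft Dice loss."""
    n = len(probs)
    # Binary cross-entropy, with probabilities clamped away from 0 for log().
    ce = -sum(
        t * math.log(max(p, eps)) + (1 - t) * math.log(max(1.0 - p, eps))
        for p, t in zip(probs, targets)
    ) / n
    # Soft Dice: the overlap uses predicted probabilities instead of hard labels.
    intersection = sum(p * t for p, t in zip(probs, targets))
    soft_dice = (2.0 * intersection + eps) / (sum(probs) + sum(targets) + eps)
    return ce_weight * ce + dice_weight * (1.0 - soft_dice)
```

The cross-entropy term gives a per-voxel gradient signal, while the Dice term is computed over the whole region and is therefore less dominated by the large background class, which is one common motivation for combining the two.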

Solar Flare Prediction

  • Classical approaches: SVMs, shallow CNNs (VGG), and Swin-Transformer baselines.
  • SWAN-SF methodology: Undersampling/oversampling, climatology-preserving sampling, cost-sensitive SVMs, and time-series feature extraction (median, stdev, skewness, kurtosis) (Ahmadzadeh et al., 2021). Experiments showed time-series statistics consistently outperformed static ("last-value") approaches.
  • JW-Flare: Multimodal LLM (Qwen2-VL-7B-Instruct backbone) fusing a Vision Transformer for image encoding and natural language prompts for physics parameters. Trained via LoRA finetuning for high efficiency (1% updated parameters), it operated on 15 consecutive magnetograms plus SHARP-parameter prompts, outputting a constrained binary choice (Flare/None). Aggressive data balancing yielded optimal TSS without sacrificing recall for rare X-class events (Shao et al., 12 Nov 2025).
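The time-series feature extraction listed for SWAN-SF (median, standard deviation, skewness, kurtosis) can be sketched with the standard library. The function name, the use of population moments, and the excess-kurtosis convention are assumptions made for the example; the "last_value" entry is included only to contrast with the static baseline.

```python
import statistics
from typing import Dict, Sequence


def summarize(series: Sequence[float]) -> Dict[str, float]:
    """Summary statistics for one physical parameter over one window."""
    n = len(series)
    mean = statistics.fmean(series)
    # Central moments of order 2, 3, and 4 (population form).
    m2 = sum((x - mean) ** 2 for x in series) / n
    m3 = sum((x - mean) ** 3 for x in series) / n
    m4 = sum((x - mean) ** 4 for x in series) / n
    return {
        "last_value": series[-1],  # the static baseline feature
        "median": statistics.median(series),
        "stdev": statistics.pstdev(series),
        # A constant series has zero variance; report 0.0 instead of dividing.
        "skewness": m3 / m2 ** 1.5 if m2 > 0 else 0.0,
        "kurtosis": m4 / m2 ** 2 - 3.0 if m2 > 0 else 0.0,  # excess kurtosis
    }
```

Applied to each of the 24 parameters in a 12-h window, this would yield a fixed-length feature vector that a classifier such as an SVM can consume.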

5. Results and Comparative Analysis

Abdomen CT Segmentation

The winning model achieved 92.3% average DSC for organs and 64.9% for lesions on the hidden test set, with the top five teams tightly clustered (organ DSC range: 91.8–92.3%; lesion DSC: 60–65%). Best organ-wise results were for large solid organs (liver, kidney, spleen >97% DSC); small/tubular organs lagged slightly (83–89% for gallbladder, adrenals; 90–98% for aorta/IVC).

  • Lesion segmentation: The winner attained median semantic DSC=75.9%, NSD<60%, instance precision ~80%, but recall ~20–30% and F1 ~40%. Larger lesions were segmented more accurately.
  • Efficiency: Winning pipeline processed each case in 8.6 s (3.56 GB GPU mem.), with leading competitors faster (~4 s) or more memory-efficient.
  • Ensembling: Provided marginal PQ gains with substantial variability per case.
  • Ranking stability: Confirmed with $N = 1000$ bootstrap test samples (Kendall's $\tau \gtrsim 0.9$) (Ma et al., 2024).
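The rank-then-aggregate scheme and the bootstrap stability check can be sketched as below. This is a simplified illustration under stated assumptions: it ranks teams on a single higher-is-better metric per case, whereas the challenge combined organ, lesion, and efficiency ranks; all function names are hypothetical, and Kendall's tau is computed in its basic tau-a form.

```python
import random
from typing import Dict, List


def rank_values(values: List[float]) -> List[float]:
    """Rank values with 1 = best (highest); ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def aggregate_ranks(scores: Dict[str, List[float]], cases: List[int]) -> Dict[str, float]:
    """Rank teams on each selected case, then average each team's ranks."""
    teams = list(scores)
    totals = {t: 0.0 for t in teams}
    for c in cases:
        for t, r in zip(teams, rank_values([scores[t][c] for t in teams])):
            totals[t] += r
    return {t: totals[t] / len(cases) for t in teams}


def kendall_tau(a: List[float], b: List[float]) -> float:
    """Kendall's tau-a: (concordant - discordant) pairs over all pairs."""
    n = len(a)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (a[i] - a[j]) * (b[i] - b[j])
            s += (prod > 0) - (prod < 0)
    return s / (n * (n - 1) / 2)


def bootstrap_stability(scores: Dict[str, List[float]], n_boot: int = 1000, seed: int = 0) -> float:
    """Mean Kendall's tau between the full ranking and bootstrap rankings."""
    teams = list(scores)
    n_cases = len(scores[teams[0]])
    full = aggregate_ranks(scores, list(range(n_cases)))
    rng = random.Random(seed)
    taus = []
    for _ in range(n_boot):
        # Resample test cases with replacement and re-rank the teams.
        sample = [rng.randrange(n_cases) for _ in range(n_cases)]
        boot = aggregate_ranks(scores, sample)
        taus.append(kendall_tau([full[t] for t in teams], [boot[t] for t in teams]))
    return sum(taus) / len(taus)
```

A mean tau close to 1 indicates that the ordering of teams barely changes when the test set is resampled, which is the sense in which the reported ranking is described as stable.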

Solar Flare Forecasting

  • JW-Flare: Established SOTA TSS for X-class events (TSS=0.95, TPR=1.00 on 18,949 test samples), outperforming SVM, CNN, and prior transformer models (TSS ~0.51–0.56). For M5-class, JW-Flare reached TSS=0.88 (TPR=0.97).
  • Baseline models: Non-multimodal benchmarks exhibited substantially lower TSS, especially for rare events.
  • Temporal generalization: Performance on independent observatories (HSOS, ASO-S/FMG) remained strong (TPR=0.83–0.99), but diminished on temporally offset SDO/HMI (TSS=0.64), indicating need for ongoing temporal adaptation (Shao et al., 12 Nov 2025).

| Challenge | Leading Metric (Test) | Top Result | Notable Feature |
| --- | --- | --- | --- |
| Abdomen CT | Organ DSC (%) | 92.3 (±3.3) | Cascaded nnU-Net, pseudo-labels |
| Abdomen CT | Lesion DSC (%) | 64.9 (±27.4) | |
| Solar Flare | TSS (X-class) | 0.95 (TPR=1.0) | Multimodal LLM, LoRA |
| Solar Flare | TSS (M5-class) | 0.88 | Physics-parametric prompts |

6. Methodological Advances and Insights

Class Imbalance and Temporal Coherence

Solar flare forecasting revealed pronounced overoptimism in models trained with temporally incoherent splits; strict partitioning or AR-based groupings are essential for accurate generalization (Ahmadzadeh et al., 2021). Class imbalance remedies—cost-sensitive weighting, climatology-preserving sampling, combined under/oversampling—directly affected skill metrics and model stability.
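An AR-based grouping can be sketched as a split that assigns whole active regions to either the training or the test set, so that overlapping slices from one region never appear on both sides. This is a minimal illustration, not the SWAN-SF partitioning itself; the function name, the random region assignment, and the default test fraction are assumptions.

```python
import random
from typing import List, Sequence, Tuple


def split_by_region(
    region_ids: Sequence[int], test_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Split sample indices so that no active region spans both sets.

    `region_ids[i]` is the active-region identifier of sample i.
    """
    regions = sorted(set(region_ids))
    random.Random(seed).shuffle(regions)
    # Hold out at least one region, rounded from the requested fraction.
    n_test = max(1, round(test_fraction * len(regions)))
    test_regions = set(regions[:n_test])
    train_idx = [i for i, r in enumerate(region_ids) if r not in test_regions]
    test_idx = [i for i, r in enumerate(region_ids) if r in test_regions]
    return train_idx, test_idx
```

A plain random split over samples would instead scatter near-identical neighbouring slices across both sets, which is the mechanism behind the inflated scores reported for "unifold" validation.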

For the segmentation task, challenge design embraced real-world partial labels and highly unbalanced lesion prevalence, necessitating semi-supervised techniques and robust loss weighting.

Multimodal and Partially-Supervised Learning

JW-Flare's demonstration that multimodal LLMs (with prompt-constrained outputs) can surpass physics-driven CNNs highlights the value of joint vision-text modeling and flexible dataset ontologies for rare-event prediction (Shao et al., 12 Nov 2025). Explainability probes tied high-importance neuron subsets to physical solar concepts (e.g., magnetic reconnection), evidencing domain knowledge reuse from pretraining.

In segmentation, success correlated with aggressive pseudo-label generation, mean-teacher or self-training frameworks, and exploitation of prior-year winners' outputs for ROI priming (Ma et al., 2024).

7. Open Challenges and Future Directions

Remaining bottlenecks include improving detection and segmentation of small organs and lesions (notably, persistent low recall in lesion instance segmentation), generalizing models temporally and geographically (solar cycle adaptation, unseen CT sites), and incorporating multimodal inputs (e.g., PET/MR, chromospheric data). Advancing synthetic data augmentation (as in recent CVPR methods) and integrating report-guided segmentation protocols are identified expansion paths.

Publicly available datasets, annotations, and Dockerized pipelines (https://codalab.lisn.upsaclay.fr/competitions/12239 for segmentation; SWAN-SF and JW-Flare code for flare forecasting) now serve as the baseline for further progress. To ensure transparency, both competitions adopted BIAS and CLAIM checklists for standardized reporting.

A plausible implication is that both clinical imaging AI and space weather forecasting will continue to converge on semi-supervised, multimodal, and robustly evaluated frameworks, anchored by large-scale open benchmarks and increasingly transparent methodological standards (Ahmadzadeh et al., 2021, Ma et al., 2024, Shao et al., 12 Nov 2025).
