Emilia Dataset: Radar, Speech & Landslide Benchmarks
- Emilia Dataset is a family of three open-access resources covering radar precipitation nowcasting, multilingual speech synthesis, and landslide detection.
- The radar dataset offers six years of high-resolution, 5-minute interval composites and robust quality control for advanced spatiotemporal modeling.
- The speech dataset, processed via Emilia-Pipe, and the satellite landslide subset enable benchmarking for TTS research and disaster mapping, respectively.
The term “Emilia Dataset” refers to three distinct academic datasets that share the name: (1) a large-scale, continuous radar-based precipitation dataset over the Emilia-Romagna region for spatiotemporal modeling, (2) a large multilingual, in-the-wild speech corpus and pipeline for text-to-speech and related tasks, and (3) the Emilia-Romagna subset within the LMHLD satellite landslide benchmark. All three are recent, open-access, and serve as foundational resources for benchmarking and methodological innovation in their respective domains.
1. Radar-Based Precipitation: Emilia Dataset for Nowcasting
The Emilia precipitation dataset underpins the GPTCast generative nowcasting framework. It aggregates six years (2015–2020) of C-band radar coverage over Emilia-Romagna, providing 1 km × 1 km, 5-minute interval composites across 71,172 km². The data result from merging volumetric scans of two dual-polarization radars (Gattatico at 44°47′27″N, 10°29′54″E; San Pietro Capofiume at 44°39′19″N, 11°37′23″E). Each scan is preprocessed with the ARPAE QC chain (Fornasiero et al. 2006, 2008), which applies beam-blockage correction, clutter/anomalous-scattering suppression, and vertical-profile correction. Only “precipitating” frames are retained, representing ≈28.4% (179,264/630,720) of the total possible frames; non-precipitating periods are discarded to focus model training on relevant spatiotemporal sequences (Franch et al., 2024).
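The frame counts quoted above can be checked with a few lines of arithmetic. This is only a sanity-check sketch; it assumes that the total of 630,720 possible frames was computed with 365-day years (leap days ignored), which is the assumption that reproduces the figure.

```python
# Frame budget implied by the figures quoted in the text.
FRAMES_PER_DAY = 24 * 60 // 5   # 288 five-minute slots per day
YEARS = 6                       # 2015 through 2020
DAYS_PER_YEAR = 365             # assumption: leap days are ignored

total_frames = YEARS * DAYS_PER_YEAR * FRAMES_PER_DAY
retained_frames = 179_264       # precipitating frames reported in the text
retained_fraction = retained_frames / total_frames

print(f"possible frames: {total_frames}")
print(f"retained share:  {retained_fraction:.4f}")
```

Under this assumption the retained share comes out at roughly 0.284, so a little over a quarter of the five-minute slots contain precipitation.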
Summary Table: Radar Emilia Dataset Characteristics
| Aspect | Specification/Range | Notes |
|---|---|---|
| Spatial Extent | 71,172 km², 290×373 km, 1 km-grid | Po valley, N. Apennines, coasts |
| Temporal Coverage | Jan 2015–Dec 2020, every 5 min | Full convective & stratiform seasons |
| Channels | Single-channel reflectivity (dBZ) | 601 quantized levels, [0, 60] dBZ |
| Training Samples | 149,524 frames (train), 7,869 (val) | + Event-based TTS/FTS test splits |
For model use cases, the data fit a VQGAN–GPT compression/tokenization pipeline. Random 192×192 crops, rotations, and flips are applied on-the-fly for augmentation. Reflectivity is quantized in 0.1 dBZ steps producing 601 possible values per pixel, then mapped to tokens for autoregressive modeling. The absence of external predictors ensures an end-to-end, data-driven nowcasting pipeline. Rain-rate conversion uses the standard Marshall–Palmer relation $Z = aR^{b}$ ($a = 200$, $b = 1.6$). Handling of missing data is implicit: frames with missing data are omitted, resulting in contiguous, complete, 5-minute sequences throughout (Franch et al., 2024).
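A minimal sketch of the pixel-level quantization and the rain-rate conversion described above follows. It is illustrative rather than the GPTCast implementation: the function names are hypothetical, out-of-range values are assumed to be clipped, and the coefficients a = 200 and b = 1.6 are the commonly cited Marshall–Palmer values. The VQGAN tokenization step itself is not shown.

```python
DBZ_MIN, DBZ_MAX, STEP = 0.0, 60.0, 0.1
N_LEVELS = round((DBZ_MAX - DBZ_MIN) / STEP) + 1  # 601 levels for [0, 60] dBZ


def quantize_dbz(dbz: float) -> int:
    """Map reflectivity in dBZ to an integer level in [0, 600].

    Clipping out-of-range values is an assumption of this sketch.
    """
    clipped = min(max(dbz, DBZ_MIN), DBZ_MAX)
    return round((clipped - DBZ_MIN) / STEP)


def dequantize(level: int) -> float:
    """Recover the dBZ value represented by a quantization level."""
    return DBZ_MIN + level * STEP


def rain_rate_mm_h(dbz: float, a: float = 200.0, b: float = 1.6) -> float:
    """Invert Z = a * R**b, where Z is linear reflectivity (mm^6 m^-3)."""
    z_linear = 10.0 ** (dbz / 10.0)  # dBZ is 10 * log10(Z)
    return (z_linear / a) ** (1.0 / b)


if __name__ == "__main__":
    for dbz in (10.0, 23.0, 40.0, 55.0):
        level = quantize_dbz(dbz)
        print(dbz, level, round(rain_rate_mm_h(dequantize(level)), 2))
```

With these coefficients, 23 dBZ corresponds to about 1 mm/h and 40 dBZ to roughly 11.5 mm/h, which illustrates how strongly nonlinear the mapping from reflectivity to rain rate is.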
2. Emilia Speech Dataset and the Emilia-Pipe Workflow
The Emilia speech dataset family is an extensive collection for multilingual, spontaneous speech synthesis. The core resource (Emilia) comprises over 101,000 hours across six languages: Chinese (49.9k), English (46.8k), German (1.6k), French (1.4k), Japanese (1.7k), and Korean (0.2k). Spontaneous and conversational segments dominate, sourced from global podcasts, interviews, talk shows, and related domains. The dataset is constructed using the open-source Emilia-Pipe, a six-stage pipeline designed for high-throughput, high-fidelity filtering and annotation (He et al., 27 Jan 2025, He et al., 2024).
Pipeline Stages and Data Quality Steps:
- Standardization: Converts all input to 24 kHz, mono, 16-bit WAV, normalized to –20 dBFS; waveform scaling to [–1, 1].
- Source Separation: UVR-MDX-Net removes music/noise (SDR ≈ 11.15 dB).
- Speaker Diarization: PyAnnote 3.1 segments into single-speaker regions (speaker anonymized).
- Fine-Grained Segmentation (VAD): Silero-VAD (ROC-AUC ≈ 0.99) divides into 3–30 s clips.
- ASR Transcription: Whisper-Medium via WhisperX with batched inference; the VAD segments from the previous stage are reused rather than recomputed; outputs a time-aligned transcript.
- Filtering:
  - Language ID: must reach ≥80% Whisper confidence on one of the six languages.
  - Quality: DNSMOS P.835 OVRL ≥3.0.
  - Speaking Rate: Outliers removed via IQR-based per-segment phone duration.
Post-filtering, on a 600 h test slice, only ~29% is retained with DNSMOS improving from 2.50 to 3.26; mean utterance length post-filter is ≈9 seconds (He et al., 27 Jan 2025, He et al., 2024).
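The three filtering criteria listed above can be sketched as follows. This is not the Emilia-Pipe code: the `Segment` record and its field names are hypothetical, and the 1.5 × IQR fence and the choice to compute it after the other two filters are assumptions, since the text only states that outliers are removed with an IQR-based rule.

```python
from dataclasses import dataclass
from statistics import quantiles

LANGUAGES = {"zh", "en", "de", "fr", "ja", "ko"}


@dataclass
class Segment:  # hypothetical record; field names are illustrative
    language: str
    lang_confidence: float      # language-ID confidence in [0, 1]
    dnsmos_ovrl: float          # DNSMOS P.835 overall score
    mean_phone_duration: float  # seconds per phone


def iqr_bounds(values, k=1.5):
    """Return (low, high) fences at k * IQR beyond the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def filter_segments(segments, min_conf=0.80, min_dnsmos=3.0):
    """Apply language, quality, and speaking-rate filters in that order."""
    kept = [
        s for s in segments
        if s.language in LANGUAGES
        and s.lang_confidence >= min_conf
        and s.dnsmos_ovrl >= min_dnsmos
    ]
    if len(kept) < 2:  # too few points to estimate quartiles
        return kept
    low, high = iqr_bounds([s.mean_phone_duration for s in kept])
    return [s for s in kept if low <= s.mean_phone_duration <= high]
```

Applying the threshold filters first keeps low-quality audio from distorting the quartiles used for the speaking-rate fence; a pipeline that computes the fence per language or per source would need a grouping step before `iqr_bounds`.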
Metadata and Partitioning:
Segments are released with aligned file path, speaker cluster ID, timestamps, language label with confidence, DNSMOS, and transcript; no manual verification. The dataset is available under CC-BY-NC-4.0 (Emilia) and CC-BY-4.0 (Emilia-Large, 216k h, which adds YODAS2 and boosts data for lower-resource languages) (He et al., 27 Jan 2025).
Comparative Table: Emilia and Selected Speech Corpora
| Dataset | Source Style | Hours | Langs | Processing Pipeline |
|---|---|---|---|---|
| Emilia | In-the-wild, mixed | 101,654 | En/Zh/De/Fr/Ja/Ko | Open-source, 6-stage (Emilia-Pipe) |
| MLS | Audiobook | 51,000 | 8 | None reported |
| GigaSpeech | In-the-wild (En) | 10,000 | 1 | N/A |
Objective (WER, S-SIM, FSD) and subjective (SMOS, CMOS) evaluations demonstrate that models trained on Emilia outperform those trained on audiobook corpora for spontaneous, real-world TTS tasks; the dataset’s diversity in timbre, language, and style is validated via acoustic and semantic PCA spread (He et al., 27 Jan 2025, He et al., 2024).
3. Emilia-Romagna Subset within LMHLD (Remote Sensing Landslides)
The Emilia-Romagna subset of the LMHLD is a benchmark for landslide detection in moderate-resolution satellite imagery (Liu et al., 27 Feb 2025). It comprises a region-specific selection of 10 m resolution Sentinel-2 L2A tiles covering the hills–plains transition after the May 2023 floods (acquisitions: August 2023), mosaicking 2–3 tiles to span the area. Each sample patch (128×128 px, 1.28×1.28 km, ≈1.64 km²) is labeled with a binary segmentation mask and organized into train/val/test directories. The labeling protocol references the ISPRA-IFFI inventory, supplements it with DEM-derived terrain/slope, and applies expert cross-validation, reaching >95% inter-annotator agreement.
Landslide Inventory Statistics (per polygon):
| Statistic | Value/Distribution |
|---|---|
| RPC (pixel count) | min = 1, max = 4,735, Q₁ = 18, median = 33, Q₃ = 67 |
Patches must include ≥10% landslide pixels to qualify. All image patches and masks are released as GeoTIFF (images: 4-band, uint16 reflectance; masks: uint8 0/1), georeferenced to UTM Zone 32N (WGS 84); patch-specific metadata and region-wide shapefiles are provided. The data are accessible on Zenodo under a CC BY 4.0 license (Liu et al., 27 Feb 2025).
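The patch geometry and the ≥10% qualification rule can be illustrated with a short sketch. It works on plain nested lists rather than GeoTIFF files, so no raster library is needed; the helper names and the synthetic masks are hypothetical and only serve to exercise the threshold.

```python
PIXEL_SIZE_M = 10.0            # Sentinel-2 10 m bands
PATCH_SIZE_PX = 128
MIN_LANDSLIDE_FRACTION = 0.10  # threshold quoted in the text


def patch_side_km() -> float:
    """Ground footprint of one patch side in kilometres."""
    return PATCH_SIZE_PX * PIXEL_SIZE_M / 1000.0


def landslide_fraction(mask) -> float:
    """Share of pixels labeled 1 in a binary mask (list of rows of 0/1)."""
    total = sum(len(row) for row in mask)
    positive = sum(sum(row) for row in mask)
    return positive / total if total else 0.0


def qualifies(mask, min_fraction: float = MIN_LANDSLIDE_FRACTION) -> bool:
    """True if the patch meets the minimum landslide coverage."""
    return landslide_fraction(mask) >= min_fraction


def synthetic_mask(block: int):
    """Hypothetical test mask: a block x block landslide in the top-left corner."""
    return [
        [1 if r < block and c < block else 0 for c in range(PATCH_SIZE_PX)]
        for r in range(PATCH_SIZE_PX)
    ]


if __name__ == "__main__":
    print(patch_side_km(), round(patch_side_km() ** 2, 2))  # side in km, area in km^2
    print(qualifies(synthetic_mask(41)), qualifies(synthetic_mask(40)))
```

At 10 m resolution a 128-pixel side spans 1.28 km, so one patch covers about 1.64 km², and the 10% rule corresponds to at least 1,639 of the 16,384 pixels being labeled as landslide.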
4. Benchmarking and Downstream Application
Precipitation Nowcasting: The Emilia radar dataset enables fully data-driven ensemble precipitation forecasting using GPT-style autoregression over VQGAN-tokenized latent representations. Testing on manually selected extreme convective and stratiform events validates robust uncertainty quantification and accuracy advantages over classical extrapolation (Franch et al., 2024).
Speech Synthesis and Analysis: Emilia and Emilia-Large afford unprecedented scale for TTS, speaker adaptation, and cross-lingual synthesis. Model scaling ablation experiments reveal that most performance gains occur from 5k to 100k h, with diminishing returns thereafter—a plausible implication is that 0.5–1B parameter TTS models in practice saturate at ≈100k h for English-only synthesis, and that scaling further mostly benefits under-resourced languages or cross-lingual transfer (He et al., 27 Jan 2025).
Landslide Detection: Emilia-Romagna patches are benchmarked with seven U-Net family models. The best-balanced performance is achieved by Attention U-Net (F1 = 0.80). Dense U-Net and U-Net++ also perform well, while Res U-Net achieves the highest precision at the cost of recall. The challenging landscape, with mixed agricultural patterns and fine-grained slides, exposes the limitations of purely spectral discriminants (Liu et al., 27 Feb 2025).
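The precision/recall trade-off mentioned above is easiest to see through the F1 formula, which is the harmonic mean of the two. The sketch below uses made-up precision and recall values purely for illustration; they are not results from the benchmark.

```python
def precision_recall(tp: int, fp: int, fn: int):
    """Pixel-wise precision and recall from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative values only (not taken from the LMHLD benchmark):
balanced = f1_score(0.80, 0.80)  # balanced model
skewed = f1_score(0.95, 0.60)    # high precision, low recall

if __name__ == "__main__":
    print(round(balanced, 3), round(skewed, 3))
```

Because the harmonic mean is pulled toward the smaller of the two values, a model that maximizes precision while missing many landslide pixels can score a lower F1 than a balanced one.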
5. Licensing, Quality, and Availability
All three datasets emphasize open science and rigorous QC protocols. The meteorological and landslide subsets are available under Creative Commons Attribution licenses, supporting reproducible research and downstream adaptation. The speech data are licensed per variant (CC-BY-NC-4.0 for Emilia, CC-BY-4.0 for Emilia-Large) and come with open-source processing code to facilitate pipeline extension.
Key quality metrics are transparently documented:
- Radar: Strict ARPAE QC protocol, frame selection on non-zero precipitation.
- Speech: DNSMOS P.835 OVRL ≥3.0 post-filtering, per-segment phone duration outlier removal, and robust diarization/ASR.
- Landslides: Cross-checked polygon annotation, >95% inter-annotator agreement, and exclusion of ambiguous patches.
These protocols are intended to make each Emilia variant a reliable benchmark for generalization in its domain.
6. Significance and Future Directions
The “Emilia Dataset” family exemplifies a shift in open data practices, away from curated, homogeneous corpora and toward heterogeneous, event-rich, large-scale resources with high-quality programmatic filtering and annotation. For atmospheric modeling, it enables modern generative architectures to close the gap with operational ensemble guidance. For multilingual, spontaneous speech, it makes possible the synthesis and analysis of utterances with real-world speaker diversity and nuanced prosodic variability. For earth observation, it brings supervised learning closer to challenging continental-scale, disaster-mapping tasks grounded in expert-reviewed, fine-resolution remote sensing.
A plausible implication is that the paradigms and QC/annotation strategies used by Emilia and its associated pipelines will inform open dataset construction in other domains characterized by vast raw data availability, with reproducible pipelines partly superseding static “snapshot” datasets. Continual updates—such as incorporating new radar years, new speech source languages, or new remote-sensing events—are technically viable extensions and likely research directions for each Emilia data stream.