CodaBench: GW Detection Dataset
- The CodaBench dataset is a large-scale, labeled resource that simulates gravitational-wave events (BBH mergers and SGLF bursts) in realistic LIGO strain data.
- It provides structured training, validation, and test splits with controlled anomaly rates to rigorously evaluate detection pipelines.
- The dataset employs an STFT-based preprocessing pipeline without additional filtering, supporting template-free anomaly detection methods.
The CodaBench dataset provides a large-scale, labeled resource for research on unsupervised, template-free detection of gravitational-wave (GW) events in Advanced LIGO strain data. Constructed to support the development of anomaly-detection methods for astrophysical event localization in realistic interferometer backgrounds, it consists of cleaned time series from the Hanford (H1) and Livingston (L1) detectors with systematically injected binary black hole (BBH) coalescences and sine-Gaussian low-frequency (SGLF) burst signals. Each datum covers a synchronized 50 ms window from both sites, with explicit train, validation, and test splits and precise control over anomaly rates. The dataset thereby enables rigorous evaluation of detection algorithms in a strongly imbalanced regime that closely matches real GW search operations (Ratner, 17 Jan 2026).
1. Data Origin, Background, and Signal Simulation
CodaBench draws its background data from real LIGO strain time series acquired at Hanford and Livingston observatories (H1: Washington, L1: Louisiana), separated by 3,000 km. The GW-sensitive strain channels are pre-cleaned and segmented into discrete 50 ms windows, preserving the typical GW band (up to ≈1.6 kHz) by zeroing frequency content above this threshold. No whitening or bandpass filtering is conducted beyond this point.
Synthetic transient events are injected atop this background, divided into two major populations:
- Binary black hole (BBH) coalescences: 100,000 instances. Waveforms are generated from general-relativistic signal templates with randomized intrinsic (masses, spins) and extrinsic (orientation) parameters, then amplitude-scaled to a target signal-to-noise ratio (SNR).
- Sine-Gaussian low-frequency bursts (SGLF): 100,000 instances. Each SGLF follows
h(t) = A · exp(−(t − t₀)² / (2τ²)) · sin(2π f₀ (t − t₀)),
where the central frequency f₀ (in Hz) and the decay time τ are drawn from ranges fixed by the challenge.
All injections are applied time-aligned to both detectors: when present, the SGLF or BBH signal appears in the same 50 ms window at each site, within a 10 ms inter-site coincidence.
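As an illustration of the injection scheme above, the following sketch generates a sine-Gaussian burst and adds it to both detector streams with a sub-10 ms arrival-time offset. The sampling rate, burst parameters, and amplitude are assumptions for illustration only; the challenge files define the real values.

```python
import numpy as np

FS = 16_384        # assumed sampling rate in Hz (not stated in the challenge text)
WINDOW_S = 0.050   # 50 ms analysis window

def sine_gaussian(n, t0, f0, tau, amp):
    """h(t) = A * exp(-(t - t0)^2 / (2 tau^2)) * sin(2 pi f0 (t - t0))."""
    t = np.arange(n) / FS
    return amp * np.exp(-((t - t0) ** 2) / (2.0 * tau ** 2)) * np.sin(2.0 * np.pi * f0 * (t - t0))

def inject_pair(bg_h1, bg_l1, f0, tau, amp, rng):
    """Add the same burst to both streams with a |dt| <= 10 ms coincidence offset."""
    t0 = WINDOW_S / 2                      # centre the burst in the H1 window
    dt = rng.uniform(-0.010, 0.010)        # inter-site arrival-time offset
    h1 = bg_h1 + sine_gaussian(len(bg_h1), t0, f0, tau, amp)
    l1 = bg_l1 + sine_gaussian(len(bg_l1), t0 + dt, f0, tau, amp)
    return h1, l1

rng = np.random.default_rng(0)
bg = rng.normal(0.0, 1e-21, int(FS * WINDOW_S))
h1, l1 = inject_pair(bg, bg.copy(), f0=200.0, tau=0.005, amp=5e-21, rng=rng)
```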
2. Dataset Structure, Splits, and Label Conventions
The challenge dataset comprises three disjoint sets, each consisting of 100,000 examples:
- Pure background (no injection)
- BBH injection only
- SGLF injection only
Each window yields synchronized 50 ms time series from H1 and L1. For experiments, train/validation splits follow an 80/20 rule within each class. The test set combines 50,000 background windows with 9,900 held-out BBH or SGLF injections, a nominal anomaly rate of ≈16%; final evaluation resamples this mix to operational false-alarm regimes (anomaly rates down to 0.2% or 0.02%).
A binary label designates whether an astrophysical injection (BBH or SGLF) is present in the window at either detector; a label of zero denotes pure background.
Key statistics for reference anomaly rates are:
- Typical training regime: 100 signals + 50,000 backgrounds (anomaly rate ≈0.2%)
- Alternative regime: 100 signals + 500,000 resampled backgrounds (≈0.02% anomaly)
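A minimal sketch of assembling the typical training regime, assuming windows are addressed by integer indices (the index layout here is hypothetical, not the challenge's actual file format):

```python
import numpy as np

def make_training_split(signal_idx, background_idx, n_sig=100, n_bg=50_000, seed=0):
    """Subsample window indices to a target anomaly rate (100 / 50,100 ≈ 0.2%)."""
    rng = np.random.default_rng(seed)
    sig = rng.choice(signal_idx, size=n_sig, replace=False)
    bg = rng.choice(background_idx, size=n_bg, replace=n_bg > len(background_idx))
    idx = np.r_[sig, bg]
    labels = np.r_[np.ones(n_sig, dtype=int), np.zeros(n_bg, dtype=int)]
    order = rng.permutation(len(idx))       # shuffle signals into the background
    return idx[order], labels[order]

idx, y = make_training_split(np.arange(100_000), np.arange(100_000))
```

Passing `replace=True` when `n_bg` exceeds the background pool implements the 500,000-window resampled regime with the same function.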
3. Preprocessing Pipeline and Data Representations
Raw 50 ms strain sequences (per site) undergo a short-time Fourier transform (STFT):
X[m, k] = Σₙ x[n] · w[n − mH] · e^(−i 2π k n / N),
with window length N, hop size H, and a 96-sample overlap (N − H = 96). Only frequencies up to 1.6 kHz are retained, yielding a time-frequency representation that is consumed directly by the neural networks.
No additional pre-processing steps (e.g., whitening, normalization) are mandated beyond those performed by the original CodaBench pipeline.
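The preprocessing above can be sketched in NumPy. The sampling rate and window length N are assumptions (only the 96-sample overlap is stated in the challenge description), and a Hann window stands in for whatever taper the official pipeline uses:

```python
import numpy as np

FS = 16_384          # assumed sampling rate; the challenge files fix the true value
NPERSEG = 128        # assumed window length N; only the 96-sample overlap is stated
HOP = NPERSEG - 96   # hop H chosen so that N - H = 96

def stft_lowpass(x, fs=FS, nperseg=NPERSEG, hop=HOP, fmax=1600.0):
    """Hann-windowed STFT of one 50 ms strain window, keeping f <= 1.6 kHz."""
    win = np.hanning(nperseg)
    starts = range(0, len(x) - nperseg + 1, hop)
    frames = np.stack([x[s:s + nperseg] * win for s in starts])
    spec = np.fft.rfft(frames, axis=1)            # one time frame per row
    freqs = np.fft.rfftfreq(nperseg, d=1.0 / fs)
    keep = freqs <= fmax                          # discard content above 1.6 kHz
    return freqs[keep], np.abs(spec[:, keep])     # magnitude time-frequency map

x = np.random.default_rng(1).normal(size=int(0.050 * FS))
freqs, tfmap = stft_lowpass(x)
```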
4. Injection Parameterization and Signal-to-Noise Ratio
Each injection (BBH or SGLF) is parameterized by its intrinsic waveform physics. The per-event SNR in the training sets is not disclosed; the evaluation set, however, spans a range of injected SNRs. The matched-filter SNR is calculated as
ρ² = 4 ∫₀^∞ |h̃(f)|² / Sₙ(f) df,
where h̃(f) is the frequency-domain signal and Sₙ(f) is the one-sided noise power spectral density.
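The matched-filter SNR integral ρ² = 4 ∫ |h̃(f)|²/Sₙ(f) df can be evaluated numerically as a discrete sum over positive frequencies; the flat (white) PSD used below is purely illustrative:

```python
import numpy as np

def matched_filter_snr(h, fs, psd):
    """rho^2 = 4 * sum_f |h~(f)|^2 / S_n(f) * df over positive frequencies."""
    n = len(h)
    h_f = np.fft.rfft(h) / fs                  # approximate continuous FT (multiply by dt)
    f = np.fft.rfftfreq(n, d=1.0 / fs)
    df = fs / n                                # frequency resolution
    return np.sqrt(4.0 * np.sum(np.abs(h_f) ** 2 / psd(f)) * df)

# toy check: 200 Hz sinusoid against a flat one-sided PSD
fs = 4096
t = np.arange(fs) / fs                         # 1 s of data
h = 1e-21 * np.sin(2.0 * np.pi * 200.0 * t)
snr = matched_filter_snr(h, fs, psd=lambda f: np.full_like(f, 1e-46))
```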
Test protocols stratify model performance by recall at fixed false-alarm rate (FAR), typically 1 event/year (≈6.3 × 10⁸ 50 ms windows/year), and as a function of injected SNR.
5. Challenge Evaluation Protocol and Recommended Usage
The principal metric is the area under the precision–recall curve (PR-AUC) at defined anomaly rates, with additional focus on “Recall@FAR=1/yr” (the recall for events at or below one false alarm per year) as a function of SNR.
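A minimal sketch of the Recall@FAR statistic, assuming per-window anomaly scores and binary labels; the toy windows-per-year scale and Gaussian score distributions are for illustration only:

```python
import numpy as np

def recall_at_far(scores, labels, far_per_year=1.0, windows_per_year=6.3e8):
    """Recall at the score threshold that admits <= far_per_year background triggers."""
    bg = np.sort(scores[labels == 0])[::-1]      # background scores, descending
    fpr = far_per_year / windows_per_year        # allowed background false-positive rate
    k = int(np.floor(fpr * len(bg)))             # background windows allowed above threshold
    thr = bg[k] if k < len(bg) else -np.inf
    return float(np.mean(scores[labels == 1] > thr))

rng = np.random.default_rng(2)
scores = np.r_[rng.normal(0.0, 1.0, 1000), rng.normal(5.0, 1.0, 50)]
labels = np.r_[np.zeros(1000, dtype=int), np.ones(50, dtype=int)]
r = recall_at_far(scores, labels, windows_per_year=1000)   # toy scale for illustration
```

In a real evaluation, `windows_per_year` would be the full ≈6.3 × 10⁸, so the threshold sits far into the background tail, which is why background resampling matters.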
Recommended protocol:
- Train models on the supplied 80% split (or resampled background+synthetic mix)
- Validate on the matched 20% hold-out subset
- For final evaluation, test on the challenge set, adjusting for real-world population imbalance
Cross-validation schemes (e.g., k-fold) are permitted, and “trigger” systems (such as excess-power filtering that pre-selects the most anomalous 0.4% of windows) are advised to keep unsupervised learning computationally tractable.
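Such an excess-power trigger can be sketched as ranking windows by total power and keeping the top 0.4%; the band-limiting step a real pipeline would apply is omitted for brevity:

```python
import numpy as np

def excess_power_select(windows, keep_frac=0.004):
    """Rank 50 ms windows by total power and keep the most anomalous fraction."""
    power = np.sum(windows ** 2, axis=1)           # per-window excess-power statistic
    n_keep = max(1, int(len(windows) * keep_frac))
    return np.argsort(power)[::-1][:n_keep]        # indices of the loudest windows

rng = np.random.default_rng(3)
w = rng.normal(0.0, 1.0, (10_000, 819))            # 10k toy windows of 819 samples
w[7] += 3.0                                        # plant one loud window
sel = excess_power_select(w)
```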
No pure “background-only” training set is required for anomaly detection: the actual challenge construction, with realistic LIGO noise, implicitly defines the null distribution.
6. Application Domains and Research Utility
CodaBench explicitly serves the development and benchmarking of unsupervised, template-free detection pipelines for GW astronomy, addressing the detection of sources without accurate waveform models (e.g., core-collapse supernovae, glitches, cosmic strings). It is well suited to scenarios where supervised templates are unavailable or insufficient.
The dataset is optimized for the evaluation of statistical anomaly detectors—including neural coincidence methods exploiting event overlap across spatially separated detectors, as well as frequency-domain interpretable architectures. Established metric protocols and pass/fail annotation on highly imbalanced, real background enable stringent comparability across methods.
7. Access and Compliance
Access to the CodaBench challenge data is provided to registered participants through the relevant competition portal. All experimental usage must comply with the data distribution and ethical guidelines stipulated by the challenge organizers. Detailed description, split conventions, and instructions for programmatic download and evaluation scripting are included in the released materials (Ratner, 17 Jan 2026).