
CodaBench: GW Detection Dataset

Updated 24 January 2026
  • CodaBench Dataset is a large-scale, labeled resource that embeds simulated gravitational-wave events (BBH and SGLF) in realistic LIGO strain data.
  • It provides structured training, validation, and test splits with controlled anomaly rates to rigorously evaluate detection pipelines.
  • The dataset employs an STFT-based preprocessing pipeline without additional filtering, supporting template-free anomaly detection methods.

CodaBench Dataset provides a large-scale, labeled resource for research on unsupervised and template-free detection of gravitational wave (GW) events in Advanced LIGO strain data. Constructed to support the development of anomaly detection methods for astrophysical event localization in realistic interferometer backgrounds, it consists of cleaned time series from the Hanford (H1) and Livingston (L1) detectors with systematically injected binary black hole (BBH) coalescences and sine-Gaussian low-frequency burst (SGLF) signals. Each datum covers a synchronized 50 ms window from both sites, with explicit train, validation, and test splits and precise control over anomaly rates. This dataset facilitates rigorous evaluation of detection algorithms in a strongly imbalanced regime, closely matching real GW search operational scenarios (Ratner, 17 Jan 2026).

1. Data Origin, Background, and Signal Simulation

CodaBench draws its background data from real LIGO strain time series acquired at Hanford and Livingston observatories (H1: Washington, L1: Louisiana), separated by 3,000 km. The GW-sensitive strain channels are pre-cleaned and segmented into discrete 50 ms windows, preserving the typical GW band (up to ≈1.6 kHz) by zeroing frequency content above this threshold. No whitening or bandpass filtering is conducted beyond this point.
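The band-limiting step described above can be sketched as a hard cutoff in the frequency domain. The sampling rate `fs = 4096` Hz is an assumption for illustration; the source states only the ≈1.6 kHz cutoff.

```python
import numpy as np

def band_limit(x, fs=4096, f_cut=1600.0):
    """Zero all frequency content above f_cut, as in the dataset's
    cleaning step. fs is an assumed sampling rate (not stated in the
    source description)."""
    X = np.fft.rfft(x)
    f = np.fft.rfftfreq(len(x), d=1.0 / fs)
    X[f > f_cut] = 0.0                      # hard spectral cutoff
    return np.fft.irfft(X, n=len(x))

# Example: clean one segment of simulated strain
x = np.random.default_rng(0).standard_normal(2048)
y = band_limit(x)
```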

Synthetic transient events are injected atop this background, divided into two major populations:

  • Binary black hole coalescences (BBH): 100,000 instances. Waveforms are generated from general-relativistic signal templates with randomized intrinsic (masses, spins) and extrinsic (orientation) parameters, amplitude-scaled to match a defined signal-to-noise ratio (SNR).
  • Sine-Gaussian Low-Frequency Bursts (SGLF): 100,000 instances. Each SGLF follows

h(t) = A\,\exp[-(2\pi f_0 t)^2 / Q^2]\,\sin(2\pi f_0 t + \phi)

where f_0 ∈ [200, 800] Hz and Q ∈ [8, 20].
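A sketch of drawing one SGLF injection from the stated parameter ranges. The sampling rate (4096 Hz) and the centring of the burst in the 50 ms window are illustrative assumptions.

```python
import numpy as np

def sglf_waveform(f0, q, amplitude=1.0, phi=0.0, fs=4096, duration=0.05):
    """Sine-Gaussian low-frequency burst:
        h(t) = A * exp[-(2*pi*f0*t)^2 / Q^2] * sin(2*pi*f0*t + phi)
    fs and the window-centred time origin are assumptions; the source
    fixes only the f0 and Q ranges."""
    n = int(fs * duration)
    t = (np.arange(n) - n // 2) / fs          # time relative to window centre
    envelope = np.exp(-((2 * np.pi * f0 * t) ** 2) / q ** 2)
    return amplitude * envelope * np.sin(2 * np.pi * f0 * t + phi)

# Draw parameters from the documented ranges: f0 in [200, 800] Hz, Q in [8, 20]
rng = np.random.default_rng(0)
h = sglf_waveform(f0=rng.uniform(200, 800), q=rng.uniform(8, 20))
```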

All injections are applied time-coherently to both detectors, with the BBH or SGLF signal present in the same 50 ms window at both sites (within a 10 ms coincidence tolerance).

2. Dataset Structure, Splits, and Label Conventions

The challenge dataset comprises three disjoint sets, each consisting of 100,000 examples:

  • Pure background (no injection)
  • BBH injection only
  • SGLF injection only

Each window yields synchronized 50 ms time series from H1 and L1. For experiments, train/validation splits follow an 80/20 rule within each class. The test set combines all 50,000 background windows with 9,900 held-out BBH or SGLF injections, giving a nominal anomaly rate of ≈16%; the official evaluation resamples this mixture to reach operational anomaly rates (down to 0.2% or 0.02%) for false-alarm rate computation.

A binary label y = 1 designates the presence of an astrophysical injection (BBH or SGLF) within a window at either detector; y = 0 signals pure background.

Key statistics for reference anomaly rates are:

  • Typical training regime: 100 signals + 50,000 backgrounds (anomaly rate a_rate = 0.2%)
  • Alternative regime: 100 signals + 500,000 resampled backgrounds (a_rate = 0.02%)
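For reference, the quoted anomaly rates follow directly from the signal/background counts:

```python
def anomaly_rate(n_signals, n_backgrounds):
    """Fraction of windows that carry an injection."""
    return n_signals / (n_signals + n_backgrounds)

# Reference regimes from the dataset description
typical = anomaly_rate(100, 50_000)    # ~0.2%
sparse = anomaly_rate(100, 500_000)    # ~0.02%
```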

3. Preprocessing Pipeline and Data Representations

Raw 50 ms strain sequences (per site) undergo a short-time Fourier transform (STFT):

X_{k,m} = \sum_{n=0}^{N-1} x[n]\,w[n - mR]\,e^{-2\pi i k n / N}

with window length N = 128, hop size R = 32, and 96-sample overlap. Only frequencies up to 1.6 kHz are retained, yielding a time-frequency representation directly consumed by neural networks.
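A sketch of this STFT front end using SciPy. The sampling rate `fs = 4096` Hz is an assumption; the source fixes only N = 128, R = 32, and the 1.6 kHz cutoff.

```python
import numpy as np
from scipy.signal import stft

def strain_stft(x, fs=4096, f_max=1600.0):
    """STFT with N = 128, hop R = 32 (96-sample overlap), keeping only
    frequency bins up to f_max. fs is an assumed sampling rate."""
    f, t, X = stft(x, fs=fs, nperseg=128, noverlap=96, boundary=None)
    keep = f <= f_max                       # retain bins up to 1.6 kHz
    return f[keep], t, X[keep]              # time-frequency map for the network

# One 50 ms window of simulated strain
x = np.random.default_rng(1).standard_normal(int(0.05 * 4096))
f, t, X = strain_stft(x)
```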

No additional pre-processing steps (e.g., whitening, normalization) are mandated beyond those performed by the original CodaBench pipeline.

4. Injection Parameterization and Signal-to-Noise Ratio

Each injection (BBH or SGLF) is parameterized by its intrinsic waveform physics. The precise per-event SNR in the training sets is not disclosed; however, the evaluation set includes injected SNRs spanning approximately [5, 30]. The matched-filter SNR is calculated as:

\rho = \sqrt{(h|h)}, \qquad (a|b) \equiv 4\,\Re \int_0^{\infty} \frac{\tilde a^*(f)\,\tilde b(f)}{S_n(f)}\,df

where S_n(f) is the one-sided noise power spectral density.
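A discretised sketch of the matched-filter SNR above, assuming the PSD is sampled on the rFFT frequency grid; the normalisation conventions are illustrative, not the challenge code.

```python
import numpy as np

def matched_filter_snr(h_td, psd, fs):
    """rho = sqrt((h|h)) with (a|b) = 4 Re ∫ a*(f) b(f) / Sn(f) df,
    discretised over the positive-frequency rFFT bins.

    psd: one-sided noise PSD sampled on np.fft.rfftfreq(len(h_td), 1/fs).
    This is an illustrative discretisation (assumption)."""
    h_f = np.fft.rfft(h_td) / fs                 # continuous-FT convention
    df = fs / len(h_td)                          # frequency resolution
    integrand = (np.abs(h_f) ** 2) / psd
    return np.sqrt(4.0 * np.real(np.sum(integrand) * df))

# Example: a 100 Hz sinusoid in white noise with flat unit PSD
fs, n = 4096, 4096
t = np.arange(n) / fs
rho = matched_filter_snr(np.sin(2 * np.pi * 100 * t), np.ones(n // 2 + 1), fs)
```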

5. Evaluation Metrics and Recommended Protocol

Test protocols stratify model performance by recall at fixed FAR (false-alarm rate), typically ≤ 1 event/year (for 3 × 10^7 windows/year), and as a function of injected SNR.

The principal metric is the area under the precision–recall curve (PR-AUC) at defined anomaly rates, with additional focus on “Recall@FAR = 1/yr” (the fraction of injected events recovered at a detection threshold yielding at most one false alarm per year) as a function of SNR.
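The Recall@FAR metric can be sketched with plain NumPy. The thresholding convention (set on the background score distribution) is an assumption; `windows_per_year = 3e7` follows the figure quoted above.

```python
import numpy as np

def recall_at_far(scores, labels, far=1.0, windows_per_year=3e7):
    """Recall at a score threshold yielding <= `far` false alarms per year.

    The threshold is set on the background (label 0) score distribution;
    the exact challenge thresholding procedure may differ (assumption)."""
    bg = np.sort(scores[labels == 0])[::-1]             # loudest backgrounds first
    n_allowed = int(far * len(bg) / windows_per_year)   # false alarms allowed here
    thresh = bg[min(n_allowed, len(bg) - 1)]
    return float(np.mean(scores[labels == 1] > thresh))

# Toy example: well-separated scores give perfect recall
rng = np.random.default_rng(0)
scores = np.concatenate([rng.uniform(0.0, 0.5, 1000), rng.uniform(0.9, 1.0, 10)])
labels = np.concatenate([np.zeros(1000), np.ones(10)])
recall = recall_at_far(scores, labels)
```

PR-AUC itself can be obtained from standard libraries (e.g., scikit-learn's `average_precision_score`).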

Recommended protocol:

  • Train models on the supplied 80% split (or resampled background+synthetic mix)
  • Validate on the matched 20% hold-out subset
  • For final evaluation, test on the challenge set, adjusting for real-world population imbalance
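The recommended split above can be sketched at the index level; the class-wise application and the seed are illustrative choices.

```python
import numpy as np

def split_80_20(n, seed=0):
    """Shuffled 80/20 train/validation index split within one class."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    cut = int(0.8 * n)
    return idx[:cut], idx[cut:]

# e.g. splitting the 100,000 background windows
train_idx, val_idx = split_80_20(100_000)
```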

Cross-validation schemes (e.g., k-fold) are permitted, and “trigger” systems (such as excess-power filtering to pre-select the most anomalous 0.4% of windows) are advised to address computational constraints during unsupervised learning.
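A crude excess-power trigger of the kind suggested above might rank windows by total energy and keep the loudest 0.4%. The statistic, window length, and batch size here are illustrative assumptions.

```python
import numpy as np

def excess_power_trigger(windows, keep_frac=0.004):
    """Rank 50 ms windows by total squared strain ('excess power') and
    return the indices of the most anomalous `keep_frac` fraction.
    A minimal sketch of the pre-selection idea; the challenge's actual
    trigger statistic may differ (assumption)."""
    power = np.sum(windows ** 2, axis=1)            # energy per window
    n_keep = max(1, int(keep_frac * len(windows)))
    return np.argsort(power)[::-1][:n_keep]         # loudest windows first

# Example: one loud injected window among 10,000 background windows
rng = np.random.default_rng(2)
batch = rng.standard_normal((10_000, 204))
batch[7] += 5.0                                     # simulated loud injection
selected = excess_power_trigger(batch)
```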

No pure “background-only” training set is required for anomaly detection: the actual challenge construction, with realistic LIGO noise, implicitly defines the null distribution.

6. Application Domains and Research Utility

CodaBench explicitly serves the development and benchmarking of unsupervised, template-free detection pipelines for GW astronomy, addressing the detection of sources without accurate waveform models (e.g., core-collapse supernovae, glitches, cosmic strings). It is well suited to scenarios where supervised templates are unavailable or insufficient.

The dataset is optimized for the evaluation of statistical anomaly detectors—including neural coincidence methods exploiting event overlap across spatially separated detectors, as well as frequency-domain interpretable architectures. Established metric protocols and pass/fail annotation on highly imbalanced, real background enable stringent comparability across methods.

7. Access and Compliance

Access to the CodaBench challenge data is provided to registered participants through the relevant competition portal. All experimental usage must comply with the data distribution and ethical guidelines stipulated by the challenge organizers. Detailed description, split conventions, and instructions for programmatic download and evaluation scripting are included in the released materials (Ratner, 17 Jan 2026).
