GCC-PHAT Data Augmentation for SSL

Updated 2 February 2026
  • GDA is a feature-level augmentation strategy that generates synthetic SSL samples by shifting and scaling the dominant GCC-PHAT peaks to rebalance underrepresented classes.
  • The method extracts key peak statistics—positions and amplitudes—from cross-correlation signals to maintain realistic DoA characteristics in augmented data.
  • Empirical results show that GDA, especially when combined with ADIR, enhances SSL accuracy (up to 89%) and reduces mean absolute error, validating its effectiveness in incremental learning.

GCC-PHAT-based Data Augmentation (GDA) is a technique designed to address intra-task class imbalance in sound source localization (SSL) by generating synthetic training features derived from the peak statistics of the Generalized Cross-Correlation with Phase Transform (GCC-PHAT) input domain. GDA operates at the feature level, utilizing the statistical properties of the dominant GCC-PHAT peaks to synthesize new examples for underrepresented (tail) classes, thereby ameliorating the long-tailed direction-of-arrival (DoA) distribution that commonly impairs SSL performance in real-world, incremental learning scenarios (Fan et al., 26 Jan 2026).

1. Mathematical Formulation of GCC-PHAT in SSL

GCC-PHAT transforms the time-domain microphone signals x(t) and y(t) into frequency-domain representations X(f) and Y(f). The core feature, the GCC-PHAT function,

R_{xy}(\tau) = \int_{-\infty}^{\infty} \frac{X(f)\,Y^*(f)}{|X(f)\,Y^*(f)|}\, e^{j2\pi f \tau}\, df,

computes a phase-weighted cross-correlation that accentuates the time lag corresponding to the DoA between paired microphones. For P microphone pairs and D_a delay bins per pair, feature extraction over a frame yields the vector

\mathbf{x} = \left[R^{(1)}_{xy}(\tau_1), \ldots, R^{(P)}_{xy}(\tau_{D_a})\right]^\top \in \mathbb{R}^{P \cdot D_a}.

These features are the canonical input for downstream incremental SSL models in the described learning pipeline.
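As a concrete illustration, the GCC-PHAT function above can be computed with an FFT-based routine. The sketch below is a minimal, hypothetical implementation; the paper's exact frame length, FFT size, and windowing are not specified here, and the function name and `n_delay_bins` parameter are illustrative.

```python
import numpy as np

def gcc_phat(x, y, n_delay_bins=51):
    """GCC-PHAT cross-correlation between two microphone signals,
    truncated to the central n_delay_bins delay lags.
    Illustrative sketch; framing and FFT sizing are assumptions."""
    n = len(x) + len(y)
    X = np.fft.rfft(x, n=n)
    Y = np.fft.rfft(y, n=n)
    cross = X * np.conj(Y)
    # Phase transform: whiten the cross-spectrum magnitude,
    # keeping only phase (time-delay) information.
    cross /= np.abs(cross) + 1e-12
    r = np.fft.irfft(cross, n=n)
    # Re-centre so that lag 0 sits in the middle of the slice.
    half = n_delay_bins // 2
    return np.concatenate([r[-half:], r[:half + 1]])
```

Stacking the P = 6 per-pair slices of length D_a = 51 would reproduce the 306-dimensional feature vector described above.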

2. Peak Statistic Extraction and Characterization

Empirical analysis demonstrates that for each microphone pair, a single dominant peak in the GCC-PHAT domain conveys the DoA information. For each input sample i and microphone pair k (k = 1, \ldots, P), GDA extracts:

  • Peak position: p_{i,k} = \arg\max_{\tau} R^{(k)}_{xy}(\tau),
  • Peak amplitude: a_{i,k} = \max_{\tau} R^{(k)}_{xy}(\tau).

Aggregate statistics over all samples of class c are computed:

\bar{p}_{c,k} = \frac{1}{N_c}\sum_{i: y_i=c} p_{i,k}, \qquad \bar{a}_{c,k} = \frac{1}{N_c}\sum_{i: y_i=c} a_{i,k},

together with empirical variances. No clustering beyond this dominant peak is necessary, as only the highest peak per segment is relocated during augmentation.
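The per-class peak statistics can be sketched as follows, assuming features arrive as flattened (P · D_a)-dimensional vectors; the function and dictionary key names are illustrative.

```python
import numpy as np

def class_peak_stats(features, labels, num_pairs=6, bins_per_pair=51):
    """Per-class mean and variance of the dominant GCC-PHAT peak
    position and amplitude, per microphone pair.
    features: (N, num_pairs * bins_per_pair) array; labels: (N,) array."""
    slices = features.reshape(len(features), num_pairs, bins_per_pair)
    pos = slices.argmax(axis=2)   # p_{i,k}: peak position per pair
    amp = slices.max(axis=2)      # a_{i,k}: peak amplitude per pair
    stats = {}
    for c in np.unique(labels):
        mask = labels == c
        stats[int(c)] = {
            "mean_pos": pos[mask].mean(axis=0),   # \bar p_{c,k}
            "mean_amp": amp[mask].mean(axis=0),   # \bar a_{c,k}
            "var_pos": pos[mask].var(axis=0),
            "var_amp": amp[mask].var(axis=0),
        }
    return stats
```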

3. GDA Algorithmic Procedure

The augmentation process targets all “tail” classes per task t, namely those with sample count N_c^{(t)} < \alpha M_t, where M_t = \max_{c'} N_{c'}^{(t)} and \alpha = 0.5. For each such class c, K_c = \lceil \alpha M_t - N_c^{(t)} \rceil synthetic examples are generated according to Algorithm 1:

  1. Initialization: Start with \mathbf{x}_{\mathrm{new}} \leftarrow \mathbf{0}^{P \cdot D_a}.
  2. For each microphone pair k:
    • Extract the k-th slice \mathbf{x}_b^{(k)} from a base sample of an abundant class c'.
    • Compute the shift \Delta p_k = \bar{p}_{c,k} - \bar{p}_{c',k}.
    • Cyclically shift \mathbf{x}_b^{(k)} by \Delta p_k samples along the delay axis.
    • Rescale so that \max(\mathbf{x}_b^{(k)}) = \bar{a}_{c,k}.
    • Inject i.i.d. Gaussian noise \mathcal{N}(0, \sigma_n^2) with \sigma_n = 0.05\max(\mathbf{x}_b^{(k)}).
    • Place the result in \mathbf{x}_{\mathrm{new}}.
  3. Repeat for K_c synthetic variants using independent bases.

This strategy ensures synthetic features for tail classes inherit the dominant temporal and amplitude signature of abundant classes, adjusted to the mean DoA characteristics of the target class.
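One synthesis step of the procedure above can be sketched as follows, assuming the per-pair class statistics have already been computed (e.g., as in Section 2). The function name and the explicit `noise_scale` parameter are illustrative; `noise_scale = 0.05` matches σ_n above.

```python
import numpy as np

def synthesize_tail_sample(base, mean_pos_base, mean_pos_tail, mean_amp_tail,
                           num_pairs=6, bins_per_pair=51,
                           noise_scale=0.05, rng=None):
    """One GDA-style synthetic feature vector (sketch of Algorithm 1).
    base: flattened feature vector from an abundant base class;
    mean_pos_base / mean_pos_tail / mean_amp_tail: per-pair class statistics."""
    rng = np.random.default_rng() if rng is None else rng
    slices = base.reshape(num_pairs, bins_per_pair)
    new = np.empty_like(slices, dtype=float)
    for k in range(num_pairs):
        # Δp_k = \bar p_{c,k} - \bar p_{c',k}: move the peak to the tail class.
        shift = int(round(mean_pos_tail[k] - mean_pos_base[k]))
        s = np.roll(slices[k], shift)           # cyclic shift along delay axis
        m = s.max()
        if m > 0:
            s = s * (mean_amp_tail[k] / m)      # peak amplitude -> \bar a_{c,k}
        # Low-variance Gaussian noise, sigma = noise_scale * max amplitude.
        s = s + rng.normal(0.0, noise_scale * np.abs(s).max() + 1e-12,
                           size=s.shape)
        new[k] = s
    return new.ravel()
```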

4. Hyperparameterization and Statistical Rationale

Key hyperparameters are:

  • P = 6 microphone pairs,
  • D_a = 51 delay bins per pair,
  • \alpha = 0.5: target tail-class cardinality reaches 50% of the largest class post-augmentation,
  • \sigma_n = 0.05\max(\cdot): low-variance noise introduces moderate feature diversity.

No secondary amplitude thresholding or multi-peak selection is required. GDA targets exactly the classes whose occupancy falls below the threshold, yielding a more uniform class histogram and ensuring rare DoA labels are sufficiently represented.
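The tail-class detection and per-class synthetic budget K_c described above reduce to a few lines; the function name is illustrative.

```python
import numpy as np
from math import ceil

def tail_classes_and_counts(labels, alpha=0.5):
    """Identify tail classes for a task and the number of synthetic
    samples each needs: K_c = ceil(alpha * M_t - N_c) for N_c < alpha * M_t."""
    classes, counts = np.unique(labels, return_counts=True)
    m_t = counts.max()  # M_t: largest class count in the task
    budget = {}
    for c, n_c in zip(classes, counts):
        if n_c < alpha * m_t:
            budget[int(c)] = ceil(alpha * m_t - n_c)
    return budget
```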

5. Integration into Incremental SSL Training

In each incremental learning task, prior to classifier update:

  • GCC-PHAT features are extracted from task samples.
  • Tail classes are detected by occupancy.
  • For each, K_c synthetic features are generated following Algorithm 1 and labeled identically to real data.
  • The data for task t becomes \tilde{\mathcal{D}}_t = \mathcal{D}_t \cup \{\text{synthetic pairs}\}.

The model architecture consists of a frozen 3-layer MLP feature extractor (fixed after first task) and a task-adaptive linear classifier, updated with the Analytic Dynamic Imbalance Rectifier (ADIR). The cross-entropy objective

\ell(f(\mathbf{x}), \mathbf{z}) = -\sum_{c=1}^{360} \left[ z_c \log \hat{z}_c + (1 - z_c)\log(1 - \hat{z}_c) \right]

is evaluated on the GDA-augmented dataset. No additional regularization is applied. GDA's effect is to increase rare class counts, directly influencing the class-weighting scheme in ADIR (\pi_c = 1/N_c) and reducing global Gini imbalance.
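The objective above is a per-class binary cross-entropy summed over the 360 DoA classes. A minimal NumPy sketch (unweighted, i.e., before ADIR's π_c = 1/N_c reweighting; the function name is illustrative):

```python
import numpy as np

def multilabel_bce(z_hat, z):
    """Binary cross-entropy summed over the class dimension.
    z_hat: predicted probabilities in (0, 1); z: 0/1 target vector."""
    eps = 1e-12  # numerical guard against log(0)
    return -np.sum(z * np.log(z_hat + eps)
                   + (1 - z) * np.log(1 - z_hat + eps), axis=-1)
```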

6. Empirical Impact and Ablation

Ablation studies isolate the contribution of GDA in combination and separately from ADIR. Reported results on the SSLR benchmark are:

  • Baseline (no GDA/ADIR): MAE = 7.5°, ACC = 72.0%, BWT = –17.7
  • GDA only: MAE = 7.4°, ACC = 75.0%, BWT = –15.8
  • ADIR only: MAE = 6.1°, ACC = 82.4%, BWT = +1.4
  • GDA + ADIR: MAE = 5.3°, ACC = 89.0%, BWT = +1.6

This demonstrates that GDA independently yields a 3-point increase in accuracy by mitigating intra-task skew. In synergy with ADIR, the framework delivers substantial gains—accuracy increases to 89% and mean absolute error drops to 5.3°, alongside positive backward transfer (Fan et al., 26 Jan 2026).

7. Context and Significance in Robust SSL

GDA addresses structural challenges in SSL arising from unbalanced DoA class histograms, prevalent in naturalistic, evolving sound field deployments. By operating on analytically defined feature statistics and integrating seamlessly with analytic incremental imbalance rectification (ADIR), GDA enables on-the-fly, low-complexity rebalancing of each task dataset. The method avoids the need for storage of past task exemplars and avoids introducing hand-tuned regularizers or high-variance heuristic augmentations. A plausible implication is streamlined adoption for real-world incrementally learned acoustic localization systems, particularly under non-stationary directional statistics.
