TFGA-Net Model for EEG-Guided Speaker Extraction
- TFGA-Net is an integrated deep learning model that fuses multi-scale temporal-frequency EEG and acoustic features using graph attention mechanisms for brain-controlled speaker extraction.
- It employs adaptive convolutional layers, graph convolutions, and self-attention to effectively model cortical topology and enhance signal-to-distortion ratios.
- The model's superior performance in SI-SDR, STOI, and PESQ metrics highlights its potential for real-time assistive hearing devices and advanced neuro-acoustic applications.
TFGA-Net, or Temporal-Frequency Graph Attention Network, is an integrated deep learning architecture designed for brain-controlled speaker extraction using electroencephalography (EEG) signals. This model leverages the neural activity of listeners to extract the target speaker's voice from complex auditory scenes. TFGA-Net systematically addresses the challenge of mapping common information between EEG and speech by employing multi-scale temporal-frequency feature extraction, graph-based modeling of cortical topology, and advanced fusion mechanisms. The architecture demonstrates consistent improvements over state-of-the-art methods in multiple objective benchmarks for the task of EEG-driven speech separation.
1. Architectural Composition
TFGA-Net is structured into four main modules:
- Speech Encoder: A single-layer 1D convolution with ReLU activation transforms the raw speech mixture $x \in \mathbb{R}^{1 \times T}$ into an embedding $E_s \in \mathbb{R}^{N \times T'}$, with kernel size and stride chosen to downsample the temporal resolution.
- EEG Encoder: The EEG signal is processed in two branches:
- Temporal branch: Five parallel Conv1D layers with exponentially decaying kernel sizes yield multi-scale temporal features $E_t^{(i)}$ for $i = 1, \dots, 5$.
- Frequency branch: Per-channel STFT yields power spectral density (PSD) and differential entropy (DE) features, averaged within the five canonical bands ($\delta$, $\theta$, $\alpha$, $\beta$, $\gamma$) to form the spectral embedding $E_f$.
- Graph Convolution and Self-Attention:
- Temporal and frequency features are structured as graphs: each EEG channel is a node, edges are derived from cortical connectivity, and each view carries a learnable initial adjacency matrix ($A_t$ for the temporal view, $A_f$ for the frequency view).
- Graph convolution is performed per view: $H^{(l+1)} = \sigma\!\left(\tilde{A} H^{(l)} W^{(l)}\right)$, where $\sigma$ denotes BN+ELU and $\tilde{A}$ and $W^{(l)}$ are learned weights (a minimal sketch follows this list).
- Concatenated features from both views are passed through positionally encoded self-attention: $Z = \mathrm{SelfAttn}\!\left(\mathrm{Concat}(H_t, H_f) + \mathrm{PE}\right)$.
- Speaker Extraction Network:
- Fused features guide separation.
- Separation employs MossFormer2, which combines the MossFormer module (local full attention, linearized global attention, convolutional gating) with an RNN-free recurrent module (dilated FSMN, gated convolution).
- Speech Decoder: The masked speech embedding is mapped back to a waveform using a transposed 1D convolution.
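To make the per-view graph convolution concrete, below is a minimal PyTorch sketch of the update $H^{(l+1)} = \sigma(\tilde{A} H^{(l)} W^{(l)})$ with a learnable adjacency. The class name `EEGGraphConv`, the 64-electrode montage, the softmax row-normalization, and the identity initialization are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

class EEGGraphConv(nn.Module):
    """Minimal per-view graph convolution: H' = BN+ELU(A_hat @ H @ W),
    with a learnable adjacency initialized from a connectivity prior."""

    def __init__(self, num_channels: int, in_dim: int, out_dim: int,
                 init_adj: torch.Tensor):
        super().__init__()
        # Learnable adjacency, initialized from the cortical-connectivity prior.
        self.adj = nn.Parameter(init_adj.clone())
        self.weight = nn.Linear(in_dim, out_dim, bias=False)
        self.norm = nn.BatchNorm1d(num_channels)
        self.act = nn.ELU()

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, channels, features) -- one EEG electrode per graph node.
        a = torch.softmax(self.adj, dim=-1)        # row-normalize the adjacency
        h = torch.einsum("cd,bdf->bcf", a, self.weight(h))
        return self.act(self.norm(h))

# Usage: 64 electrodes, 128-d node features per view (assumed sizes).
adj0 = torch.eye(64)                               # placeholder connectivity prior
gcn = EEGGraphConv(num_channels=64, in_dim=128, out_dim=128, init_adj=adj0)
out = gcn(torch.randn(2, 64, 128))                 # -> (2, 64, 128)
```

Running one such layer per view (temporal and frequency) and concatenating the outputs mirrors the two-view design described above.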
2. Temporal-Frequency Feature Extraction in EEG
TFGA-Net's EEG encoder is specifically engineered to capture rich neural representations:
- Temporal Extraction: Multiple Conv1D layers with exponentially decreasing kernel sizes provide both broad and fine-grained temporal features, which are concatenated and merged via pointwise convolution into a unified representation (see the first sketch after this list).
- Frequency Feature Extraction: A per-channel STFT, followed by PSD and DE calculation, averages EEG power within the five canonical frequency bands, yielding compact spectral features per electrode (see the second sketch after this list).
- Cortical Topology and Non-Euclidean Modeling: EEG electrodes are structured as graph nodes, and adjacency matrices ($A_t$, $A_f$) model their spatial connectivity. Applying GCNs removes the constraint of Euclidean grid structure and facilitates representation of both long- and short-distance brain networks.
- Global Dependency Modeling: Self-attention over concatenated graph features and positional encodings captures multichannel global dependencies not available from convolution alone.
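A minimal sketch of the multi-scale temporal branch described above, assuming an illustrative kernel schedule of (64, 32, 16, 8, 4) that halves at each scale; the class name `MultiScaleTemporalEncoder` and all dimensions are hypothetical.

```python
import torch
import torch.nn as nn

class MultiScaleTemporalEncoder(nn.Module):
    """Five parallel Conv1d branches with exponentially decaying kernel
    sizes; outputs are concatenated and merged by a pointwise convolution."""

    def __init__(self, eeg_channels: int = 64, branch_dim: int = 32,
                 kernels=(64, 32, 16, 8, 4)):       # assumed kernel schedule
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv1d(eeg_channels, branch_dim, k, padding=k // 2)
             for k in kernels]
        )
        # Pointwise (1x1) convolution fuses the concatenated scales.
        self.merge = nn.Conv1d(branch_dim * len(kernels), branch_dim,
                               kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, eeg_channels, time)
        T = x.shape[-1]
        feats = [b(x)[..., :T] for b in self.branches]  # trim to common length
        return self.merge(torch.cat(feats, dim=1))

enc = MultiScaleTemporalEncoder()
y = enc(torch.randn(2, 64, 1024))                   # -> (2, 32, 1024)
```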
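And a sketch of the frequency branch under common EEG conventions: Welch's method (a PSD estimator built from STFT segments) stands in for the per-channel STFT, and DE uses the Gaussian closed form $\tfrac{1}{2}\log(2\pi e\,\sigma^2)$ with band power as the variance proxy. The band edges, sampling rate, and function name are assumptions.

```python
import numpy as np
from scipy.signal import welch

# Canonical EEG bands in Hz (delta, theta, alpha, beta, gamma); edges assumed.
BANDS = {"delta": (1, 4), "theta": (4, 8), "alpha": (8, 13),
         "beta": (13, 30), "gamma": (30, 50)}

def band_psd_de(eeg: np.ndarray, fs: float = 128.0):
    """Return (channels, 5) band-averaged PSD and DE features.

    eeg: (channels, time). DE uses the Gaussian closed form
    0.5 * log(2*pi*e*var), a common approximation in EEG work.
    """
    freqs, psd = welch(eeg, fs=fs, nperseg=min(256, eeg.shape[-1]))
    psd_feat, de_feat = [], []
    for lo, hi in BANDS.values():
        mask = (freqs >= lo) & (freqs < hi)
        band_power = psd[:, mask].mean(axis=-1)     # mean PSD per band
        psd_feat.append(band_power)
        de_feat.append(0.5 * np.log(2 * np.pi * np.e * band_power))
    return np.stack(psd_feat, axis=-1), np.stack(de_feat, axis=-1)

psd_f, de_f = band_psd_de(np.random.randn(64, 1024))  # each (64, 5)
```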
3. Fusion of EEG and Speech Features
TFGA-Net integrates neural and acoustic information through systematic fusion:
- Concatenation and Integration: Speech features and the EEG embedding are concatenated along the channel dimension, followed by a Conv1D that merges and compresses the channel information (see the fusion sketch after this list).
- Advanced Separation via MossFormer2: This separator utilizes:
- MossFormer submodule: Applies full attention within local chunks and linearized global attention across chunks, refined by convolutional gating; schematically, $Y = U \odot \sigma(\mathrm{Att}(V))$, where $U$ and $V$ are convolutionally projected features, $\mathrm{Att}$ is the attention output, $\sigma$ is the sigmoid, and $\odot$ is the elementwise product (a simplified gating unit is sketched after this list).
- RNN-Free Recurrent submodule: Uses a dilated FSMN for long-range memory together with gated convolution; schematically, $u = \mathrm{FSMN}(x)$, $g = \sigma(\mathrm{Conv}(x))$, $y = u \odot g$.
- Preservation of Rhythmic and Prosodic Speech Features: MossFormer2 is adept at maintaining speech rhythm and prosody as well as suppressing interference—critical for context-accurate speaker extraction.
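A minimal sketch of the concatenate-and-compress fusion step; the feature dimensions, the nearest-neighbor upsampling used to align EEG and speech frame rates, and all variable names are assumptions.

```python
import torch
import torch.nn as nn

# Minimal fusion: align the EEG embedding to the speech frame rate,
# concatenate along channels, then compress with a 1x1 Conv1d.
speech_feat = torch.randn(2, 256, 400)            # (batch, N_speech, frames)
eeg_feat = torch.randn(2, 64, 100)                # (batch, N_eeg, eeg_steps)

eeg_up = nn.functional.interpolate(eeg_feat, size=400, mode="nearest")
fused = torch.cat([speech_feat, eeg_up], dim=1)   # (2, 320, 400)
compress = nn.Conv1d(320, 256, kernel_size=1)     # merge/compress channels
guidance = compress(fused)                        # (2, 256, 400) -> separator
```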
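The gating idea can be illustrated with a simplified unit in which one convolutional projection gates an attention branch through a sigmoid and elementwise product. This is a schematic stand-in, not the actual MossFormer2 block: standard multi-head attention replaces the joint local-global attention, and every name and size is an assumption.

```python
import torch
import torch.nn as nn

class GatedAttentionUnit(nn.Module):
    """Simplified MossFormer-style gating: two convolutional projections,
    one passed through attention, combined via a sigmoid gate and
    elementwise product. A sketch, not the exact MossFormer2 block."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj_u = nn.Conv1d(dim, dim, kernel_size=1)
        self.proj_v = nn.Conv1d(dim, dim, kernel_size=1)
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.out = nn.Conv1d(dim, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, dim, time)
        u = self.proj_u(x)                          # gated branch
        v = self.proj_v(x).transpose(1, 2)          # (batch, time, dim)
        a, _ = self.attn(v, v, v)                   # stand-in for local/global attention
        y = u * torch.sigmoid(a.transpose(1, 2))    # gate: elementwise product
        return self.out(y)

gau = GatedAttentionUnit(dim=256)
z = gau(torch.randn(2, 256, 400))                  # -> (2, 256, 400)
```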
4. Quantitative Performance Evaluation
TFGA-Net has been benchmarked against state-of-the-art baselines on two datasets:
| Dataset | TFGA-Net SI-SDR | Key Baselines (SI-SDR) | Additional Metrics |
|---|---|---|---|
| Cocktail Party | 15.91 dB | UBESD: 8.54 dB; BASEN: 11.56 dB; M3ANet: 13.95 dB | STOI, ESTOI, PESQ improved |
| KUL | 16.9 dB | UBESD: 6.1 dB; NeuroHeed: 14.6 dB | STOI, ESTOI, PESQ improved |
TFGA-Net consistently demonstrates superior performance in scale-invariant signal-to-distortion ratio (SI-SDR), perceptual evaluation of speech quality (PESQ), and short-time objective intelligibility metrics (STOI, ESTOI), signifying both higher signal fidelity and better speech intelligibility across datasets; a reference SI-SDR computation is sketched below.
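For reference, SI-SDR has a standard closed form: project the estimate onto the reference to obtain a scaled target, then take the energy ratio of target to residual in dB. A minimal PyTorch implementation (function name and batch layout assumed):

```python
import torch

def si_sdr(est: torch.Tensor, ref: torch.Tensor,
           eps: float = 1e-8) -> torch.Tensor:
    """Scale-invariant SDR in dB for (batch, time) signals."""
    ref = ref - ref.mean(dim=-1, keepdim=True)     # zero-mean, per SI-SDR convention
    est = est - est.mean(dim=-1, keepdim=True)
    # Project the estimate onto the reference to get the scaled target.
    alpha = (est * ref).sum(-1, keepdim=True) / (ref.pow(2).sum(-1, keepdim=True) + eps)
    target = alpha * ref
    noise = est - target
    return 10 * torch.log10(target.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps))

ref = torch.randn(2, 16000)
est = ref + 0.1 * torch.randn(2, 16000)
print(si_sdr(est, ref))                            # roughly 20 dB at this noise level
```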
5. Broader Implications and Applications
TFGA-Net’s design enables several significant practical and research advances:
- Brain-Controlled Speaker Extraction: Directly utilizes listener EEG to guide target speech extraction, paving the way for “brain-driven” hearing devices.
- Assistive Hearing Technology: Real-time decoding of auditory attention supports enhanced selective hearing for users with hearing impairments, especially in multi-talker ("cocktail party") scenarios.
- Advanced Signal Processing: Multi-scale feature extraction and graph-based modeling introduce robust strategies for exploiting non-Euclidean and spatially complex physiological signals.
- Progress in Auditory Attention Decoding: By fusing EEG and speech via MossFormer2, TFGA-Net significantly refines selective attention decoding, impacting both theoretical and applied research in neuro-acoustic processing.
6. Research Context and Significance
TFGA-Net advances EEG-based speaker extraction by explicitly integrating:
- Multi-resolution and spectral EEG features that reflect ongoing neural processes related to auditory attention.
- Cortical topology via symmetric adjacency matrices, enabling biologically informed graph convolutional operations.
- Hybrid separation architecture (MossFormer2) capable of preserving both global context and localized detail, which is crucial for seamless speech separation with EEG guidance.
This architecture addresses long-standing challenges in aligning neural and acoustic representations for auditory attention decoding and suggests further work in optimizing graph modeling, separator architecture, and real-world deployment for wearable devices. A plausible implication is the expansion of TFGA-Net's principles to other cross-modal decoding problems involving neural and sensory data.