TFGA-Net Model for EEG-Guided Speaker Extraction
- TFGA-Net is an integrated deep learning model that fuses multi-scale temporal-frequency EEG and acoustic features using graph attention mechanisms for brain-controlled speaker extraction.
- It employs adaptive convolutional layers, graph convolutions, and self-attention to effectively model cortical topology and enhance signal-to-distortion ratios.
- The model's superior performance in SI-SDR, STOI, and PESQ metrics highlights its potential for real-time assistive hearing devices and advanced neuro-acoustic applications.
TFGA-Net, or Temporal-Frequency Graph Attention Network, is an integrated deep learning architecture designed for brain-controlled speaker extraction using electroencephalography (EEG) signals. This model leverages the neural activity of listeners to extract the target speaker's voice from complex auditory scenes. TFGA-Net systematically addresses the challenge of mapping common information between EEG and speech by employing multi-scale temporal-frequency feature extraction, graph-based modeling of cortical topology, and advanced fusion mechanisms. The architecture demonstrates consistent improvements over state-of-the-art methods in multiple objective benchmarks for the task of EEG-driven speech separation.
1. Architectural Composition
TFGA-Net is structured into four main modules:
- Speech Encoder: A single-layer 1D convolution with ReLU activation transforms the raw speech mixture $x \in \mathbb{R}^{1 \times T}$ into an embedding $E_s \in \mathbb{R}^{N \times T'}$, with kernel size and stride chosen to downsample the temporal resolution.
- EEG Encoder: The EEG signal is processed in two branches:
- Temporal branch: Five parallel Conv1D layers with exponentially decaying kernel sizes yield multi-scale temporal features $E_t^{(i)}$ for $i = 1, \dots, 5$.
- Frequency branch: Per-channel STFT yields power spectral density (PSD) and differential entropy (DE) features, averaged within the five canonical bands ($\delta$, $\theta$, $\alpha$, $\beta$, $\gamma$) to form the spectral embedding $E_f$.
- Graph Convolution and Self-Attention:
- Temporal and frequency features are structured as graphs: each EEG channel is a node, edges are derived from cortical connectivity, and each view carries a learnable initial adjacency matrix ($A_t$ for the temporal view, $A_f$ for the frequency view).
- Graph convolution is performed per view: $H^{(l+1)} = \sigma\!\left(\tilde{A} H^{(l)} W^{(l)}\right)$, where $\sigma$ denotes BN+ELU and $\tilde{A}$ and $W^{(l)}$ are learned weights (a minimal sketch follows this list).
- Concatenated features from both views are passed through positionally encoded self-attention: $Z = \mathrm{SelfAttn}\!\left(\mathrm{Concat}(H_t, H_f) + \mathrm{PE}\right)$.
- Speaker Extraction Network:
- Fused features guide separation.
- Separation employs MossFormer2, which combines the MossFormer module (local full attention, linearized global attention, convolutional gating) with an RNN-free recurrent module (dilated FSMN, gated convolution).
- Speech Decoder: The masked speech embedding is mapped back to a waveform using a transposed 1D convolution.
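To make the per-view graph convolution concrete, below is a minimal PyTorch sketch of the update $H^{(l+1)} = \sigma(\tilde{A} H^{(l)} W^{(l)})$ with a learnable adjacency. The class name `EEGGraphConv`, the 64-electrode montage, the softmax row-normalization, and the identity initialization are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

class EEGGraphConv(nn.Module):
    """Minimal per-view graph convolution: H' = BN+ELU(A_hat @ H @ W),
    with a learnable adjacency initialized from a connectivity prior."""

    def __init__(self, num_channels: int, in_dim: int, out_dim: int,
                 init_adj: torch.Tensor):
        super().__init__()
        # Learnable adjacency, initialized from the cortical-connectivity prior.
        self.adj = nn.Parameter(init_adj.clone())
        self.weight = nn.Linear(in_dim, out_dim, bias=False)
        self.norm = nn.BatchNorm1d(num_channels)
        self.act = nn.ELU()

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, channels, features) -- one EEG electrode per graph node.
        a = torch.softmax(self.adj, dim=-1)        # row-normalize the adjacency
        h = torch.einsum("cd,bdf->bcf", a, self.weight(h))
        return self.act(self.norm(h))

# Usage: 64 electrodes, 128-d node features per view (assumed sizes).
adj0 = torch.eye(64)                               # placeholder connectivity prior
gcn = EEGGraphConv(num_channels=64, in_dim=128, out_dim=128, init_adj=adj0)
out = gcn(torch.randn(2, 64, 128))                 # -> (2, 64, 128)
```

Running one such layer per view (temporal and frequency) and concatenating the outputs mirrors the two-view design described above.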
2. Temporal-Frequency Feature Extraction in EEG
TFGA-Net's EEG encoder is specifically engineered to capture rich neural representations:
- Temporal Extraction: Multiple Conv1D layers with exponentially decreasing kernel sizes provide both broad and fine-grained temporal features, which are concatenated and merged via pointwise convolution into a unified representation (see the first sketch after this list).
- Frequency Feature Extraction: A per-channel STFT, followed by PSD and DE calculation, averages EEG power within the five canonical frequency bands, yielding compact spectral features per electrode (see the second sketch after this list).
- Cortical Topology and Non-Euclidean Modeling: EEG electrodes are structured as graph nodes, and adjacency matrices ($A_t$, $A_f$) model their spatial connectivity. Applying GCNs removes the constraint of Euclidean grid structure and facilitates representation of both long- and short-distance brain networks.
- Global Dependency Modeling: Self-attention over concatenated graph features and positional encodings captures multichannel global dependencies not available from convolution alone.
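A minimal sketch of the multi-scale temporal branch described above, assuming an illustrative kernel schedule of (64, 32, 16, 8, 4) that halves at each scale; the class name `MultiScaleTemporalEncoder` and all dimensions are hypothetical.

```python
import torch
import torch.nn as nn

class MultiScaleTemporalEncoder(nn.Module):
    """Five parallel Conv1d branches with exponentially decaying kernel
    sizes; outputs are concatenated and merged by a pointwise convolution."""

    def __init__(self, eeg_channels: int = 64, branch_dim: int = 32,
                 kernels=(64, 32, 16, 8, 4)):       # assumed kernel schedule
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv1d(eeg_channels, branch_dim, k, padding=k // 2)
             for k in kernels]
        )
        # Pointwise (1x1) convolution fuses the concatenated scales.
        self.merge = nn.Conv1d(branch_dim * len(kernels), branch_dim,
                               kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, eeg_channels, time)
        T = x.shape[-1]
        feats = [b(x)[..., :T] for b in self.branches]  # trim to common length
        return self.merge(torch.cat(feats, dim=1))

enc = MultiScaleTemporalEncoder()
y = enc(torch.randn(2, 64, 1024))                   # -> (2, 32, 1024)
```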
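And a sketch of the frequency branch under common EEG conventions: Welch's method (a PSD estimator built from STFT segments) stands in for the per-channel STFT, and DE uses the Gaussian closed form $\tfrac{1}{2}\log(2\pi e\,\sigma^2)$ with band power as the variance proxy. The band edges, sampling rate, and function name are assumptions.

```python
import numpy as np
from scipy.signal import welch

# Canonical EEG bands in Hz (delta, theta, alpha, beta, gamma); edges assumed.
BANDS = {"delta": (1, 4), "theta": (4, 8), "alpha": (8, 13),
         "beta": (13, 30), "gamma": (30, 50)}

def band_psd_de(eeg: np.ndarray, fs: float = 128.0):
    """Return (channels, 5) band-averaged PSD and DE features.

    eeg: (channels, time). DE uses the Gaussian closed form
    0.5 * log(2*pi*e*var), a common approximation in EEG work.
    """
    freqs, psd = welch(eeg, fs=fs, nperseg=min(256, eeg.shape[-1]))
    psd_feat, de_feat = [], []
    for lo, hi in BANDS.values():
        mask = (freqs >= lo) & (freqs < hi)
        band_power = psd[:, mask].mean(axis=-1)     # mean PSD per band
        psd_feat.append(band_power)
        de_feat.append(0.5 * np.log(2 * np.pi * np.e * band_power))
    return np.stack(psd_feat, axis=-1), np.stack(de_feat, axis=-1)

psd_f, de_f = band_psd_de(np.random.randn(64, 1024))  # each (64, 5)
```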
3. Fusion of EEG and Speech Features
TFGA-Net integrates neural and acoustic information through systematic fusion:
- Concatenation and Integration: Speech features and the EEG embedding are concatenated along the channel dimension, followed by a Conv1D that merges and compresses the channel information (see the fusion sketch after this list).
- Advanced Separation via MossFormer2: This separator utilizes:
- MossFormer submodule: Applies full attention within local chunks and linearized global attention across chunks, refined by convolutional gating; schematically, $Y = U \odot \sigma(\mathrm{Att}(V))$, where $U$ and $V$ are convolutionally projected features, $\mathrm{Att}$ is the attention output, $\sigma$ is the sigmoid, and $\odot$ is the elementwise product (a simplified gating unit is sketched after this list).
- RNN-Free Recurrent submodule: Uses a dilated FSMN for long-range memory together with gated convolution; schematically, $u = \mathrm{FSMN}(x)$, $g = \sigma(\mathrm{Conv}(x))$, $y = u \odot g$.
- Preservation of Rhythmic and Prosodic Speech Features: MossFormer2 is adept at maintaining speech rhythm and prosody as well as suppressing interference—critical for context-accurate speaker extraction.
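A minimal sketch of the concatenate-and-compress fusion step; the feature dimensions, the nearest-neighbor upsampling used to align EEG and speech frame rates, and all variable names are assumptions.

```python
import torch
import torch.nn as nn

# Minimal fusion: align the EEG embedding to the speech frame rate,
# concatenate along channels, then compress with a 1x1 Conv1d.
speech_feat = torch.randn(2, 256, 400)            # (batch, N_speech, frames)
eeg_feat = torch.randn(2, 64, 100)                # (batch, N_eeg, eeg_steps)

eeg_up = nn.functional.interpolate(eeg_feat, size=400, mode="nearest")
fused = torch.cat([speech_feat, eeg_up], dim=1)   # (2, 320, 400)
compress = nn.Conv1d(320, 256, kernel_size=1)     # merge/compress channels
guidance = compress(fused)                        # (2, 256, 400) -> separator
```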
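The gating idea can be illustrated with a simplified unit in which one convolutional projection gates an attention branch through a sigmoid and elementwise product. This is a schematic stand-in, not the actual MossFormer2 block: standard multi-head attention replaces the joint local-global attention, and every name and size is an assumption.

```python
import torch
import torch.nn as nn

class GatedAttentionUnit(nn.Module):
    """Simplified MossFormer-style gating: two convolutional projections,
    one passed through attention, combined via a sigmoid gate and
    elementwise product. A sketch, not the exact MossFormer2 block."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj_u = nn.Conv1d(dim, dim, kernel_size=1)
        self.proj_v = nn.Conv1d(dim, dim, kernel_size=1)
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.out = nn.Conv1d(dim, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, dim, time)
        u = self.proj_u(x)                          # gated branch
        v = self.proj_v(x).transpose(1, 2)          # (batch, time, dim)
        a, _ = self.attn(v, v, v)                   # stand-in for local/global attention
        y = u * torch.sigmoid(a.transpose(1, 2))    # gate: elementwise product
        return self.out(y)

gau = GatedAttentionUnit(dim=256)
z = gau(torch.randn(2, 256, 400))                  # -> (2, 256, 400)
```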
4. Quantitative Performance Evaluation
TFGA-Net has been benchmarked against state-of-the-art baselines on two datasets:
| Dataset | TFGA-Net SI-SDR | Key Baselines (SI-SDR) | Additional Metrics |
|---|---|---|---|
| Cocktail Party | 15.91 dB | UBESD: 8.54 dB; BASEN: 11.56 dB; M3ANet: 13.95 dB | STOI, ESTOI, PESQ improved |
| KUL | 16.9 dB | UBESD: 6.1 dB; NeuroHeed: 14.6 dB | STOI, ESTOI, PESQ improved |
TFGA-Net consistently demonstrates superior performance in scale-invariant signal-to-distortion ratio (SI-SDR), perceptual evaluation of speech quality (PESQ), and short-time objective intelligibility metrics (STOI, ESTOI), signifying both higher signal fidelity and better speech intelligibility across datasets; a reference SI-SDR computation is sketched below.
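For reference, SI-SDR has a standard closed form: project the estimate onto the reference to obtain a scaled target, then take the energy ratio of target to residual in dB. A minimal PyTorch implementation (function name and batch layout assumed):

```python
import torch

def si_sdr(est: torch.Tensor, ref: torch.Tensor,
           eps: float = 1e-8) -> torch.Tensor:
    """Scale-invariant SDR in dB for (batch, time) signals."""
    ref = ref - ref.mean(dim=-1, keepdim=True)     # zero-mean, per SI-SDR convention
    est = est - est.mean(dim=-1, keepdim=True)
    # Project the estimate onto the reference to get the scaled target.
    alpha = (est * ref).sum(-1, keepdim=True) / (ref.pow(2).sum(-1, keepdim=True) + eps)
    target = alpha * ref
    noise = est - target
    return 10 * torch.log10(target.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps))

ref = torch.randn(2, 16000)
est = ref + 0.1 * torch.randn(2, 16000)
print(si_sdr(est, ref))                            # roughly 20 dB at this noise level
```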
5. Broader Implications and Applications
TFGA-Net’s design enables several significant practical and research advances:
- Brain-Controlled Speaker Extraction: Directly utilizes listener EEG to guide target speech extraction, paving the way for “brain-driven” hearing devices.
- Assistive Hearing Technology: Real-time decoding of auditory attention supports enhanced selective hearing for users with hearing impairments, especially in multi-talker ("cocktail party") scenarios.
- Advanced Signal Processing: Multi-scale feature extraction and graph-based modeling introduce robust strategies for exploiting non-Euclidean and spatially complex physiological signals.
- Progress in Auditory Attention Decoding: By fusing EEG and speech via MossFormer2, TFGA-Net significantly refines selective attention decoding, impacting both theoretical and applied research in neuro-acoustic processing.
6. Research Context and Significance
TFGA-Net advances EEG-based speaker extraction by explicitly integrating:
- Multi-resolution and spectral EEG features that reflect ongoing neural processes related to auditory attention.
- Cortical topology via symmetric adjacency matrices, enabling biologically informed graph convolutional operations.
- Hybrid separation architecture (MossFormer2) capable of preserving both global context and localized detail, which is crucial for seamless speech separation with EEG guidance.
This architecture addresses long-standing challenges in aligning neural and acoustic representations for auditory attention decoding and suggests further work in optimizing graph modeling, separator architecture, and real-world deployment for wearable devices. A plausible implication is the expansion of TFGA-Net's principles to other cross-modal decoding problems involving neural and sensory data.