
EEG-Transformer: Self-attention from Transformer Architecture for Decoding EEG of Imagined Speech (2112.09239v1)

Published 15 Dec 2021 in cs.HC, cs.SD, and eess.AS

Abstract: Transformers are groundbreaking architectures that have changed the course of deep learning, and many high-performance models are being developed on the basis of the transformer architecture. Transformers are implemented only with attention, in an encoder-decoder structure following seq2seq without any RNN, yet they outperform RNNs. Herein, we investigate a decoding technique for electroencephalography (EEG) recorded during imagined speech and overt speech, built around the self-attention module from the transformer architecture. We performed classification for nine subjects using a convolutional neural network based on EEGNet that captures temporal-spectral-spatial features from EEG of imagined speech and overt speech. Furthermore, we applied the self-attention module to EEG decoding to improve performance and lower the number of parameters. Our results demonstrate the possibility of decoding brain activity during imagined speech and overt speech using attention modules. In addition, single-channel EEG or ear-EEG could be used to decode imagined speech for practical BCIs.

Citations (37)

Summary

  • The paper presents a novel EEG-Transformer that combines CNN-based feature extraction with a self-attention mechanism to decode both overt and imagined speech from EEG signals.
  • It leverages a hybrid architecture inspired by EEGNet and Transformer encoders to extract temporal, spectral, and spatial patterns with reduced computational complexity.
  • It achieved accuracies of 49.5% for overt speech and 35.07% for imagined speech, underscoring its potential for practical, low-channel brain-computer interface applications.

This paper introduces "EEG-Transformer," a novel approach for decoding imagined and overt speech from electroencephalography (EEG) signals using a self-attention mechanism derived from the Transformer architecture. The primary goal is to improve the performance and practicality of Brain-Computer Interfaces (BCIs) for communication.

The authors address the challenge of accurately recognizing human intentions, specifically speech, from non-invasive EEG recordings. While deep learning models like Convolutional Neural Networks (CNNs) have been applied to EEG decoding, this work explores the benefits of Transformer models, which have shown significant success in natural language processing and other domains.

Methodology and Implementation:

  1. Data Acquisition and Preprocessing:
    • EEG data was collected from nine subjects performing overt and imagined speech tasks.
    • The tasks involved 12 specific words ("ambulance," "clock," "hello," "help me," "light," "pain," "stop," "thank you," "toilet," "TV," "water," "yes") and a resting state, resulting in 13 classification classes.
    • Each trial was 2 seconds long. EEG signals were down-sampled to 250 Hz.
    • Preprocessing involved the following steps (a minimal code sketch appears after this list):
      • Band-pass filtering in the high-gamma band (30-120 Hz) using a 5th order Butterworth filter.
      • Baseline correction using the average of 500 ms before trial onset.
      • Channel selection focused on 10 channels located over Broca's and Wernicke's areas (AF3, F3, F5, FC3, FC5, T7, C5, TP7, CP5, and P5).
      • Artifact removal (EOG, EMG) using Independent Component Analysis (ICA).
    • Software used: Python, Matlab, OpenBMI Toolbox, BBCI Toolbox, and EEGLAB.
  2. Proposed Architecture (EEG-Transformer):
    • The core idea is to combine the feature extraction capabilities of CNNs (specifically, an EEGNet-like structure) with the sequence modeling power of Transformer's self-attention module.
    • Input: Raw EEG signals (C channels × T time points).
    • Initial Convolutional Layers:
      • The framework begins with convolutional layers designed to capture temporal, spectral, and spatial features from the EEG data.
      • The first layer's kernel size is related to the sampling frequency, mimicking a band-pass filter.
    • Self-Attention Module:
      • The output from the convolutional layers (feature maps) is treated as a sequence of vectors.
      • These vectors are linearly embedded, position embeddings are added, and the sequence is fed into a standard Transformer encoder (as shown in Figure 1 of the paper).
      • The Transformer encoder (detailed in Figure 2) consists of multi-head self-attention and feed-forward networks, with residual connections and layer normalization.
      • A learnable "classification token" is added to the sequence, and its corresponding output from the Transformer encoder is used for classification.
    • Output Layer: A classification layer predicts one of the 13 classes.
    • Loss Function: Squared hinge loss, chosen for its similarity to the margin loss of Support Vector Machines (SVMs), which have shown robustness in imagined speech decoding.
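Referring back to the preprocessing in step 1, the sketch below illustrates that kind of pipeline with SciPy/NumPy: a 5th-order Butterworth band-pass filter (30-120 Hz), baseline correction against the 500 ms pre-trial mean, and selection of the ten listed channels. It is a minimal illustration under assumed array shapes and channel ordering, not the authors' actual pipeline (which used the OpenBMI, BBCI, and EEGLAB toolboxes); the ICA-based EOG/EMG removal is omitted here and would normally be done with a dedicated toolbox before epoching.

import numpy as np
from scipy.signal import butter, filtfilt

FS = 250  # sampling rate after down-sampling (Hz)
PICKS = ["AF3", "F3", "F5", "FC3", "FC5", "T7", "C5", "TP7", "CP5", "P5"]

def preprocess_trial(eeg, ch_names, fs=FS):
    """eeg: (n_channels, n_samples) array covering 500 ms baseline + 2 s trial."""
    # 5th-order Butterworth band-pass over 30-120 Hz
    b, a = butter(5, [30, 120], btype="bandpass", fs=fs)
    eeg = filtfilt(b, a, eeg, axis=-1)

    # Baseline correction: subtract the mean of the 500 ms before trial onset
    n_base = int(0.5 * fs)
    eeg = eeg - eeg[:, :n_base].mean(axis=-1, keepdims=True)

    # Keep only the 10 channels over Broca's and Wernicke's areas
    idx = [ch_names.index(ch) for ch in PICKS]
    return eeg[idx, n_base:]  # (10 channels, 2 s * fs samples)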

Input EEG (Channels, Timepoints)
   │
   ▼
Temporal Convolution (EEGNet-style, acts like band-pass filter)
   │
   ▼
Spatial/Depthwise Convolution (EEGNet-style, extracts spatial features)
   │
   ▼
Separable Convolution (EEGNet-style, further feature extraction)
   │
   ▼
Feature Maps (Patches/Embeddings)
   │
   ▼
Linear Embedding + Positional Encoding
   │
   ▼
Transformer Encoder Block:
┌───────────────────────────┐
│  Multi-Head Self-Attention│
│            │              │
│  Add & Layer Norm         │
│            │              │
│  Feed Forward Network     │
│            │              │
│  Add & Layer Norm         │
└───────────────────────────┘
   │ (Repeat N times)
   ▼
Output from Classification Token
   │
   ▼
Classification Layer (e.g., Linear + Softmax)
   │
   ▼
Predicted Class (13 classes)
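As a concrete reading of the diagram above, the following PyTorch sketch combines an EEGNet-style convolutional front-end with a Transformer encoder and a learnable classification token. It is not the authors' implementation: the kernel sizes, embedding dimension, number of heads and layers, pooling factor, and the use of nn.TransformerEncoder are assumptions chosen for illustration.

import torch
import torch.nn as nn

class EEGTransformerSketch(nn.Module):
    """EEGNet-style convolutions followed by a Transformer encoder (illustrative sketch)."""

    def __init__(self, n_channels=10, n_samples=500, n_classes=13,
                 f1=8, depth=2, d_model=64, n_heads=4, n_layers=2, fs=250):
        super().__init__()
        # Temporal convolution: kernel width tied to the sampling rate, acting like a band-pass filter
        self.temporal = nn.Conv2d(1, f1, (1, fs // 2), padding=(0, fs // 4), bias=False)
        # Depthwise convolution across all EEG channels extracts spatial patterns
        self.spatial = nn.Conv2d(f1, f1 * depth, (n_channels, 1), groups=f1, bias=False)
        self.features = nn.Sequential(nn.BatchNorm2d(f1 * depth), nn.ELU(),
                                      nn.AvgPool2d((1, 8)), nn.Dropout(0.25))

        # Work out how many time-step tokens the convolutional front-end produces
        with torch.no_grad():
            n_tokens = self._tokenize(torch.zeros(1, 1, n_channels, n_samples)).shape[1]

        self.embed = nn.Linear(f1 * depth, d_model)                      # linear embedding of each token
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))        # learnable classification token
        self.pos = nn.Parameter(torch.zeros(1, n_tokens + 1, d_model))   # positional embeddings
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, n_classes)

    def _tokenize(self, x):
        x = self.features(self.spatial(self.temporal(x)))  # (B, f1*depth, 1, T')
        return x.squeeze(2).transpose(1, 2)                 # (B, T', f1*depth)

    def forward(self, x):                                    # x: (batch, 1, channels, samples)
        tok = self.embed(self._tokenize(x))                              # (B, T', d_model)
        cls = self.cls_token.expand(tok.shape[0], -1, -1)
        z = self.encoder(torch.cat([cls, tok], dim=1) + self.pos)        # prepend cls token
        return self.head(z[:, 0])                            # classify from the cls-token output

A forward pass on a batch shaped (batch, 1, 10, 500), i.e. ten channels over a 2 s trial at 250 Hz, returns logits over the 13 classes.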

  3. Training and Evaluation:
    • 5-fold cross-validation was used.
    • Models were trained for 1000 epochs.
    • Performance was evaluated based on classification accuracy; with 13 classes, the theoretical chance level is 1/13 ≈ 7.7%.
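As a hedged illustration of this training setup, the snippet below sketches 5-fold cross-validation with a multi-class squared hinge loss (one-vs-all ±1 target coding, in the spirit of the margin loss mentioned above). The full-batch updates, Adam optimizer, and target encoding are assumptions for illustration rather than details confirmed by the paper; EEGTransformerSketch refers to the model sketch above.

import torch
from sklearn.model_selection import KFold

def squared_hinge_loss(logits, targets):
    """logits: (B, n_classes); targets: (B,) long tensor of class indices."""
    y = torch.full_like(logits, -1.0)        # one-vs-all coding: +1 for the true class, -1 otherwise
    y.scatter_(1, targets.unsqueeze(1), 1.0)
    return torch.clamp(1.0 - y * logits, min=0.0).pow(2).mean()

def cross_validate(X, y, n_epochs=1000, folds=5, lr=1e-3):
    """X: (n_trials, 1, channels, samples) float tensor; y: (n_trials,) long labels."""
    accs = []
    for train_idx, test_idx in KFold(folds, shuffle=True, random_state=0).split(X):
        model = EEGTransformerSketch()
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        for _ in range(n_epochs):                       # full-batch updates, for brevity
            opt.zero_grad()
            loss = squared_hinge_loss(model(X[train_idx]), y[train_idx])
            loss.backward()
            opt.step()
        model.eval()
        with torch.no_grad():
            pred = model(X[test_idx]).argmax(dim=1)
            accs.append((pred == y[test_idx]).float().mean().item())
    return sum(accs) / len(accs)                        # mean accuracy across the 5 folds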

Results and Discussion:

  • Decoding Performance:
    • Overt Speech: Achieved an average accuracy of 49.5% across 9 subjects for 13 classes. The EEG signals during overt speech are expected to contain more distinct brain activity related to speech articulation, even after EMG artifact removal.
    • Imagined Speech: Achieved an average accuracy of 35.07% across 9 subjects for 13 classes. This performance is lower than overt speech, as imagined speech relies solely on internal mental processes without muscle movement.
    • The difference in performance between overt and imagined speech was statistically significant (p < 0.05), but the authors noted it was "not so huge," suggesting the potential of imagined speech BCIs.
  • Benefits of the Attention Module:
    • The paper argues that the self-attention module contributes to reasonable performance while offering advantages such as:
      • Reduced total computational complexity per layer compared to some recurrent or fully convolutional architectures.
      • Increased parallelizability of computations.
      • Shorter path lengths for modeling long-term dependencies in the EEG time series.
  • Practical BCI Implications:
    • The paper emphasizes the potential for developing robust BCI systems with simpler hardware.
    • The results suggest that decoding imagined speech is feasible even with a limited number of channels (the 10 selected channels), or potentially with single-channel EEG or ear-EEG (though detailed results for these are not extensively presented in the provided text). This is crucial for user comfort and real-world applicability; the authors note that channels near the left ear lie close to Broca's and Wernicke's areas.

Conclusion and Future Directions:

The paper concludes that the proposed attention module based on the Transformer architecture shows promise for decoding imagined speech from EEG signals. The achieved accuracies are considered "reasonable" and highlight the potential of this technology for developing practical communication systems for individuals who cannot speak. Future work includes:

  • Developing architectures with higher performance for imagined speech.
  • Optimizing the parameters of the self-attention module to further increase performance.

Implementation Considerations:

  • Computational Resources: While Transformers can be computationally intensive, the paper suggests their architecture can reduce complexity per layer and allow for more parallelization than RNNs. The specific model size (number of layers, heads, embedding dimensions) would determine the exact requirements.
  • Data Requirements: Training deep learning models, including Transformers, typically requires substantial amounts of data. The paper used 300 trials per condition per subject.
  • Subject Variability: EEG signals are highly subject-specific. The model's performance on new subjects without subject-specific fine-tuning or transfer learning techniques would be a key challenge for practical deployment. The paper reports average accuracies, implying subject-specific models or some form of aggregation.
  • Real-time Performance: For a BCI communication system, real-time inference is critical. The efficiency of the proposed EEG-Transformer, particularly the convolutional front-end and the self-attention mechanism, would need to be optimized for low-latency processing.
  • Channel Selection: The focus on 10 channels, and the mention of single-channel or ear-EEG, points towards efforts to reduce the obtrusiveness of BCI systems. Implementing systems with fewer channels simplifies hardware and improves user experience.
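As a rough check on the real-time point above, a single-trial forward pass of the earlier model sketch can be timed as below; the hardware, batch shape, and the model itself are assumptions, and deployed latency would depend on the actual implementation.

import time
import torch

model = EEGTransformerSketch().eval()   # model sketch from the architecture section
trial = torch.randn(1, 1, 10, 500)      # one 2 s trial: 10 channels at 250 Hz

with torch.no_grad():
    model(trial)                        # warm-up pass
    t0 = time.perf_counter()
    for _ in range(100):
        model(trial)
    print(f"mean inference latency: {(time.perf_counter() - t0) / 100 * 1e3:.2f} ms")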

This work demonstrates a step towards integrating advanced deep learning architectures like Transformers into the BCI field, specifically for the challenging task of imagined speech decoding. The combination of CNN-based feature extraction with self-attention offers a powerful framework for capturing relevant spatio-temporal dynamics in EEG data.