EEG-ConvTransformer: Hybrid CNN-Transformer Model

Updated 7 April 2026

EEG-ConvTransformer is a hybrid neural architecture that fuses convolutional layers and Transformer self-attention modules to capture local and global EEG signal features.
It employs serial, block-wise, and dual-branch integration strategies, including channel-attention and macaron-style modules, to refine feature granularity and interpretability.
The model delivers state-of-the-art performance in tasks like epilepsy detection, visual stimuli classification, and motor imagery decoding by balancing spatiotemporal processing.

An EEG-ConvTransformer is a hybrid neural network architecture that fuses convolutional neural networks (CNNs) with Transformer-style self-attention modules, tailored for learning spatiotemporal features in electroencephalography (EEG) signals. The aim is to combine the complementary inductive biases of CNNs (locality, translation invariance) and Transformers (global dependency modeling via self-attention) to achieve higher accuracy and efficiency than either component alone in tasks such as epilepsy detection, single-trial visual classification, motor imagery decoding, and abnormality detection. Recent architectures realize this fusion by alternating or parallelizing convolutional and attention-based operations, often adding mechanisms such as channel-wise attention, macaron-style feed-forward sublayers, and multi-scale convolutional modules to further refine feature granularity and physiological interpretability.

1. Architectural Principles and Variants

EEG-ConvTransformer architectures exhibit two dominant integration patterns: (i) serial hybrids, in which CNN stages serve as local feature tokenizers and feed their outputs into Transformer encoders; and (ii) block-wise hybrids, which nest convolutional and self-attention modules within each encoder block, often with residual or macaron-style wrappers. More advanced topologies also employ dual-branch designs that explicitly separate temporal and spatial inference.

A canonical serial hybrid is given by the EEG-ConvTransformer for single-trial visual classification (Bagchi et al., 2021), where sequential 1D/3D convolutional layers extract local spatiotemporal embeddings from C-channel EEG segments, which are then processed by a stack of Transformer encoder blocks. In contrast, EENED (Liu et al., 2023) and DBConformer (Wang et al., 26 Jun 2025) implement block-level integration: each encoder block incorporates a local convolutional module (e.g., pointwise and depthwise convolutions) directly following or interleaved with multi-head self-attention (MHSA), preserving both local, scale-resolved features and long-range information.

A third paradigm, typified by DBConformer (Wang et al., 26 Jun 2025), splits processing into dual parallel branches: a temporal Conformer branch (T-Conformer) models long-range sequence dependencies, and a spatial Conformer branch (S-Conformer) targets inter-channel (spatial) interactions. Outputs are fused via concatenation and channel-wise attention mechanisms.
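The dual-branch idea can be illustrated with a minimal NumPy sketch of how one trial might be turned into two token sequences. This is not DBConformer's actual embedding (which the text does not specify); the patch length, the mean-pooling stand-in for the two encoders, and the function names `temporal_tokens`, `spatial_tokens`, and `fuse` are all assumptions made for illustration.

```python
import numpy as np

def temporal_tokens(x, patch_len):
    """Split a (C, T) trial into non-overlapping time patches.

    Returns (n_patches, C * patch_len): one token per time patch.
    """
    c, t = x.shape
    n = t // patch_len
    x = x[:, : n * patch_len]                      # drop the incomplete tail
    # (C, n, patch_len) -> (n, C, patch_len) -> (n, C * patch_len)
    return x.reshape(c, n, patch_len).transpose(1, 0, 2).reshape(n, c * patch_len)

def spatial_tokens(x):
    """Treat each channel's full time course as one token: (C, T)."""
    return x.copy()

def fuse(temporal_feat, spatial_feat):
    """Fuse branch outputs by concatenation along the feature axis."""
    return np.concatenate([temporal_feat, spatial_feat], axis=-1)

rng = np.random.default_rng(0)
trial = rng.standard_normal((22, 1000))            # hypothetical: 22 channels, 1000 samples
tt = temporal_tokens(trial, patch_len=125)         # tokens index time
st = spatial_tokens(trial)                         # tokens index electrodes
# Mean pooling is only a placeholder for the temporal and spatial encoders.
fused = fuse(tt.mean(axis=0), st.mean(axis=0))
print(tt.shape, st.shape, fused.shape)
```

The point of the sketch is the token axis: in the temporal branch attention would operate across time patches, while in the spatial branch it would operate across electrodes.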

The following table summarizes prominent EEG-ConvTransformer variants:

Model	Hybridization Mode	Unique Components
EEG-ConvTransformer (Bagchi et al., 2021)	Serial (Conv→Trans)	Conv1D/CFE in FFN, spatial patch tokens
EENED (Liu et al., 2023)	Block-wise (Macaron)	PW-FF, MHSA, ConvModule in each encoder
DBConformer (Wang et al., 26 Jun 2025)	Parallel dual-branch	Temporal/Spatial branches, Channel-attention

2. Mathematical Building Blocks

EEG-ConvTransformer networks rely on a sequence of linear projections, convolutional mappings, and attention operations to structure their dataflow. The essential operations are as follows:

a) Convolutional Feature Extraction

Local temporal or spatiotemporal convolutions $\mathrm{Conv1D}$ or $\mathrm{Conv3D}$ extract token embeddings from raw EEG:

$H^{(1)} = \sigma(W^{(1)} * X + b^{(1)}), \quad H^{(2)} = \sigma(W^{(2)} * H^{(1)} + b^{(2)})$

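where $X\in\mathbb{R}^{B\times C\times T}$ is a batch of $B$ EEG segments with $C$ channels and $T$ time samples, $W^{(l)}$ and $b^{(l)}$ are the kernel and bias of layer $l$, $\sigma$ is a pointwise nonlinearity, and $*$ denotes (possibly grouped or depthwise) 1D/3D convolution.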
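As a concrete illustration of the two-layer mapping above, the following NumPy sketch applies two valid (unpadded) 1D convolutions with a ReLU nonlinearity. The kernel sizes, channel counts, and the choice of ReLU for $\sigma$ are assumptions for the example, not values taken from any of the cited models.

```python
import numpy as np

def conv1d(x, w, b):
    """Valid 1D cross-correlation (the deep-learning 'convolution').

    x: (B, C_in, T), w: (C_out, C_in, K), b: (C_out,)
    returns: (B, C_out, T - K + 1)
    """
    k = w.shape[-1]
    # windows has shape (B, C_in, T - K + 1, K)
    windows = np.lib.stride_tricks.sliding_window_view(x, k, axis=-1)
    return np.einsum("bctk,ock->bot", windows, w) + b[None, :, None]

def relu(z):
    return np.maximum(z, 0.0)

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8, 256))               # B=4 segments, C=8 channels, T=256 samples
W1, b1 = rng.standard_normal((16, 8, 7)) * 0.1, np.zeros(16)
W2, b2 = rng.standard_normal((32, 16, 5)) * 0.1, np.zeros(32)

H1 = relu(conv1d(X, W1, b1))                       # H^(1): (4, 16, 250)
H2 = relu(conv1d(H1, W2, b2))                      # H^(2): (4, 32, 246)
tokens = H2.transpose(0, 2, 1)                     # (B, N, d): one token per time step
print(H1.shape, H2.shape, tokens.shape)
```

Transposing the last two axes yields a sequence of per-time-step embeddings, which is the form a subsequent Transformer encoder would consume in a serial hybrid.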

b) Multi-Head Self-Attention (MHSA)

$Q = XW^Q, \quad K = XW^K, \quad V = XW^V$

$\mathrm{head}_h = \mathrm{softmax}\left(\frac{Q_hK_h^T}{\sqrt{d_k}}\right)V_h$

$\mathrm{MHSA}(X) = \mathrm{Concat}(\mathrm{head}_1,\dots,\mathrm{head}_H)W^O$

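This mechanism models long-range dependencies among temporal tokens or spatial channel embeddings. Here $X$ denotes the token sequence entering the attention block (for example, the convolutional embeddings, not the raw EEG), $Q_h$, $K_h$, $V_h$ are the slices of $Q$, $K$, $V$ assigned to head $h$, $d_k$ is the per-head key dimension, and $H$ is the number of heads.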
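The three MHSA equations can be traced step by step in a short NumPy sketch. The token count, model width, and number of heads below are arbitrary illustrative choices, and biases, masking, and dropout are omitted.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)        # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def mhsa(x, wq, wk, wv, wo, n_heads):
    """Multi-head self-attention over one token sequence.

    x: (N, d_model); wq, wk, wv, wo: (d_model, d_model)
    returns: output (N, d_model) and attention weights (H, N, N)
    """
    n, d_model = x.shape
    d_k = d_model // n_heads

    def split(m):                                  # (N, d_model) -> (H, N, d_k)
        return m.reshape(n, n_heads, d_k).transpose(1, 0, 2)

    q, k, v = split(x @ wq), split(x @ wk), split(x @ wv)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d_k))   # (H, N, N)
    heads = attn @ v                                           # (H, N, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)      # Concat(head_1..head_H)
    return concat @ wo, attn

rng = np.random.default_rng(0)
N, d_model, H = 12, 32, 4
x = rng.standard_normal((N, d_model))
wq, wk, wv, wo = (rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(4))
out, attn = mhsa(x, wq, wk, wv, wo, n_heads=H)
print(out.shape, attn.shape)
```

Each row of `attn[h]` is a probability distribution over all tokens, which is what lets a single token draw on arbitrarily distant time steps or electrodes.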

c) Convolutional Feed-Forward and Augmentation

Standard MLPs in Transformer FFN modules are typically replaced with convolution-based operators, e.g.,

$\mathrm{CFE}(U) = \sigma(W_a * U + b_a) + \sigma(W_b * U + b_b)$

with multiple kernel sizes, or (in macaron-style blocks) half-step residual pointwise feed-forward layers interleaved with attention and convolution modules:

$f_\text{ff1} = f^{(e-1)} + \frac{1}{2}\cdot\mathrm{Dropout}(\mathrm{FF}(f^{(e-1)}))$

and similarly for the second half-step feed-forward layer applied after the convolution module.
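The macaron ordering can be sketched as follows, assuming the common Conformer-style layout (half-step feed-forward, attention, convolution, half-step feed-forward, final layer normalization). The final normalization, the hidden width, and the zero-valued placeholders used for the attention and convolution sublayers are assumptions; dropout is omitted, which corresponds to inference mode.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def feed_forward(x, w1, b1, w2, b2):
    """Pointwise two-layer feed-forward network with ReLU."""
    return np.maximum(x @ w1 + b1, 0.0) @ w2 + b2

def half_step(x, f):
    """Half-step residual: x + 0.5 * f(x)."""
    return x + 0.5 * f(x)

def macaron_block(x, ff1, attn, conv, ff2):
    x = half_step(x, ff1)      # first half-step feed-forward
    x = x + attn(x)            # full residual around self-attention
    x = x + conv(x)            # full residual around the convolution module
    x = half_step(x, ff2)      # second half-step feed-forward
    return layer_norm(x)

rng = np.random.default_rng(0)
d = 16
w1, b1 = rng.standard_normal((d, 4 * d)) * 0.1, np.zeros(4 * d)
w2, b2 = rng.standard_normal((4 * d, d)) * 0.1, np.zeros(d)
ff = lambda z: feed_forward(z, w1, b1, w2, b2)
zero = lambda z: np.zeros_like(z)   # placeholder for the attention / conv sublayers

tokens = rng.standard_normal((10, d))
out = macaron_block(tokens, ff, zero, zero, ff)
print(out.shape)
```

The factor of one half means the two feed-forward layers together contribute roughly one full residual update, split symmetrically around the attention and convolution modules.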

d) Channel- and Token-wise Attention

DBConformer and related models introduce channel-attention modules that adaptively weight spatial channel contributions before fusing temporal and spatial representations.
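For intuition only, the sketch below shows a generic squeeze-and-excitation-style channel gate: per-channel features are pooled, passed through a small bottleneck, and mapped to weights in (0, 1) that rescale each channel. This is an assumed, generic formulation and should not be read as the exact channel-attention module of DBConformer or any other cited model; the bottleneck width and the sigmoid gate are illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(feat, w1, w2):
    """Generic squeeze-and-excitation-style gate (assumed form).

    feat: (C, D) per-channel feature vectors
    w1: (C, r) bottleneck weights, w2: (r, C) expansion weights
    returns: reweighted features (C, D) and channel weights (C,)
    """
    squeezed = feat.mean(axis=1)                   # squeeze: one scalar per channel
    hidden = np.maximum(squeezed @ w1, 0.0)        # bottleneck with ReLU
    weights = sigmoid(hidden @ w2)                 # excitation: weights in (0, 1)
    return feat * weights[:, None], weights

rng = np.random.default_rng(0)
C, D, r = 22, 40, 4                                # hypothetical sizes
feat = rng.standard_normal((C, D))
w1 = rng.standard_normal((C, r)) * 0.5
w2 = rng.standard_normal((r, C)) * 0.5
weighted, weights = channel_attention(feat, w1, w2)
print(weights.round(2))
```

The resulting `weights` vector has one entry per electrode, which is the kind of quantity that can later be plotted as a scalp map.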

3. Data Pipelines and Preprocessing Procedures

EEG-ConvTransformer models operate on segmented, preprocessed EEG windows. A standard pipeline consists of:

Bandpass filtering (e.g., 0.5–70 Hz for EENED (Liu et al., 2023))
Notch filtering (e.g., 50 Hz to suppress power-line interference)
Artifact rejection (e.g., amplitude thresholding, interpolation)
Segmentation into fixed or sliding windows (length varies by application; 1–23 s or as little as 220 ms for visual classification (Sharma et al., 2024))
Channel-wise normalization (typically z-score)
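The last two steps of the pipeline (windowing and channel-wise z-scoring) can be sketched in NumPy as below. Filtering and artifact rejection are deliberately left out, and the sampling rate, window length, and overlap are assumptions chosen for the example.

```python
import numpy as np

def segment(recording, win_len, step):
    """Cut a continuous (C, T) recording into sliding windows.

    returns: (n_windows, C, win_len); assumes T >= win_len
    """
    _, t = recording.shape
    starts = range(0, t - win_len + 1, step)
    return np.stack([recording[:, s : s + win_len] for s in starts])

def zscore_per_channel(windows, eps=1e-8):
    """Standardize each channel of each window to zero mean and unit variance."""
    mu = windows.mean(axis=-1, keepdims=True)
    sd = windows.std(axis=-1, keepdims=True)
    return (windows - mu) / (sd + eps)

fs = 250                                           # assumed sampling rate in Hz
rng = np.random.default_rng(0)
recording = rng.standard_normal((19, 10 * fs)) * 30.0 + 5.0   # 19 channels, 10 s
windows = segment(recording, win_len=2 * fs, step=fs)         # 2 s windows, 50% overlap
normalized = zscore_per_channel(windows)
print(windows.shape)
```

Normalizing per window and per channel removes amplitude offsets between electrodes and sessions; whether statistics are computed per window, per recording, or per subject is a design choice that varies between studies.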

Certain frameworks (e.g., DBConformer (Wang et al., 26 Jun 2025)) further apply Euclidean alignment or spatial filtering during input embedding, reflecting dataset-specific variance or harmonization procedures.
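Euclidean alignment is commonly defined as whitening every trial by the inverse square root of the mean spatial covariance computed over a subject's or session's trials. The sketch below follows that common definition; how exactly any cited model applies it is not stated here, and the covariance normalization by the number of samples is an assumption.

```python
import numpy as np

def euclidean_alignment(trials):
    """Align trials so that their mean spatial covariance is the identity.

    trials: (N, C, T) trials from one subject or session
    returns: aligned trials of the same shape
    """
    n_samples = trials.shape[-1]
    covs = trials @ trials.transpose(0, 2, 1) / n_samples     # (N, C, C)
    r_mean = covs.mean(axis=0)                                 # reference covariance
    vals, vecs = np.linalg.eigh(r_mean)                        # symmetric eigendecomposition
    r_inv_sqrt = (vecs / np.sqrt(vals)) @ vecs.T               # R^(-1/2)
    return r_inv_sqrt @ trials                                 # broadcast over trials

rng = np.random.default_rng(0)
mixing = rng.standard_normal((8, 8))                           # induces channel correlations
trials = mixing @ rng.standard_normal((20, 8, 200))
aligned = euclidean_alignment(trials)
mean_cov = (aligned @ aligned.transpose(0, 2, 1) / aligned.shape[-1]).mean(axis=0)
print(np.round(mean_cov, 3))
```

After alignment the printed mean covariance is the identity matrix up to floating-point error, which is what makes data from different subjects more comparable before they enter the network.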

4. Training Regimes and Hyperparameterization

Training protocols generally adhere to the following structure:

Optimizer: Adam or AdamW.
Batch sizes: 32, 64, or higher, subject to GPU memory and dataset scale.
Epochs: 50–200, with early stopping on validation loss.
Dropout: applied in attention/feed-forward layers, with higher rates in spatial/temporal convolutions for robustness.
Model sizes: embedding dimension, number of attention heads (as low as one in TransformEEG (Pup et al., 10 Jul 2025)), and number of encoder blocks vary across models.
Data augmentation: not universally applied; methods such as mixup, temporal masking, or phase noise are proposed as future improvements (Liu et al., 2023).
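Early stopping on validation loss, mentioned in the list above, can be sketched as a small helper. The patience of three epochs, the `min_delta` threshold, and the simulated loss curve are assumptions for illustration; in practice the loop body would run one training epoch and one validation pass.

```python
class EarlyStopping:
    """Stop training when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=10, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Simulated validation losses standing in for a real training loop.
val_losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.73, 0.6]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, loss in enumerate(val_losses):
    if stopper.step(loss):
        stopped_at = epoch
        break

print(stopped_at, stopper.best)
```

In this simulated run training halts at epoch index 5, after three consecutive epochs without improvement over the best loss of 0.7, so the later value of 0.6 is never reached. This illustrates the usual trade-off in choosing the patience.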

Ablations indicate that both convolutional and attention components are essential: removing the convolutional modules or the CFE reduces accuracy by 1–4 percentage points in state-of-the-art configurations (Bagchi et al., 2021, Liu et al., 2023).

5. Empirical Performance and Comparative Evaluation

EEG-ConvTransformer architectures are reported to outperform canonical CNNs, RNNs, and pure-Transformer models across tasks and benchmarks:

Epilepsy detection (EENED (Liu et al., 2023), Andrzejak dataset): accuracy 0.982, F1 0.989; both the Dense-CNN and Transformer-only baselines plateau at lower values.
Single-trial visual stimuli classification (Bagchi & Bathula (Bagchi et al., 2021)): EEG-ConvTransformer achieves 52.3% accuracy (macro-F1 0.54), outperforming pure CNN and Transformer-only variants.
Motor imagery decoding (DBConformer (Wang et al., 26 Jun 2025)): 80.6% accuracy in the MI-CO setting, with results also reported for the MI-CV and MI-LOSO settings, exceeding IFNet and EEG Conformer while using an order of magnitude fewer parameters.
Abnormality detection (CwA-T (Zhao et al., 2024), TUH Abnormal): 85.0% accuracy, outperforming Deep4Conv and FusionCNN.
Generalizability (TransformEEG (Pup et al., 10 Jul 2025), multi-dataset PD detection): median balanced accuracy of 80.1%, with the lowest interquartile range among seven deep-learning baselines on N-LNSO splits.

Summary of reported results:

Model	Dataset/Task	Accuracy (%)	F1 or other metric
EENED	Epilepsy (Andrzejak)	98.2	0.989 (F1)
EEG-ConvTransformer	Visual stimuli	52.3	0.54 (macro-F1)
DBConformer	Motor imagery (MI-CO)	80.6	n/a
CwA-T	TUH Abnormal EEG	85.0	76.2 (sensitivity)
TransformEEG	PD detection	80.1 (balanced accuracy)	n/a

These performance gains are attributed to the ability of EEG-ConvTransformer models to integrate local EEG transients (spikes, sharp waves, spectral bursts) and longer-range temporal or spatial dependencies (cross-sensor interactions, epileptiform patterns, distributed visual representations).
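Because the reported results mix accuracy, macro-F1, and balanced accuracy, it helps to see how the three differ on the same predictions. The sketch below computes them from a confusion matrix using their standard definitions; the toy labels are invented for the example and are unrelated to any reported result.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows index the true class, columns the predicted class."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (np.asarray(y_true), np.asarray(y_pred)), 1)
    return cm

def accuracy(cm):
    return np.trace(cm) / cm.sum()

def balanced_accuracy(cm):
    """Mean of per-class recall; insensitive to class imbalance."""
    recall = np.diag(cm) / np.maximum(cm.sum(axis=1), 1)
    return recall.mean()

def macro_f1(cm):
    """Unweighted mean of per-class F1 scores."""
    tp = np.diag(cm)
    precision = tp / np.maximum(cm.sum(axis=0), 1)
    recall = tp / np.maximum(cm.sum(axis=1), 1)
    denom = precision + recall
    f1 = np.where(denom > 0, 2 * precision * recall / np.maximum(denom, 1e-12), 0.0)
    return f1.mean()

# Toy, imbalanced two-class example (hypothetical labels).
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
cm = confusion_matrix(y_true, y_pred, n_classes=2)
print(cm)
print(accuracy(cm), balanced_accuracy(cm), macro_f1(cm))
```

On this toy example plain accuracy is 4/6, while balanced accuracy and macro-F1 are both 0.625, showing how the class-averaged metrics penalize errors on the minority class more heavily.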

6. Interpretability, Visualization, and Physiological Validity

Neurophysiological interpretability is facilitated both by explicit architectural design (e.g., channelwise autoencoders (Zhao et al., 2024), channel-attention (Wang et al., 26 Jun 2025), density-purified intermediate outputs (Ding et al., 2024)) and by systematic post-hoc visualization:

t-SNE embedding projections from T-Conformer and fused outputs demonstrate increased class separability when convolutional and attention-based features are combined (Wang et al., 26 Jun 2025).
Channel-attention maps consistently peak at task-relevant electrodes, e.g., sensorimotor cortex (C3, Cz, C4) for MI, confirming alignment with established neurophysiological priors.
Saliency maps and heatmaps: DIP-derived (IP-Unit) visualizations in EEG-Deformer (an advanced ConvTransformer) highlight relevant cortical regions for cognitive attention, fatigue, and workload tasks (Ding et al., 2024).

Multi-scale and channel-specific feature modules permit per-electrode or cross-temporal feature inspection, supporting hypotheses about underlying neural dynamics beyond black-box predictions.

7. Limitations and Future Research Directions

Identified limitations include:

A reliance on convolutional receptive fields for temporal locality in the absence of explicit positional encoding (Liu et al., 2023);
Potential hyperparameter sensitivity, particularly kernel sizes and the number/depth of hybrid blocks;
Evaluation predominantly on single public datasets, so generalization to clinical, artifact-laden, or class-imbalanced regimes remains to be demonstrated.

Prioritized research avenues:

Integration of learned or relative positional encodings to mitigate implicit inductive biases;
Multi-scale and multi-branch convolution modules for finer temporal granularity;
Direct modeling of channel-graph or spatial adjacency matrices (e.g., via spatial transformers or GNN modules);
Data augmentation (e.g., mixup, spectrogram perturbation) for broader robustness;
Extending to multi-class, multi-modal, and cross-domain tasks (e.g., epilepsy subtypes, sleep staging, multimodal fMRI-EEG fusion).

A plausible implication is that the hybrid ConvTransformer architecture, when equipped with advanced regularization, physiologically aligned attention heads, and appropriately large datasets, offers a tractable foundation for both accurate EEG decoding and mechanistic neuroscientific inquiry.

References:

EENED: End-to-End Neural Epilepsy Detection based on Convolutional Transformer (Liu et al., 2023)
DBConformer: Dual-Branch Convolutional Transformer for EEG Decoding (Wang et al., 26 Jun 2025)
EEG-ConvTransformer for Single-Trial EEG based Visual Stimuli Classification (Bagchi et al., 2021)
Channelwise AutoEncoder with Transformer for EEG Abnormality Detection (Zhao et al., 2024)
TransformEEG: Towards Improving Model Generalizability in Deep Learning-based EEG Parkinson's Disease Detection (Pup et al., 10 Jul 2025)