sEMG Gesture Recognition Advances
- sEMG gesture recognition is a technique that decodes intentional limb movements from muscle electrical activity recorded by surface electrode arrays, ranging from a few channels to high-density grids.
- It leverages advanced deep learning architectures and precise preprocessing to extract robust spatiotemporal features for real-time gesture classification.
- Applications span myoelectric prostheses, human-computer interfaces, exoskeleton control, and neurorehabilitation, addressing challenges like signal variability and latency.
Surface electromyography (sEMG) gesture recognition is a computational paradigm focused on inferring intentional hand or limb gestures from the electrical potentials recorded non-invasively at the skin surface over active muscles. sEMG gesture recognition has become foundational for myoelectric prosthesis control, human-computer interaction, exoskeletons, and neurorehabilitation interfaces. This problem is characterized by high spatiotemporal signal complexity, substantial inter-session and inter-subject variability, and requirements for low-latency, robust inference in real-world scenarios.
1. Signal Acquisition, Representation, and Preprocessing
sEMG signals are typically acquired from forearm or hand muscle groups using arrays of surface electrodes (e.g., 8 to 128 channels) and digitized at 200–2048 Hz. High-density sEMG (HD-sEMG) systems (e.g., two 8×8 grids) provide rich spatial information crucial for discriminating intricate gestures (Zhong et al., 2023). Standard preprocessing pipelines include band-pass filtering (e.g., 20–450 Hz), power-line notch filtering, channel-wise normalization, and windowing (100–750 ms, 50–75% overlap). Some pipelines apply explicit feature extraction, e.g., mean absolute value (MAV), root mean square (RMS), zero crossings (ZC), waveform length (WL), autoregressive (AR) coefficients, and power spectral density (PSD), while contemporary deep-learning approaches often operate directly on multichannel time series or image representations (e.g., 16×8 HD-sEMG maps) (Islam et al., 2023, Josephs et al., 2020).
Key preprocessing steps:
- Window segmentation: Overlapping windows, typically 250–640 ms, to balance latency and discriminability (Qiao et al., 2024, Zhong et al., 2023).
- Feature extraction: Time- and frequency-domain descriptors, channel covariance matrices, functional connectivity graphs (Dash et al., 20 Oct 2025, Zhong et al., 2023).
- Normalization: Per-window/channel z-score or min-max normalization for distributional stability (Gowda et al., 2023).
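The steps listed above can be sketched in a few lines of NumPy. This is a minimal illustration, not any cited pipeline: the function names, the 1000 Hz sampling rate, the 250 ms window, and the 50% overlap are assumptions chosen from within the ranges quoted above, and band-pass and notch filtering are assumed to have been applied upstream.

```python
import numpy as np

def sliding_windows(emg, fs, win_ms=250, overlap=0.5):
    """Split a (samples, channels) recording into overlapping windows."""
    win = int(round(fs * win_ms / 1000))
    hop = max(1, int(round(win * (1 - overlap))))
    starts = range(0, emg.shape[0] - win + 1, hop)
    return np.stack([emg[s:s + win] for s in starts])  # (n_windows, win, channels)

def time_domain_features(windows, zc_threshold=0.0):
    """MAV, RMS, WL and ZC per window and channel -> (n_windows, 4 * channels)."""
    mav = np.mean(np.abs(windows), axis=1)
    rms = np.sqrt(np.mean(windows ** 2, axis=1))
    diff = np.diff(windows, axis=1)
    wl = np.sum(np.abs(diff), axis=1)                   # waveform length
    sign_change = windows[:, :-1] * windows[:, 1:] < 0  # consecutive samples differ in sign
    zc = np.sum(sign_change & (np.abs(diff) > zc_threshold), axis=1)
    return np.concatenate([mav, rms, wl, zc], axis=1)

def zscore_per_channel(windows, eps=1e-8):
    """Z-score every channel independently inside each window."""
    mean = windows.mean(axis=1, keepdims=True)
    std = windows.std(axis=1, keepdims=True)
    return (windows - mean) / (std + eps)

# Synthetic stand-in for a filtered 8-channel recording (2 s at an assumed 1000 Hz).
rng = np.random.default_rng(0)
fs = 1000
emg = rng.standard_normal((2000, 8))

windows = sliding_windows(emg, fs, win_ms=250, overlap=0.5)
features = time_domain_features(windows)
normalized = zscore_per_channel(windows)
print(windows.shape, features.shape, normalized.shape)
```

With a 250 ms window and 50% overlap, a new window becomes available every 125 ms, which is the quantity that matters for the latency side of the trade-off.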
Encoding muscle activation as spatial graphs (nodes = electrodes, edges = functional links) or embedding covariance matrices on the SPD manifold are methods specifically developed to capture non-Euclidean and global spatial dependencies (Zhong et al., 2023, Dash et al., 20 Oct 2025, Gowda et al., 2023).
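As a rough illustration of these two representations, the sketch below builds a k-NN functional graph from channel correlations and maps a window's covariance matrix to a tangent-space vector. It is a sketch under stated assumptions: absolute Pearson correlation as the functional link, a small identity shrinkage to keep the covariance positive definite, and a log-Euclidean map at the identity are illustrative choices, not the exact constructions used in the cited works.

```python
import numpy as np

def knn_adjacency(window, k=2):
    """Symmetric k-NN graph over channels, ranked by absolute Pearson correlation."""
    corr = np.abs(np.corrcoef(window.T))            # (channels, channels)
    np.fill_diagonal(corr, -np.inf)                 # never pick a channel as its own neighbour
    neighbours = np.argsort(corr, axis=1)[:, -k:]   # k most correlated channels per row
    adj = np.zeros_like(corr)
    rows = np.repeat(np.arange(corr.shape[0]), k)
    adj[rows, neighbours.ravel()] = 1.0
    return np.maximum(adj, adj.T)                   # make the graph undirected

def spd_covariance(window, shrink=1e-3):
    """Channel covariance with a small shrinkage so it stays strictly positive definite."""
    centered = window - window.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (window.shape[0] - 1)
    n = cov.shape[0]
    return cov + shrink * np.trace(cov) / n * np.eye(n)

def log_map(spd):
    """Matrix logarithm of an SPD matrix via its eigendecomposition."""
    eigvals, eigvecs = np.linalg.eigh(spd)
    return (eigvecs * np.log(eigvals)) @ eigvecs.T

def tangent_vector(spd):
    """Upper triangle of log(C); off-diagonals scaled by sqrt(2) to preserve the norm."""
    log_c = log_map(spd)
    iu = np.triu_indices(log_c.shape[0])
    weights = np.where(iu[0] == iu[1], 1.0, np.sqrt(2.0))
    return log_c[iu] * weights

# Synthetic 8-channel window (250 samples) standing in for real sEMG.
rng = np.random.default_rng(0)
window = rng.standard_normal((250, 8))

adjacency = knn_adjacency(window, k=2)
cov = spd_covariance(window)
embedding = tangent_vector(cov)
print(adjacency.shape, embedding.shape)
```

The tangent-space vector can be fed to an ordinary Euclidean classifier, while the adjacency matrix is the kind of input a graph convolution expects.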
2. Deep Learning Architectures and Feature Learning
Multiple classes of architectures have advanced sEMG gesture recognition:
- CNNs and All-ConvNets: 2D/3D convolutional layers extract local spatial/temporal patterns. Recent networks use purely convolutional architectures, global pooling, and parameter pruning for efficiency, enabling state-of-the-art inter-session/inter-subject transfer performance at <0.5M parameters (Islam et al., 2023).
- Spatio-Temporal GCNs: STGCN-GR models hand muscle activation as functional graphs, alternating temporal convolutions (Conv1D + Gated Linear Units) with spatial graph convolutions using a k-NN adjacency (k=2 optimal on HD-sEMG) (Zhong et al., 2023). This directly encodes channel topology and boosts accuracy on tasks with vocabularies of more than 60 gestures.
- Geometric/Manifold Learning: TMKNet embeds multi-kernel features onto the SPD manifold, applies a manifold-specific nonlinearity (ReEig), and uses domain-specific batch normalization (parallel transport in tangent space) for session-invariant decoding (Dash et al., 20 Oct 2025). Similar Riemannian embeddings paired with SVM or minimum-distance-to-mean (MDM) classifiers reach ≥92% accuracy on multi-session datasets (Gowda et al., 2023).
- Transformers, Attention, and Wavelet Networks: Compact Transformer models leveraging learnable temporal embeddings (Time2Vec) and normalized additive space–time fusion achieve up to 95.7% F1-score on 10-class, two-channel sEMG (Hristov et al., 2 Feb 2026). Lightweight hybrid wavelet–Transformer models (WaveFormer) achieve 95% with only 3.1M parameters and 6.75 ms latency (Chen et al., 12 Jun 2025).
- Hybrid and Hierarchical Models: Multi-branch architectures combine TCNs, separable CNNs, BiLSTM, and channel attention for long-/short-term spatiotemporal feature extraction. This is critical for >90% decoding accuracy over variable-density, 52-class tasks across Ninapro DB2–DB5 (Shin et al., 4 Apr 2025).
- Sequential Modeling / Recurrent Units: SRU, GRU, and LSTM models allow efficient modeling of temporal dependencies, often with global pooling or temporal attention. Dilated bi-LSTM stacked encoders, combined with per-subject multiplicative embeddings, further enhance transferability and reduce calibration for large gesture sets (Azar et al., 2023, Sosin et al., 2020).
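To make the spatio-temporal graph idea from the list above concrete, the following NumPy sketch chains a gated temporal convolution with a graph convolution. It is a generic forward pass under assumptions, not the STGCN-GR implementation: the ring-shaped electrode graph, the symmetric adjacency normalization with self-loops, the layer sizes, and the random weights are all illustrative.

```python
import numpy as np

def normalized_adjacency(adj):
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2} with added self-loops."""
    a_tilde = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_tilde.sum(axis=1))
    return a_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gated_temporal_conv(x, weight, bias):
    """Valid 1D convolution over time followed by a gated linear unit (GLU).

    x: (T, N, C_in), weight: (K, C_in, 2 * C_out), bias: (2 * C_out,)
    returns: (T - K + 1, N, C_out)
    """
    k = weight.shape[0]
    t_out = x.shape[0] - k + 1
    out = np.zeros((t_out, x.shape[1], weight.shape[2])) + bias
    for tap in range(k):                        # accumulate one kernel tap at a time
        out += x[tap:tap + t_out] @ weight[tap]
    value, gate = np.split(out, 2, axis=-1)
    return value * (1.0 / (1.0 + np.exp(-gate)))  # GLU: value * sigmoid(gate)

def graph_conv(x, a_hat, weight):
    """Mix features across neighbouring electrodes: relu(A_hat X W) per time step."""
    mixed = np.einsum("ij,tjc->tic", a_hat, x)
    return np.maximum(mixed @ weight, 0.0)

rng = np.random.default_rng(0)
n_nodes, n_steps = 8, 64

# Hypothetical armband layout: each electrode linked to its two ring neighbours.
adj = np.zeros((n_nodes, n_nodes))
for i in range(n_nodes):
    adj[i, (i + 1) % n_nodes] = adj[(i + 1) % n_nodes, i] = 1.0
a_hat = normalized_adjacency(adj)

x = rng.standard_normal((n_steps, n_nodes, 1))  # one raw sEMG feature per node
w_time = 0.1 * rng.standard_normal((5, 1, 32))  # kernel size 5, 16 output channels after GLU
w_graph = 0.1 * rng.standard_normal((16, 32))

hidden = gated_temporal_conv(x, w_time, np.zeros(32))
output = graph_conv(hidden, a_hat, w_graph)
print(hidden.shape, output.shape)
```

In a trained model the weights would be learned and several such blocks stacked before pooling and classification; the sketch only shows how temporal and spatial mixing alternate.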
3. Domain Adaptation, Transfer Learning, and Robustness
Distribution shift across sessions, postures, and subjects presents a major challenge for sEMG gesture recognition. Major approaches include:
- Statistical and Deep Transfer: Freezing lower network layers (feature reuse), fine-tuning higher layers, and judiciously mixing source/target data allows All-ConvNet+TL to outperform much larger models under severe session and subject shift, especially with minimal new-target data (Islam et al., 2023).
- Unsupervised Domain-Adaptation: Domain-specific batch normalization on SPD manifolds (TMKNet), adversarial domain adaptation (gradient reversal layer in SRU/GRU frameworks), and pseudo-label-based source-free SNN adaptation (SpGesture) reduce need for target labels and enable unsupervised real-world adaptation (Dash et al., 20 Oct 2025, Sosin et al., 2020, Guo et al., 2024).
- Rapid Calibration Protocols: Fast fine-tuning on a few user-specific trials can raise the performance of a pre-trained Transformer/attention model from <25% (zero-shot) to >96.9% F1 using <10 s of new data (Hristov et al., 2 Feb 2026).
- Domain-Invariant Representations: Feature-aggregation strategies respecting topological, spectral, and physiological invariants (muscle groupings, SPD embeddings) significantly boost cross-session and cross-subject accuracy (Dash et al., 20 Oct 2025, Gowda et al., 2023).
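One simple way to see why per-domain statistics help is to re-center the covariance matrices of each session so that their mean becomes the identity. The sketch below is a simplified, label-free analogue of the domain-specific normalization idea above, not the TMKNet layer: it uses the Euclidean mean of the covariances instead of a Riemannian mean, and the two "sessions" are synthetic, with a hypothetical per-channel gain change standing in for electrode shift.

```python
import numpy as np

def window_covariances(windows):
    """windows: (n, samples, channels) -> covariances of shape (n, channels, channels)."""
    centered = windows - windows.mean(axis=1, keepdims=True)
    return np.einsum("ntc,ntd->ncd", centered, centered) / (windows.shape[1] - 1)

def inverse_sqrt(spd):
    """Inverse matrix square root of an SPD matrix via eigendecomposition."""
    eigvals, eigvecs = np.linalg.eigh(spd)
    return (eigvecs / np.sqrt(eigvals)) @ eigvecs.T

def recenter(covs):
    """Whiten all covariances of one session by that session's mean covariance."""
    whitener = inverse_sqrt(covs.mean(axis=0))
    return whitener @ covs @ whitener           # broadcasts over the first axis

rng = np.random.default_rng(0)
mixing = rng.standard_normal((4, 4))            # shared channel mixing (assumed)
gains = np.array([0.5, 1.0, 2.0, 1.5])          # hypothetical gain change in session 2

session_1 = rng.standard_normal((30, 200, 4)) @ mixing
session_2 = (rng.standard_normal((30, 200, 4)) @ mixing) * gains

covs_1 = recenter(window_covariances(session_1))
covs_2 = recenter(window_covariances(session_2))

# After re-centering, both sessions share the same reference point (the identity).
print(np.round(covs_1.mean(axis=0), 3))
print(np.round(covs_2.mean(axis=0), 3))
```

Because the whitening uses only unlabeled data from each session, it can be applied at test time without target labels; a classifier trained on re-centered source covariances then sees target inputs with a matching mean.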
Performance under domain shift:
| Model / Protocol | Inter-Session | Inter-Subject | Reference |
|---|---|---|---|
| All-ConvNet+TL (transfer, HD) | 94.91% | 94.94% | (Islam et al., 2023) |
| TMKNet (manifold + DA) | 70.9% (DB6) | ≤66% (LOSO) | (Dash et al., 20 Oct 2025) |
| SRU + ADA (recurrent + adv.) | −1.2/−1.0 RMSE | +1.2/+1.2 RMSE | (Sosin et al., 2020) |
| SpGesture (SNN + SFDA) | +4.10% abs. | ≥89.3% | (Guo et al., 2024) |
| L-EMGNet (cross-day, gesture-free) | 68.0% | 55.6% | (Li et al., 2024) |
4. Benchmark Datasets and Evaluation Protocols
Robust sEMG gesture recognition systems are validated on large, multi-session, multi-subject datasets featuring dozens to hundreds of classes, varied postures, and cross-day or cross-limb partitioning:
- High-density sEMG: CapgMyo-65 (128 channels, 65 gestures) enables spatially resolved deep graph modeling (Zhong et al., 2023, Azar et al., 2023).
- Low- to moderate-density: Ninapro DB2/DB4/DB5/DB6 (12–16 channels, 50–65 gestures), used for multi-session, inter-subject, and prosthetic-relevant benchmarks (Shin et al., 4 Apr 2025, Dash et al., 20 Oct 2025, Hristov et al., 2 Feb 2026, Gowda et al., 2023).
- Transfer/Adaptation Studies: Protocols include leave-one-session/subject-out, adaptation on a single trial, or testing on amputees using healthy pre-training (Islam et al., 2023, Qiao et al., 2024).
- Specialized sets: FORS-EMG (multi-orientation), putEMG (8 classes × 44 subjects, 24 channels, 2 sessions), and natural typing recognition (multi-hour, 32-channel) address position, posture, or high-throughput decoding (Rumman et al., 2024, Kaczmarek et al., 2019, Crouch et al., 2021).
Accuracy, F1-score, balanced accuracy, and confusion matrices are commonly reported, typically with k-fold cross-validation (e.g., 5- or 10-fold), stratified splits, and majority voting over windows to stabilize the performance estimate.
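The evaluation ingredients mentioned above (leave-one-subject-out splits, majority voting over windows, and metrics derived from a confusion matrix) can be sketched with the standard library alone. The helper names and the toy labels are illustrative assumptions; tie-breaking in the vote and the handling of empty classes are design choices that individual papers may make differently.

```python
from collections import Counter

def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices) for every subject."""
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test

def majority_vote(window_predictions):
    """Most frequent window label in a trial; ties go to the label seen first."""
    return Counter(window_predictions).most_common(1)[0][0]

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm

def summary_metrics(cm):
    """Accuracy, balanced accuracy (mean recall) and macro-averaged F1."""
    n = len(cm)
    total = sum(map(sum, cm))
    recalls, f1_scores = [], []
    for c in range(n):
        tp = cm[c][c]
        support = sum(cm[c])
        predicted = sum(cm[r][c] for r in range(n))
        recall = tp / support if support else 0.0
        precision = tp / predicted if predicted else 0.0
        denom = precision + recall
        recalls.append(recall)
        f1_scores.append(2 * precision * recall / denom if denom else 0.0)
    return {
        "accuracy": sum(cm[c][c] for c in range(n)) / total,
        "balanced_accuracy": sum(recalls) / n,
        "macro_f1": sum(f1_scores) / n,
    }

# Toy example: six trials, each with four window-level predictions.
trial_windows = [[0, 0, 1, 0], [1, 1, 1, 2], [2, 2, 0, 2],
                 [2, 1, 1, 1], [0, 0, 0, 0], [1, 2, 2, 2]]
trial_labels = [0, 1, 2, 1, 0, 1]
subject_ids = ["s1", "s1", "s2", "s2", "s3", "s3"]

trial_predictions = [majority_vote(w) for w in trial_windows]
metrics = summary_metrics(confusion_matrix(trial_labels, trial_predictions, 3))
folds = list(leave_one_subject_out(subject_ids))
print(trial_predictions, metrics, len(folds))
```

Reporting balanced accuracy or macro F1 alongside plain accuracy matters when rest or frequent gestures dominate the window count, since plain accuracy can then hide poor recall on rare classes.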
5. Model Efficiency, Real-Time Deployment, and Practical Constraints
Advances in model optimization have enabled deployment of high-accuracy sEMG gesture recognition on embedded and wearable hardware:
- Parameter and compute reduction: All-ConvNet/All-ConvNet+TL, Bioformers, and WaveFormer achieve competitive accuracy with 0.46–3.1M parameters and ≤10 ms inference, using all-conv or quantized small Transformer blocks (Islam et al., 2023, Chen et al., 12 Jun 2025, Burrello et al., 2022).
- Spiking neural networks (SNNs): Event-based SNNs with spiking attention (SpGesture) reduce latency and energy requirements by 5–10×, leveraging binary, sparse spike processing and specialized hardware (Guo et al., 2024).
- Latency and control loop: Efficient models (STGCN-GR, All-ConvNet, Bioformer) reach <300 ms per window, supporting responsive prosthetic or exoskeleton control (Zhong et al., 2023, Burrello et al., 2022).
- Edge deployment: INT8 quantization, memory pruning, and tailored microcontroller implementations (e.g., GAP8 PULP) enable deployments at <100 kB model size and <0.15 mJ per inference (Burrello et al., 2022, Chen et al., 12 Jun 2025).
- Real-world signals: Accuracy drops substantially with forearm orientation changes, major electrode shift, or across users without calibration. Methods combining spatially aware architectures, robust features, and domain adaptation provide practical mitigation (Rumman et al., 2024, Dash et al., 20 Oct 2025, Zhong et al., 2023).
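The INT8 quantization mentioned in the list above can be illustrated with a symmetric, per-tensor scheme. This is a minimal sketch of the general idea, assuming a single scale per weight tensor and no calibration data; deployed toolchains for microcontrollers typically add per-channel scales, activation quantization, and quantization-aware training, none of which are shown here.

```python
import numpy as np

def quantize_int8(weights):
    """Symmetric per-tensor quantization: float weights -> (int8 values, scale)."""
    max_abs = float(np.max(np.abs(weights)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 representation."""
    return q.astype(np.float32) * scale

# Random float32 matrix standing in for one layer of a small network.
rng = np.random.default_rng(0)
weights = rng.standard_normal((64, 32)).astype(np.float32)

q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

print("bytes before:", weights.nbytes, "bytes after:", q.nbytes)
print("max abs error:", float(np.max(np.abs(weights - restored))))
```

Storage drops by a factor of four relative to float32, and the rounding error per weight is bounded by half the scale, which is why small, well-conditioned networks often tolerate this step with little accuracy loss.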
6. Key Challenges, Limitations, and Future Directions
Despite significant gains, several fundamental challenges and open directions remain:
- Distribution shift: Systematic cross-day, cross-session, cross-orientation, and inter-user variability continue to limit generalization. Advanced unsupervised and source-free adaptation methods are under active investigation (Dash et al., 20 Oct 2025, Guo et al., 2024, Li et al., 2024).
- Scalability to large gesture vocabularies: While STGCN-GR and sequential decoders have increased the feasible gesture set to 52–65, most transfer/adaptation techniques are proven only up to 10–18 classes (Zhong et al., 2023, Azar et al., 2023, Dash et al., 20 Oct 2025).
- Physiological interpretability: SPD manifold learning and muscle-group-aware convolutions attempt to bring model representations closer to physiological ground truth, supporting more robust and explainable decision-making (Dash et al., 20 Oct 2025, Gowda et al., 2023).
- Minimal-label and few-shot learning: Protocols exploiting a handful of calibration trials, or none at all (transfer via metrics or parallel transport), are crucial for practical, user-friendly deployments (Hristov et al., 2 Feb 2026, Azar et al., 2023).
- Integration with multi-modal sensing: Fusing IMU, force sensors, or vision systems remains an underexplored route to resolving ambiguities and further increasing robustness, particularly for dynamic, context-aware gesture decoding (Li et al., 2024, Dash et al., 20 Oct 2025).
- Gesture-free and covert intention recognition: New research targets recognition of user intention without overt gestures, e.g., through isometric contraction and intention decoding under natural motion (Li et al., 2024).
Long-term, the field is progressing from isolated, static gesture sets toward real-world, continuous, user-adaptive, and real-time myoelectric interfaces with minimal burden for the end user.
7. Performance Comparison and Benchmark Summary
Selected recent works and performance on representative tasks/datasets:
| Model/Approach | Dataset/Class set | Accuracy / F1 | Scenario | Reference |
|---|---|---|---|---|
| STGCN-GR (spatio-temporal GCN) | CapgMyo-65, 65-class | 91.07% ± 4.13% | HD, 5-fold CV | (Zhong et al., 2023) |
| TMKNet (SPD, muscle-aware, DA) | Ninapro DB6, 7–10-class | 70.86% ± 13.32% | Inter-session | (Dash et al., 20 Oct 2025) |
| All-ConvNet+TL (transfer) | CapgMyo, 8–12-class | 94.91% | Inter-session/subject | (Islam et al., 2023) |
| LightGBM ensemble (optimized) | Ninapro DB7, 18-class | 90.28% | Continuous, transfer | (Qiao et al., 2024) |
| WaveFormer (wavelet+Transformer) | EPN612, 6-class | 95.0% (6.75 ms) | Real-time, INT8 | (Chen et al., 12 Jun 2025) |
| Bioformer (ultra-low-power Transformer) | Ninapro DB6, 8-class | 64.7% (<3 ms, 0.14 mJ) | Embedded MCU | (Burrello et al., 2022) |
| SpGesture (SNN+SFDA) | 10-class, postural | 89.26% (SSFA) | Cross-posture | (Guo et al., 2024) |
| Hierarchical multi-stream network | Ninapro DB2, 50-class | 96.41% | Complex temporal, HD | (Shin et al., 4 Apr 2025) |
| Attention-based feedforward | Ninapro DB5, 53-class | 87–91% | End-to-end, simple net | (Josephs et al., 2020) |
| TMA maps + compact CNN | 5-class, Myo, 8-ch | 94.08% | Real-time (5.5 ms) | (Silva et al., 2020) |
| LDA + SNTDF (FORS-EMG) | 12-class, multi-orient | 88.6% F1 | Cross-orientation | (Rumman et al., 2024) |
| 2D-CNN/L-EMGNet (gesture-free intention) | 6-class, intention | 91.1% (single-day) | No-gesture, L-EMGNet | (Li et al., 2024) |
These comparisons, while not exhaustive, illustrate the trajectory from classical pipelines built on hand-crafted features to robust, efficient, and adaptive deep spatiotemporal architectures tailored to sEMG gesture recognition.