TinyMyo: Compact EMG Foundation Model

Updated 4 July 2026

TinyMyo is a lightweight EMG foundation model that leverages a Transformer encoder pretrained using masked reconstruction on heterogeneous public EMG datasets.
It employs channel-independent patching and concatenated embeddings to retain electrode-specific information, facilitating accurate multi-task performance.
Designed for edge deployment, TinyMyo demonstrates state-of-the-art results in hand gesture, kinematic, and speech tasks on ultra-low-power microcontrollers.

TinyMyo most commonly denotes a tiny foundation model for surface electromyography (EMG): a lightweight Transformer encoder pretrained in a self-supervised manner on heterogeneous public EMG datasets and then adapted with minimal task-specific heads to multiple downstream problems, including hand gesture classification, hand kinematic regression, speech production, and speech recognition (Fasulo et al., 5 Dec 2025). Its stated purpose is to address poor generalization across subjects, recording systems, and acquisition protocols while remaining small enough for edge deployment, including deployment on the ultra-low-power GAP9 microcontroller. In the supplied literature, the same name also appears in a distinct biophysical context as a Python-based automated tracking workflow for myosin II filaments, so the term is not globally unambiguous (Mosby et al., 2020).

1. Scope, problem setting, and nomenclature

TinyMyo was introduced against a familiar EMG systems problem: surface EMG is non-invasive and useful in biomechanics, rehabilitation, prosthetic control, and human-machine interaction, but the signals vary because of noise, motion artifacts, cross-talk, inter-subject and intra-subject variability, and differences in electrode layouts, sampling rates, amplification, filtering, devices, and acquisition protocols (Fasulo et al., 5 Dec 2025). The work positions these factors as the main obstacle to transferable EMG modeling.

Within that framing, TinyMyo is defined as a single pretrained EMG backbone that can be reused across multiple downstream tasks. The paper explicitly contrasts this with conventional supervised models that are usually task-specific, and with prior EMG foundation models that tend to be large, limited to one downstream task, or not deployable on embedded platforms. The backbone is therefore intended to serve as a common representation layer rather than as a single-purpose decoder.

A recurrent misconception is to treat TinyMyo as simply a small gesture classifier. The paper does not support that reduction. The model is evaluated on classification, regression, speech production, and speech recognition, and the authors frame it as a flexible resource rather than a one-task architecture. A second source of confusion is terminological: in a separate paper, “TinyMyo” refers to a tracking routine for myosin II filaments rather than to EMG processing, which is a genuinely different use of the same label (Mosby et al., 2020).

2. Transformer architecture and self-supervised pretraining

TinyMyo is a Transformer encoder foundation model operating directly on time-domain EMG. The backbone has 8 pre-LayerNorm Transformer blocks, embedding dimension 192, 3 attention heads, and about 3.6M parameters. The pretraining decoder is deliberately tiny, at only 3.9k parameters, so that most representational burden remains in the encoder (Fasulo et al., 5 Dec 2025).
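The stated sizes can be sanity-checked with a rough parameter count. The sketch below assumes details the text does not give: an MLP hidden width of 4 times the embedding dimension, bias terms on every linear layer, and a decoder consisting of a single linear layer from the embedding back to a 20-sample patch. The function names are illustrative only.

```python
# Rough parameter count for a standard pre-LayerNorm Transformer encoder.
# Assumptions (not stated in the source): MLP hidden width of 4 * d, biases on
# every linear layer, and a single linear layer as the reconstruction decoder.

def linear_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of one dense layer."""
    return n_in * n_out + n_out

def encoder_params(depth: int = 8, d: int = 192, patch_len: int = 20,
                   mlp_ratio: int = 4) -> int:
    attn = linear_params(d, 3 * d) + linear_params(d, d)   # QKV + output projection
    mlp = linear_params(d, mlp_ratio * d) + linear_params(mlp_ratio * d, d)
    norms = 2 * (2 * d)                                    # two LayerNorms per block
    per_block = attn + mlp + norms
    patch_embed = linear_params(patch_len, d)              # shared patch projection
    final_norm = 2 * d
    return depth * per_block + patch_embed + final_norm

def decoder_params(d: int = 192, patch_len: int = 20) -> int:
    return linear_params(d, patch_len)

print(f"encoder ~ {encoder_params() / 1e6:.2f}M parameters")
print(f"decoder ~ {decoder_params() / 1e3:.2f}k parameters")
```

Under these assumptions the count comes to about 3.56M for the encoder and 3.86k for the decoder, which is in line with the reported 3.6M and 3.9k. The number of attention heads does not affect the total, and RoPE adds no learned parameters.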

The input representation uses a channel-independent patching strategy. For an input waveform

$\mathbf{X}\in\mathbb{R}^{T\times C},$

TinyMyo splits each channel separately into temporal patches,

$\mathbf{P}\in\mathbb{R}^{C\times N_p\times L}, \qquad N_p=\left\lfloor\frac{T-L}{S}\right\rfloor+1.$

During pretraining, the paper uses $T=1000$ samples, $L=S=20$, and $C=16$, which gives $N_p=50$ patches per channel and a total sequence length of $N=C\cdot N_p=800$. Each patch is projected by a shared linear map, and positional structure is handled with RoPE rather than learned positional embeddings. The authors explicitly justify the channel-independent tokenization by noting that each channel corresponds to a distinct anatomical electrode site, so inter-channel mixing is deferred to attention rather than imposed at tokenization time.
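The patching arithmetic can be made concrete with a short NumPy sketch. The function name `patchify` and the random input are illustrative; only the shapes and the patch-count formula come from the text.

```python
import numpy as np

def patchify(x: np.ndarray, patch_len: int = 20, stride: int = 20) -> np.ndarray:
    """Split a (T, C) waveform into per-channel patches of shape (C, N_p, L)."""
    T, C = x.shape
    n_patches = (T - patch_len) // stride + 1
    starts = np.arange(n_patches) * stride                  # first sample of each patch
    idx = starts[:, None] + np.arange(patch_len)[None, :]   # (N_p, L) sample indices
    return x.T[:, idx]                                      # (C, N_p, L)

x = np.random.default_rng(0).standard_normal((1000, 16))   # T=1000, C=16
patches = patchify(x)
tokens = patches.reshape(-1, patches.shape[-1])             # flatten to (C * N_p, L)
print(patches.shape, tokens.shape)
```

With the pretraining settings this yields 50 patches per channel and 800 tokens in total. No samples are mixed across channels at this stage.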

Pretraining uses masked reconstruction with a 50% mask ratio. Writing $\mathcal{M}$ for the set of masked patch indices and $\mathcal{V}$ for the set of visible ones, the losses are

$\mathcal{L}_{\mathrm{masked}} = \frac{1}{|\mathcal{M}|} \sum_{(c,i)\in\mathcal{M}} \mathrm{SmoothL1}\big(\mathbf{P}_{c,i},\hat{\mathbf{P}}_{c,i}\big),$

$\mathcal{L}_{\mathrm{visible}} = \frac{1}{|\mathcal{V}|} \sum_{(c,i)\in\mathcal{V}} \mathrm{SmoothL1}\big(\mathbf{P}_{c,i},\hat{\mathbf{P}}_{c,i}\big),$

and

$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{masked}} + 0.1\cdot \mathcal{L}_{\mathrm{visible}}.$

The downweighting of the visible-patch term is used to reduce trivial copying.
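A minimal NumPy sketch of this objective follows. It is not the authors' training code. The SmoothL1 transition point `beta=1.0`, the uniform random choice of masked patches, and the averaging over the samples within each patch are assumptions made for illustration.

```python
import numpy as np

def smooth_l1(pred, target, beta=1.0):
    """Element-wise SmoothL1 (Huber-style) loss; beta=1.0 is an assumed setting."""
    diff = np.abs(pred - target)
    return np.where(diff < beta, 0.5 * diff**2 / beta, diff - 0.5 * beta)

def reconstruction_loss(patches, recon, mask, visible_weight=0.1):
    """patches, recon: (C, N_p, L); mask: boolean (C, N_p), True = masked patch."""
    per_patch = smooth_l1(recon, patches).mean(axis=-1)   # average over the L samples
    loss_masked = per_patch[mask].mean()
    loss_visible = per_patch[~mask].mean()
    return loss_masked + visible_weight * loss_visible

rng = np.random.default_rng(0)
patches = rng.standard_normal((16, 50, 20))
recon = patches + 0.1 * rng.standard_normal(patches.shape)   # stand-in for model output

# Mask exactly 50% of the 800 patches, chosen uniformly at random (assumed strategy).
flat = np.zeros(16 * 50, dtype=bool)
flat[rng.choice(flat.size, size=flat.size // 2, replace=False)] = True
mask = flat.reshape(16, 50)

print(reconstruction_loss(patches, recon, mask))
```

Because the visible-patch term carries a weight of 0.1, an encoder that merely copies its visible inputs gains little; most of the gradient signal comes from patches it never saw.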

The pretraining corpus consists of NinaPro DB6, NinaPro DB7, and EMG2Pose. The summaries specify 10 subjects and 14 channels for DB6, 22 subjects and 12 channels for DB7, and 193 participants and 16 channels for EMG2Pose. Preprocessing uses a Butterworth band-pass filter $20{-}450$ Hz, a 50 Hz notch filter, and channel-wise min-max normalization. Windows span 1000 samples (500 ms) with 50% overlap, and recordings with fewer than 16 channels are zero-padded. No data augmentation is used. Optimization uses AdamW with batch size 512 and a learning-rate schedule that reaches its peak after 10 warmup epochs, for 50 epochs in total. Training was performed on CSCS Alps with NVIDIA GH200 GPUs in DDP mode (Fasulo et al., 5 Dec 2025).
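The windowing and padding steps can be sketched as follows. Filtering is omitted here. The target range of [-1, 1], the choice to normalize over the whole recording rather than per window, and padding after normalization are assumptions of this sketch. The name `make_windows` is hypothetical.

```python
import numpy as np

def make_windows(rec, win=1000, overlap=0.5, n_channels=16, lo=-1.0, hi=1.0):
    """rec: (T, C) filtered recording -> (n_windows, win, n_channels).

    The [lo, hi] target range and normalizing over the whole recording
    (rather than per window) are assumptions made for this sketch.
    """
    T, C = rec.shape
    # Channel-wise min-max normalization.
    cmin, cmax = rec.min(axis=0), rec.max(axis=0)
    span = np.where(cmax > cmin, cmax - cmin, 1.0)   # guard against flat channels
    norm = lo + (hi - lo) * (rec - cmin) / span
    # Zero-pad missing channels up to n_channels.
    padded = np.zeros((T, n_channels), dtype=norm.dtype)
    padded[:, :C] = norm
    # Slice overlapping windows.
    hop = int(win * (1.0 - overlap))
    starts = range(0, T - win + 1, hop)
    return np.stack([padded[s:s + win] for s in starts])

rec = np.random.default_rng(0).standard_normal((5000, 12))   # e.g. a 12-channel recording
windows = make_windows(rec)
print(windows.shape)
```

A 5000-sample, 12-channel recording yields nine half-overlapping windows of 1000 samples, each padded to 16 channels. At 1000 samples per 500 ms the implied sampling rate is 2 kHz.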

3. Downstream interface, benchmarks, and empirical results

After pretraining, the decoder is discarded and the encoder is reused with minimal task-specific heads. Channel-wise embeddings are fused by concatenation rather than averaging and are then temporally average-pooled before the task head. The paper reports that concatenation preserved electrode-specific information and empirically outperformed mean fusion (Fasulo et al., 5 Dec 2025).
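The difference between the two fusion strategies is easiest to see in terms of shapes. The sketch below assumes the encoder output can be arranged as a (channels, patches, embedding) array. That layout and the function names are illustrative rather than taken from the paper.

```python
import numpy as np

def fuse_concat(tokens):
    """tokens: (C, N_p, d) -> (N_p, C * d), keeping each electrode's features separate."""
    C, n_patches, d = tokens.shape
    return tokens.transpose(1, 0, 2).reshape(n_patches, C * d)

def fuse_mean(tokens):
    """tokens: (C, N_p, d) -> (N_p, d), averaging away electrode identity."""
    return tokens.mean(axis=0)

tokens = np.random.default_rng(0).standard_normal((16, 50, 192))  # stand-in encoder output
pooled_concat = fuse_concat(tokens).mean(axis=0)   # temporal average pooling -> (3072,)
pooled_mean = fuse_mean(tokens).mean(axis=0)       # -> (192,)
print(pooled_concat.shape, pooled_mean.shape)
```

Concatenation gives the task head a 3072-dimensional input in which every electrode occupies its own 192-dimensional slice, whereas mean fusion collapses all electrodes into one 192-dimensional vector.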

The downstream evaluation spans four task families and several datasets acquired with different sensing locations and hardware platforms.

Task family	Dataset(s)	Reported metric(s)
Hand gesture classification	NinaPro DB5, EPN-612, UCI-EMG	One score per dataset
Hand kinematic regression	NinaPro DB8	MAE, RMSE, and one further metric
Speech production	Gaddy Silent Speech Dataset	WER
Speech recognition	Gaddy Silent Speech Dataset	WER

For hand gesture classification, the headline results are reported as state of the art compared to previous FM-based work on NinaPro DB5, UCI-EMG, and EPN-612. On NinaPro DB5, TinyMyo is evaluated with 200 ms windows in both FS and FT settings, and the FT result is marked as the best in the table. On EPN-612, FS and FT results are reported for 1000 ms windows. On UCI-EMG, also evaluated with 1000 ms windows in FS and FT settings, the paper notes that the task is simple enough that pretraining offers little extra benefit.

The paper also introduces the Generic Neuromotor Interface as a new downstream benchmark for EMG foundation models. There, CLER is reported for both TinyMyo FS and TinyMyo FT, against original LSTM baselines of 0.1596 for windowed inference and 0.1819 for full sequence. The authors attribute the weaker FT result to a mismatch between TinyMyo’s bidirectional pretraining and the benchmark’s causal/windowed inference constraint.

For hand kinematic regression on NinaPro DB8, the best reported result is TinyMyo FT with 1000 ms windows, reported in terms of MAE, RMSE, and one further metric. The paper explicitly notes that comparisons with subject-specific prior work such as TEMPONet TCN or event-based linear regression are not like-for-like, because TinyMyo is evaluated in an across-subject setting.

For speech production, the model is used in a three-stage pipeline: EMG-to-MFCC transduction, HiFi-GAN vocoding, and ASR-based transcription for evaluation. The transduction model uses 3 residual convolutional blocks, downsamples from 1600 to 200 samples, and predicts 26 MFCCs. On the Gaddy Silent Speech Dataset, WER is reported for both the FS and FT configurations. The transduction component has about 4.5M parameters, compared with 54M in the Gaddy baseline, a reduction in model size of roughly 91.7%.

For speech recognition, the same dataset is used with a CTC-based head, a 4-gram language model on LibriSpeech, and beam search decoding. The result is reported as WER for TinyMyo FT. The paper presents this as a smaller, more deployable EMG-only alternative to larger or multimodal systems such as MONA and MONA LISA (Fasulo et al., 5 Dec 2025).
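Both speech tasks are scored by word error rate (WER). For readers unfamiliar with the metric, the sketch below computes it in the usual way, as word-level edit distance divided by the number of reference words. It is a generic implementation, not the paper's evaluation script, and any text normalization applied before scoring is outside its scope.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))           # distances against an empty reference
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(prev[j] + 1,              # deletion
                          curr[j - 1] + 1,          # insertion
                          prev[j - 1] + (r != h))   # substitution or match
        prev = curr
    return prev[-1] / len(ref)

print(word_error_rate("the quick brown fox", "the quick brown dog"))  # 0.25
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.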

4. Edge deployment and systems profile

A central claim of TinyMyo is deployability at the edge. The paper reports, to the best of its knowledge, the first deployment of an EMG foundation model on an ultra-low-power microcontroller, specifically GreenWaves GAP9 (Fasulo et al., 5 Dec 2025).

The deployment target has external L3 HyperRAM, 1.5 MB on-chip L2, and 128 kB L1. Because attention has $\mathcal{O}(N^2)$ scaling in sequence length and the deployed gesture model uses sequence length 800, the implementation relies on a hierarchical streaming toolchain: weights and activations reside in L3, slabs are streamed into L2 on demand, tiles are moved from L2 to L1 for computation, and double buffering is used to hide transfer latency. The summaries also note offline liveness analysis, static memory arena allocation, and integer-only implementations of softmax, LayerNorm, and GELU.
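A back-of-the-envelope footprint estimate illustrates why streaming and tiling are needed. The sketch assumes one byte per INT8 element, binary units for the memory sizes, and, for the attention scores, that the full matrix for all heads would be materialized at once, which a streaming implementation avoids.

```python
# Back-of-the-envelope INT8 footprints versus the GAP9 on-chip memory levels.
# Assumes one byte per element and binary units (KiB/MiB) for the memory sizes.
N, D, HEADS = 800, 192, 3
L1_BYTES = 128 * 1024
L2_BYTES = int(1.5 * 1024 * 1024)

activation_bytes = N * D        # one (N, d) activation tensor
scores_bytes = HEADS * N * N    # attention scores for all heads, if fully materialized

print(f"activations: {activation_bytes / 1024:.0f} KiB (L1 holds {L1_BYTES / 1024:.0f} KiB)")
print(f"attention scores: {scores_bytes / 1024**2:.2f} MiB (L2 holds {L2_BYTES / 1024**2:.2f} MiB)")
```

Under these assumptions a single activation tensor (150 KiB) already exceeds L1, and the full score matrix (about 1.83 MiB) exceeds L2, so neither can be held on-chip without tiling.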

Quantization is INT8 for weights, activations, MHSA, GELU, and LayerNorm. Only LayerNorm scale/shift remain FP32, and the final classification layer accumulates to FP32 logits for comparison. The per-block computation profile is dominated by attention and MLP projections: Q/K/V projections: 88M MACs, QK scores: 123M MACs, AV context: 123M MACs, output projection: 29M MACs, FC1: 118M MACs, and FC2: 118M MACs.
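The per-block figures follow from the model dimensions. The sketch below reproduces them from sequence length 800 and embedding dimension 192, under the assumption of an MLP hidden width of 4 times the embedding dimension, which the text does not state but which matches the FC1 and FC2 numbers.

```python
N, D, MLP_RATIO, DEPTH = 800, 192, 4, 8   # MLP_RATIO = 4 is an assumption

macs = {
    "qkv_proj": 3 * N * D * D,
    "qk_scores": N * N * D,       # summed over heads: heads * N * N * (D / heads)
    "av_context": N * N * D,
    "out_proj": N * D * D,
    "fc1": N * D * (MLP_RATIO * D),
    "fc2": N * (MLP_RATIO * D) * D,
}
for name, value in macs.items():
    print(f"{name:>10}: {value / 1e6:6.1f}M MACs")

per_block = sum(macs.values())
print(f"per block: {per_block / 1e6:.1f}M, all {DEPTH} blocks: {DEPTH * per_block / 1e9:.2f}G")
```

This gives roughly 600M MACs per block and about 4.8G MACs across the eight blocks, with the two attention matrix products accounting for about 41% of the total at this sequence length.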

The reported deployment result on GAP9 at 370 MHz using all 9 cores is ~12.2 s inference time, 0.44 J energy, and 36.45 mW average power envelope. These numbers establish deployability, but the paper also explicitly notes a limitation: inference is still too slow for seamless real-time use. That caveat is important. The work demonstrates feasibility under severe power constraints, not completion of the latency problem.
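The three reported figures are mutually consistent, as a one-line check shows. The sketch simply multiplies the reported average power by the reported latency; the variable names are illustrative.

```python
power_w = 36.45e-3    # reported average power, in watts
latency_s = 12.2      # reported inference time, in seconds
energy_j = power_w * latency_s
print(f"energy per inference ~ {energy_j:.2f} J")
```

The product is about 0.44 J, matching the reported energy per inference.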

5. Relation to compact wearable EMG literature

TinyMyo belongs to a longer line of efforts to make EMG inference compact, accurate, and deployable. Earlier work on “Compact Deep Neural Networks for Computationally Efficient Gesture Classification From Electromyography Signals” proposed a compact SqueezeNet-inspired CNN with a customized Temporal Fire Module for 15-class hand gesture recognition from Myo and Delsys recordings. On the Myo Armband, that model reported 84.2 ± 6% accuracy versus 70.5 ± 7% for an SVM, with only 5,889 parameters and 7.89 ms inference on Jetson TX2 (Hartwell et al., 2018). In historical terms, that paper established that deep EMG models could be much smaller than prior alternatives while remaining practical for embedded deployment.

A parallel line emphasized direct wearable control rather than representation learning. “Teleoperated Robotic Arm Movement Using EMG Signal With Wearable MYO Armband” used the wireless Myo gesture armband on the forearm, with 8 sEMG sensors, 200 Hz sampling, time-domain features MAV, WL, RMS, AR, ZC, and SSC, and a command pipeline for a 5-DoF robotic arm. The reported offline accuracies were SVM: 96.57%, LDA: 96.01%, and KNN: 92.67%, with optimized online-mode performance of SVM: 95.27%, LDA: 94.53%, and KNN: 89.43% (Hassan et al., 2018). This work showed that Myo-class hardware could support stable real-time teleoperation with classical pattern-recognition methods.

More recent work shifts attention from compactness alone to session drift and long-term usability. “Lightweight Test-Time Adaptation for EMG-Based Gesture Recognition” uses a compact Temporal Convolutional Network with about 47k parameters and reports that a baseline model achieves about 85.12% intra-session accuracy but only 56.61% inter-session accuracy on NinaPro DB6. Three lightweight adaptation strategies then improve inter-session performance: causal adaptive BatchNorm to about 68.93%, GMM + alignment to about 69.78%, and meta-learning to about 80.30% (Touko et al., 7 Jan 2026). A plausible implication is that TinyMyo’s pretrained backbone and lightweight test-time adaptation are complementary rather than competing ideas: one addresses transferable representation learning, the other addresses deployment-time non-stationarity.

Against this background, TinyMyo can be situated as a shift from task-specific or session-specific decoders toward a shared, pretrained, edge-aware EMG backbone. Its novelty is not merely smaller size, although 3.6M parameters remains modest by foundation-model standards; it is the combination of multi-task reuse, cross-device transfer, and demonstrated MCU deployment (Fasulo et al., 5 Dec 2025).

6. Distinct biophysical use of the name

In a separate literature, TinyMyo refers not to EMG modeling but to a Python-based automated single-particle tracking workflow for myosin II filaments in actin networks tethered to supported lipid bilayers (Mosby et al., 2020). This workflow was demonstrated on interferometric scattering microscopy (iSCAT) videos and was designed to track both position and orientation, and to classify motion into diffusive and processive segments.

The pipeline has three stages: detection of filament-like objects, linking detections into tracks, and classifying track segments by displacement correlations and angular motion. Detection uses SEP (derived from SExtractor) and accepts a candidate if its area exceeds 40 pixels and each pixel intensity is at least $1.5\sigma$ above local background. Two area cutoffs are then applied: a minimum reliable area and a maximum single-filament cutoff. Tracks are linked with a centroid-displacement constraint, an area-consistency rule, and a minimum duration of at least two frames.

Its motion-state classifier uses a 2 s dwell-time criterion and displacement-correlation thresholds derived from the measured average diffusivity.

Processive points are accepted only in regions of at least five consecutive processive points over about 1 s. Orientation handling extends the angle domain beyond SEP’s $[-\pi/2, \pi/2]$ output so that angular diffusion can be treated continuously. The paper reports high-throughput analysis with 15,647 filaments in one condition and 14,016 filaments in another, and states that fewer than 4% of tracks ended within 6 pixels of the image boundary.
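Two of these rules lend themselves to a short illustration: the minimum run length for processive points and the continuous treatment of orientation. The sketch below is one plausible reading, not the published code. The function names are hypothetical, and unwrapping with a period of $\pi$ is an assumption about how a bounded axial angle might be extended.

```python
import numpy as np

def keep_long_runs(flags, min_len=5):
    """Keep True values only where they form runs of at least min_len points."""
    flags = np.asarray(flags, dtype=bool)
    out = np.zeros_like(flags)
    start = None
    for i, f in enumerate(np.append(flags, False)):   # sentinel closes a trailing run
        if f and start is None:
            start = i
        elif not f and start is not None:
            if i - start >= min_len:
                out[start:i] = True
            start = None
    return out

def unwrap_orientation(theta, period=np.pi):
    """Remove period-sized jumps so an axial angle evolves continuously."""
    theta = np.asarray(theta, dtype=float)
    if theta.size == 0:
        return theta
    steps = np.diff(theta)
    steps = (steps + period / 2) % period - period / 2   # wrap each step into [-p/2, p/2)
    return np.concatenate([theta[:1], theta[0] + np.cumsum(steps)])

flags = [1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1]
print(keep_long_runs(flags).astype(int))

true_angle = np.arange(0.0, 3.0, 0.1)                      # steadily rotating filament
wrapped = (true_angle + np.pi / 2) % np.pi - np.pi / 2     # folded into a bounded range
print(np.allclose(unwrap_orientation(wrapped), true_angle))
```

In the example, the initial run of three points is discarded while the runs of five and six points are kept, and the unwrapped angle recovers the steadily increasing orientation. The unwrapping is only reliable when the true rotation between frames stays below half the period.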

This biophysical TinyMyo is therefore unrelated to the EMG foundation model except in the broad methodological sense that both are compact, automated, and intended for high-throughput or edge-constrained analysis. For disambiguation, the context is decisive: in EMG and wearable computing, TinyMyo denotes the Transformer-based foundation model; in acto-myosin microscopy, it denotes the myosin II filament tracking workflow.