CSST: Cross-Subject Self-Training

Updated 4 July 2026

Cross-Subject Self-Training (CSST) is a self-adaptation strategy that uses pseudo-labels to bridge distribution gaps between source and target subjects.
It integrates techniques like active learning, feature alignment, and contrastive learning to manage label scarcity and subject variability.
Reported results show near upper-bound accuracy in HAR with only a small fraction of labelled target data, and state-of-the-art accuracy and information transfer rate in SSVEP without target labels.

Cross-Subject Self-Training (CSST) denotes a class of cross-subject adaptation procedures in which a model trained on source subjects, or inherited from a previous iteration, generates pseudo-labels on a target subject and then uses those pseudo-labels to improve target-domain performance. In recent arXiv usage, the term appears in at least two distinct but related forms: as the self-training module inside ActiveSelfHAR for cross-subject human activity recognition (HAR), and as the name of a full cross-subject domain adaptation framework for steady-state visually evoked potential (SSVEP) classification. In both cases, CSST addresses subject-dependent distribution shift and the scarcity or cost of target-domain annotation, but the concrete mechanisms differ substantially (Wei et al., 2023, Wang et al., 29 Jan 2026).

1. Problem setting and terminological scope

The common problem addressed by CSST is cross-subject transfer: a model is trained using data from one set of subjects and then adapted to a new subject whose data distribution differs from the source distribution. In the HAR setting, this is described as the “cross-subject issue when adapting to new users,” which hinders real-world deployment despite strong laboratory performance (Wei et al., 2023). In the SSVEP setting, the source domain is defined as

$\mathcal D_S = \{(x_i^s, y_i^s)\}_{i=1}^{n_s},$

with labeled trials $x_i^s\in\mathbb{R}^{N_C\times N_P}$ and labels $y_i^s\in\{1,\dots,M\}$, while the target domain is

$\mathcal D_T = \{x_j^t\}_{j=1}^{n_t},$

with unlabeled trials $x_j^t\in\mathbb{R}^{N_C\times N_P}$, under the assumption that $P_S(x)\neq P_T(x)$. The learning objective is to minimize the target error

$\mathcal{E}_T(f)=\mathbb{E}_{x\sim P_T}\bigl[\mathbf{1}\{f(x)\neq y^\star\}\bigr]$

using labeled source data and unlabeled target data, where $y^\star$ denotes the true label of $x$ (Wang et al., 29 Jan 2026).

Within this scope, CSST should not be treated as a single canonical algorithm. In ActiveSelfHAR, it is a module interleaved with active learning and neighbor-based augmentation. In SSVEP classification, it is the central framework, comprising Pre-Training with Adversarial Learning (PTAL), Dual-Ensemble Self-Training (DEST), and Time-Frequency Augmented Contrastive Learning (TFA-CL). This suggests that “CSST” functions more as a methodological label for self-training-centered cross-subject adaptation than as a uniquely standardized architecture.

2. CSST in ActiveSelfHAR for cross-subject HAR

In ActiveSelfHAR, the CSST module begins from a teacher network with logits $f_\theta(x)\in\mathbb{R}^K$ and class probabilities

$p_\theta(y=k\mid x_i)=\frac{\exp\bigl(f_\theta(x_i)_k\bigr)}{\sum_{l=1}^K\exp\bigl(f_\theta(x_i)_l\bigr)}.$

The confidence score is defined as

$\mathrm{conf}(x_i)=\max_{1\le k\le K} p_\theta(y=k\mid x_i).$

Given a confidence threshold $\tau$, which is set separately for EMG and IMU data and raised by a fixed increment at each iteration, a target-domain sample is included in the self-training set if $\mathrm{conf}(x_i)>\tau$, with pseudo-label

$\hat y_i=\arg\max_{1\le k\le K} p_\theta(y=k\mid x_i).$

At iteration $t$, with unlabeled target pool $\mathcal U_t$ and model parameters $\theta_t$, the self-training set $\mathcal S_t$ is formed from those $x_i\in\mathcal U_t$ whose maximum class probability under $\theta_t$ exceeds the iteration-specific threshold $\tau_t$ (Wei et al., 2023).
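The confidence-thresholded selection described above can be illustrated with a minimal NumPy sketch. The function names, the toy logits, and the threshold value `tau=0.9` are illustrative assumptions and are not taken from the ActiveSelfHAR paper.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def select_pseudo_labels(logits, tau):
    """Return indices, pseudo-labels and confidences of samples with conf > tau."""
    probs = softmax(logits)
    conf = probs.max(axis=1)        # conf(x_i) = max_k p(y=k | x_i)
    labels = probs.argmax(axis=1)   # pseudo-label = most probable class
    keep = np.flatnonzero(conf > tau)
    return keep, labels[keep], conf[keep]

# Toy logits for four target windows and three classes (illustrative values).
logits = np.array([[4.0, 0.0, 0.0],
                   [0.1, 0.0, 0.2],
                   [0.0, 5.0, 1.0],
                   [1.0, 1.0, 1.2]])
keep, labels, conf = select_pseudo_labels(logits, tau=0.9)
```

Here only the first and third windows are confident enough to enter the self-training set; the two near-uniform rows stay in the unlabeled pool. Raising `tau` between iterations, as the text describes, would make admission progressively stricter.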

The method then extracts penultimate-layer feature vectors $z_i$ and reduces them via PCA to 3D. For each class $k$, it defines a class center $c_k$ as the sample in the pseudo-labeled set whose feature is closest to the classwise mean feature among samples with pseudo-label $k$. The collection of these centers forms the center set $\mathcal C$. Writing $\mathcal U_t$ for the unlabeled target pool and $\mathcal S_t$ for the pseudo-labeled set, the remaining unlabeled pool is

$\mathcal R_t=\mathcal U_t\setminus\mathcal S_t.$

For each sample in $\mathcal R_t$, distances to the two nearest class centers are computed, and informativeness is scored using those distances. Samples are grouped according to which pair of centers they lie between, which gives at most $K(K-1)/2$ boundary groups. From each boundary group, the top-$m$ most informative points are selected for true-label querying.
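A rough sketch of the center construction and boundary-group querying follows. It assumes the features have already been reduced (for example by PCA), that at least two classes have pseudo-labeled samples, and that informativeness is the gap between the distances to the nearest and second-nearest centers, with a smaller gap meaning a more ambiguous sample. That scoring rule and all names (`class_centers`, `boundary_queries`, `m`) are assumptions for illustration; the text only states that the score is based on those two distances.

```python
import numpy as np

def class_centers(feats, pseudo_labels):
    """For each class, pick the pseudo-labeled sample closest to the class mean."""
    centers = {}
    for k in np.unique(pseudo_labels):
        members = feats[pseudo_labels == k]
        mean = members.mean(axis=0)
        nearest = np.linalg.norm(members - mean, axis=1).argmin()
        centers[int(k)] = members[nearest]
    return centers

def boundary_queries(pool_feats, centers, m):
    """Group pool samples by their two nearest centers; return top-m per group."""
    classes = sorted(centers)
    center_mat = np.stack([centers[k] for k in classes])
    # Pairwise distances, shape (n_pool, n_classes).
    dists = np.linalg.norm(pool_feats[:, None, :] - center_mat[None, :, :], axis=2)
    order = np.argsort(dists, axis=1)
    rows = np.arange(len(pool_feats))
    d1 = dists[rows, order[:, 0]]   # distance to nearest center
    d2 = dists[rows, order[:, 1]]   # distance to second-nearest center
    score = d2 - d1                 # ASSUMPTION: small gap = close to the boundary
    groups = {}
    for i in rows:
        pair = tuple(sorted((classes[order[i, 0]], classes[order[i, 1]])))
        groups.setdefault(pair, []).append(int(i))
    # Most informative first, i.e. smallest gap first.
    return {pair: sorted(idx, key=lambda i: score[i])[:m]
            for pair, idx in groups.items()}
```

The returned dictionary maps each pair of classes to the pool indices that would be sent for true-label querying, so the labeling budget is spread across class boundaries rather than concentrated on one confusing pair.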

A queried point $x_q$ is then used to recruit spatio-temporal neighbors. A neighbor score is computed for each candidate $x_j$ relative to $x_q$ from the window timestamps, using a small time window $\Delta t$. All $x_j$ whose score satisfies the neighbor criterion inherit the label of $x_q$. The resulting augmented core set is the union of queried samples and these neighbors.

The full algorithm alternates pseudo-labeling, center computation, active querying, core-set augmentation, and fine-tuning. “Update” is defined as fine-tuning only the fully-connected layers while freezing shared CNN layers. The student model for the next iteration is trained with the task loss on the pseudo-labeled and core-set samples.

3. CSST in SSVEP classification: FBEA, PTAL, DEST, and TFA-CL

In the SSVEP formulation, CSST is a two-stage cross-subject domain adaptation framework built on self-training. It is preceded by Filter-Bank Euclidean Alignment (FBEA), which exploits SSVEP frequency information. Each trial is decomposed into $N_B$ sub-bands, giving

$X_i\in\mathbb{R}^{N_B\times N_C\times N_P}.$

After reshaping to $\tilde X_i\in\mathbb{R}^{(N_B N_C)\times N_P}$, the covariance is

$R_i=\tilde X_i\tilde X_i^{\top}.$

The reference covariance over the $n$ trials being aligned is

$\bar R=\frac{1}{n}\sum_{i=1}^{n}R_i,$

and alignment is performed by

$\hat X_i=\bar R^{-1/2}\tilde X_i.$

The stated purpose is to reduce inter-subject distributional shift while preserving cross-band correlations (Wang et al., 29 Jan 2026).
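The alignment step can be sketched in NumPy under the assumption that it follows the standard Euclidean Alignment recipe: average the per-trial covariances and whiten every trial with the inverse square root of that average. The input is assumed to be already filter-bank decomposed, and the function names and array layout are hypothetical.

```python
import numpy as np

def euclidean_align(trials):
    """Whiten trials of shape (n, channels, samples) by the inverse square
    root of their mean covariance."""
    covs = np.einsum('ncp,ndp->ncd', trials, trials)   # X_i X_i^T per trial
    ref = covs.mean(axis=0)                            # reference covariance
    evals, evecs = np.linalg.eigh(ref)                 # ref is symmetric
    inv_sqrt = (evecs / np.sqrt(evals)) @ evecs.T      # ref^(-1/2)
    return np.einsum('cd,ndp->ncp', inv_sqrt, trials)

def fbea(trials_fb):
    """trials_fb has shape (n, n_bands, n_channels, n_samples).
    Bands are stacked along the channel axis before aligning, so the
    covariance also captures cross-band correlations."""
    n, nb, nc, npts = trials_fb.shape
    stacked = trials_fb.reshape(n, nb * nc, npts)
    return euclidean_align(stacked).reshape(n, nb, nc, npts)
```

After alignment the mean covariance of the stacked trials is the identity matrix, which is what makes trials from different subjects more comparable. The sketch requires the reference covariance to be positive definite, which in practice needs more time samples than stacked channels.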

The first stage, PTAL, uses a feature extractor $F$, classifier $C$, and domain discriminator $D$, with a Gradient-Reversal Layer between $F$ and $D$. The supervised source loss $\mathcal L_{\mathrm{cls}}$ is the classification loss of $C\circ F$ on the labeled source trials. The adversarial loss $\mathcal L_{\mathrm{adv}}$ is the domain-discrimination loss of $D$ on source and target features; it is minimized with respect to $D$ and maximized with respect to $F$. The overall pre-training objective combines $\mathcal L_{\mathrm{cls}}$ and $\mathcal L_{\mathrm{adv}}$.

The second stage, DEST, instantiates two copies of the pre-trained network: a student with parameters $\theta_s$, updated by gradient descent, and a teacher with parameters $\theta_t$, updated by exponential moving average with momentum $\alpha$,

$\theta_t\leftarrow\alpha\,\theta_t+(1-\alpha)\,\theta_s.$
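The teacher update can be illustrated with a short sketch, assuming the usual exponential-moving-average form in which the teacher keeps a fraction `momentum` of its old weights and takes the rest from the student. Parameters are represented as a plain dictionary of NumPy arrays, which is a simplification for illustration rather than the framework's actual data structure.

```python
import numpy as np

def ema_update(teacher, student, momentum):
    """Return new teacher parameters:
    momentum * teacher + (1 - momentum) * student, per parameter tensor."""
    return {name: momentum * teacher[name] + (1.0 - momentum) * student[name]
            for name in teacher}

# Illustrative values only: a single weight vector per model.
teacher = {"w": np.ones(2)}
student = {"w": np.zeros(2)}
teacher = ema_update(teacher, student, momentum=0.9)
```

With a momentum close to 1 the teacher changes slowly, so its predictions are a smoothed average over recent student states and tend to be less sensitive to individual noisy pseudo-labels.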

For each target trial $x^t$, three views are formed: the original $x^t$ and two augmentations of it. A projection head maps the features of each view to an embedding. A one-hot predicted label is computed for each view, and the three predictions are fused with cosine-similarity weights computed from the embeddings. Only pseudo-labels whose top fused score exceeds the confidence threshold $\tau$ are retained. The target self-training loss is then computed on the retained pseudo-labeled target trials.

TFA-CL augments each pseudo-labeled target sample along the temporal axis, using jitter and cropping, and along the frequency axis, using additive noise in sub-bands. For a batch of $B$ augmented embeddings $\{z_i\}_{i=1}^{B}$ with pseudo-labels $\hat y_i$, and with temperature $\kappa$, the set of positives for anchor $i$ is

$P(i)=\{p\neq i:\ \hat y_p=\hat y_i\}.$

The supervised contrastive loss for anchor $i$ is

$\ell_i=-\frac{1}{|P(i)|}\sum_{p\in P(i)}\log\frac{\exp(z_i^{\top}z_p/\kappa)}{\sum_{a\neq i}\exp(z_i^{\top}z_a/\kappa)}.$

The total self-training objective combines the target self-training loss with this contrastive loss.
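A supervised contrastive loss over pseudo-labeled embeddings can be sketched as below. The sketch assumes the widely used SupCon form (positives are other batch members with the same pseudo-label, the denominator runs over all other batch members) and L2-normalized embeddings; whether the SSVEP framework uses exactly this variant is an assumption, and `supcon_loss` and `temperature` are illustrative names.

```python
import numpy as np

def supcon_loss(z, labels, temperature):
    """Mean supervised contrastive loss over anchors that have at least one
    positive. z: (B, d) embeddings, labels: (B,) pseudo-labels."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit-norm embeddings
    sim = z @ z.T / temperature
    not_self = ~np.eye(len(z), dtype=bool)
    # Log of the denominator: log-sum-exp over all other samples in the batch.
    masked = np.where(not_self, sim, -np.inf)
    row_max = masked.max(axis=1, keepdims=True)
    log_denom = row_max + np.log(np.exp(masked - row_max).sum(axis=1, keepdims=True))
    log_prob = sim - log_denom
    # Positives: same pseudo-label, excluding the anchor itself.
    pos = (labels[:, None] == labels[None, :]) & not_self
    n_pos = pos.sum(axis=1)
    valid = n_pos > 0
    pos_log_prob = np.where(pos, log_prob, 0.0).sum(axis=1)
    per_anchor = -pos_log_prob[valid] / n_pos[valid]
    return per_anchor.mean()
```

The loss is small when embeddings sharing a pseudo-label are close together and far from the rest of the batch, which is why noisy pseudo-labels can hurt: they pull embeddings of different true classes together.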

4. Shared principles and major divergences

Both instantiations of CSST are organized around the same central operation: pseudo-labeling of unlabeled target-domain data under a confidence criterion. In ActiveSelfHAR, pseudo-labels are produced by the model trained in the previous iteration or the source domain, and samples are admitted when their confidence score exceeds the iteration-specific threshold (Wei et al., 2023). In the SSVEP framework, pseudo-labels are produced from three views and accepted only when the fused prediction exceeds the confidence threshold (Wang et al., 29 Jan 2026). In both cases, the method assumes that high-confidence predictions are sufficiently reliable to seed further target adaptation.

The principal divergence lies in how each method treats the unlabeled remainder and the role of supervision. ActiveSelfHAR is explicitly hybrid: it combines self-training with active learning, queries true labels for ambiguous target samples, and propagates these labels through spatio-temporal grouping. The SSVEP CSST framework is instead built around source-supervised pre-training, unsupervised target pseudo-labeling, teacher-student refinement, and contrastive regularization. It does not include a human-in-the-loop querying stage.

The feature-space machinery also differs. ActiveSelfHAR constructs class centers from pseudo-labeled target features, uses nearest and second-nearest centers to identify boundary regions, and enlarges queried sets through local spatio-temporal structure. The SSVEP framework performs covariance-based alignment before training, adversarial domain confusion during pre-training, multi-view pseudo-label fusion in DEST, and supervised contrastive learning on pseudo-labeled target embeddings. This suggests that CSST is best understood as a self-training core that can be embedded in substantially different adaptation pipelines.

A common misconception would be to interpret CSST as necessarily label-free. The HAR variant contradicts that interpretation because it sparsely acquires actual labels through active learning. A second misconception would be to assume that CSST implies a fixed architectural recipe. The two arXiv usages show that the label encompasses at least one module-level design and one end-to-end framework.

5. Empirical behavior across HAR and SSVEP

The HAR study evaluates on DSADS, PAMAP-2, and an in-house EMG dataset. DSADS contains 8 subjects, 12 daily/sports activities, and 100 Hz IMUs. PAMAP-2 contains 7 subjects, 5 activities, and 100 Hz IMUs. The EMG dataset contains 10 subjects, 5 locomotion classes plus 4 gait-phase classes, and 1,111 Hz EMG. Reported metrics are precision, recall, accuracy on the held-out subject, percent of target samples actually labeled, and total adaptation time (Wei et al., 2023).

The SSVEP study evaluates on Benchmark and BETA. Benchmark contains 35 subjects, 64-channel EEG, 40-class SSVEP at 8–15.8 Hz, and 6 blocks at 5 s. BETA contains 70 subjects, the same 40 classes, and 4 blocks at 2 s or 3 s. Preprocessing selects 9 occipital channels, applies a fixed latency offset, varies the window length from 0.4 to 1 s, and decomposes signals into filter-bank sub-bands. The protocol is Leave-One-Subject-Out, with batch size 64, 500 epochs of PTAL and 500 epochs of DEST, and the Adam optimizer with weight decay. The remaining hyperparameters are the learning rate, the pseudo-label threshold, the EMA momentum, and two contrastive hyperparameters. The reported metrics are accuracy and information transfer rate (ITR), with

$\mathrm{ITR}=\frac{60}{T}\Bigl[\log_2 M+P\log_2 P+(1-P)\log_2\frac{1-P}{M-1}\Bigr],$

where $P$ is the classification accuracy, $M$ the number of classes, and $T$ the time per selection in seconds, so that ITR is expressed in bits per minute.

These settings establish that the two CSST lines are empirically evaluated under very different signal modalities and operational criteria (Wang et al., 29 Jan 2026).
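For reference, the ITR metric can be computed with a few lines of Python. The sketch assumes the standard definition used in SSVEP speller evaluations (bits per selection scaled to bits per minute); the function name and the choice of what counts toward `selection_time_s`, such as any gaze-shift interval, are assumptions that depend on the evaluation protocol.

```python
import math

def itr_bits_per_min(accuracy, n_classes, selection_time_s):
    """Information transfer rate in bits per minute.
    accuracy: fraction in [0, 1]; n_classes: number of targets (>= 2);
    selection_time_s: seconds needed for one selection."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must lie in [0, 1]")
    bits = math.log2(n_classes)
    if 0.0 < accuracy < 1.0:
        bits += accuracy * math.log2(accuracy)
        bits += (1.0 - accuracy) * math.log2((1.0 - accuracy) / (n_classes - 1))
    elif accuracy == 0.0:
        # Limit of the formula as accuracy -> 0 (the P*log2(P) term vanishes).
        bits += math.log2(1.0 / (n_classes - 1))
    return 60.0 / selection_time_s * bits
```

Two properties make the metric easy to sanity-check: perfect accuracy yields `log2(n_classes)` bits per selection, and chance-level accuracy `1 / n_classes` yields zero. Because the selection time sits in the denominator, shorter signal windows raise ITR only as long as accuracy does not fall too far.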

Setting	Reported quantities	Interpretation
DSADS, fully supervised fine-tuning	Accuracy, fraction of target data labeled	Upper bound
DSADS, ActiveSelfHAR (3 iters)	Accuracy, fraction labeled, adaptation time (min)	Near upper bound
PAMAP-2, fine-tuning	Accuracy, fraction labeled	Reference
PAMAP-2, ActiveSelfHAR	Accuracy, fraction labeled, adaptation time (min)	Slightly above fine-tuning
EMG locomotion/phase, fine-tuning	Accuracy, fraction labeled	Reference
EMG locomotion/phase, ActiveSelfHAR	Accuracy, fraction labeled, adaptation time (min)	Near reference
Benchmark, fixed signal length	CSST ITR vs SFDA ITR	Higher ITR
BETA, fixed signal length	CSST ITR vs SFDA ITR	Higher ITR

For HAR, the main empirical claim is that the method reaches accuracies similar to the upper bound, defined as fully supervised fine-tuning, with only a small fraction of the target data labeled, and that it improves data efficiency and time cost. It also outperforms purely unsupervised UDA (MCD), pure self-training (SelfHAR), and pure active learning (AL-HAR) in the accuracy-label trade-off, while keeping total adaptation time on the order of 1–15 minutes (Wei et al., 2023).

For SSVEP, the principal comparative result is state-of-the-art performance across varying signal lengths on Benchmark and BETA. The ablation reported for Benchmark at 1 s is especially informative: it starts from baseline self-training and then evaluates the addition of PTAL, DEST, FBEA, and FBEA+TFA-CL. The results do not support a simplistic assumption that every additional component is individually monotonic in effect; rather, they indicate that the contribution of components is interaction-dependent (Wang et al., 29 Jan 2026).

6. Limitations, interpretive cautions, and prospective directions

The SSVEP framework states several limitations directly. It relies on sufficiently strong pseudo-labels, and extremely low target SNR may still degrade performance. It requires additional hyperparameter tuning for different hardware or new paradigms. Extending the approach to online continuous adaptation and real-time BCI deployment remains future work. The computational profile is also explicit: two-stage training with adversarial min-max optimization, teacher-student updates, and contrastive pairs adds approximately 20–30% overhead, although inference remains a single forward pass of the adapted network (Wang et al., 29 Jan 2026).

The HAR study emphasizes a different operational point: the method is intended to enable user-independent HAR in smart healthcare systems and wireless body sensor networks by combining pseudo-label bootstrapping, sparse querying of ambiguous regions, and spatio-temporal propagation of true labels. Its reported adaptation times remain within minutes, which is central to the claim of practical data efficiency (Wei et al., 2023).

Taken together, these works indicate that CSST is not reducible to pseudo-label recycling alone. In one line of work, it is strengthened by active querying and structured neighbor propagation; in the other, by alignment, adversarial pre-training, dual-ensemble refinement, and contrastive learning. A plausible implication is that the viability of CSST depends less on the generic use of pseudo-labels than on the mechanisms used to control pseudo-label noise under cross-subject shift.

References

Wei et al. (2023). ActiveSelfHAR: Incorporating Self Training into Active Learning to Improve Cross-Subject Human Activity Recognition.

Wang et al. (2026). Rethinking Self-Training Based Cross-Subject Domain Adaptation for SSVEP Classification.
