Frequency Pretraining (FPT) Methods

Updated 4 June 2026

Frequency pretraining (FPT) is a self-supervised method that uses pairwise relative shifts to capture long-range dependencies and relational geometry.
It has been effectively applied in EEG representation learning, joint alignment, and balanced clustering, often outperforming traditional techniques.
Its methodology integrates transformer-based encoding, spectral initialization, and projected power iterations to achieve efficient global structure recovery.

Frequency pretraining (FPT), also known as PAirwise Relative Shift (PARS) pretraining in the self-supervised learning literature, encompasses a family of methods that use relative shifts between temporal or categorical elements as explicit training signals. FPT methods are designed to make models internalize global structure, such as long-range dependencies in time series or pairwise relational geometry in discrete assignment tasks, by predicting, regressing, or optimizing over relative shifts between pairs of elements. This paradigm appears in state-of-the-art EEG pretraining for time series representation, in joint alignment problems (especially in combinatorial settings), and as a mechanism for regularizing pairwise similarities in balanced clustering.

1. Foundations and Objective Formulations

Frequency pretraining approaches learn by predicting relative shifts between elements in a sequence or graph. In self-supervised EEG representation learning, FPT tasks require the encoder to infer the normalized temporal difference between pairs of masked windows, which imposes a direct inductive bias toward global temporal structure. Formally, for an input sequence $X \in \mathbb{R}^T$, $N$ non-overlapping windows $x_i \in \mathbb{R}^M$ are sampled with associated timestamps $t_i$, and a subset of windows is "position-masked," i.e., their positional information is replaced by a learned mask token. The core target $\theta_{jk} = \frac{t_j - t_k}{T_s} \in [-1,1]$ is then regressed solely from the masked window content, where $T_s$ denotes the sequence duration (Sandino et al., 14 Nov 2025).
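The target definition above can be made concrete with a short sketch. The clip duration, window length, and number of windows used here are illustrative assumptions, and the function name `pairwise_shift_targets` is hypothetical rather than taken from the cited work.

```python
import numpy as np

def pairwise_shift_targets(timestamps, duration):
    """Return theta[j, k] = (t_j - t_k) / T_s for every pair of windows."""
    t = np.asarray(timestamps, dtype=float)
    return (t[:, None] - t[None, :]) / duration

rng = np.random.default_rng(0)
T_s = 30.0   # sequence duration in seconds (assumed)
M = 0.5      # window length in seconds (hypothetical value)
N = 8        # number of sampled windows (hypothetical value)

# Non-overlapping windows: pick distinct slots on a grid of width M.
slots = rng.choice(int(T_s / M), size=N, replace=False)
t = np.sort(slots) * M                    # window start times t_i
theta = pairwise_shift_targets(t, T_s)    # regression targets
```

The resulting matrix is antisymmetric with a zero diagonal, and every entry lies in $[-1, 1]$ because no two timestamps differ by more than $T_s$.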

In joint alignment and assignment problems on graphs, FPT-like methods appear as the recovery of $n$ discrete variables $x_i \in \{1,\dots,m\}$ from noisy observations $y_{ij}$ of the pairwise modulo-$m$ differences $(x_i - x_j) \bmod m$ over an observation set $\Omega$ (Chen et al., 2016). Here, the maximum likelihood objective is expressed over all candidate discrete labelings as $\max_{z \in \{1,\dots,m\}^n} \sum_{(i,j) \in \Omega} \ell(z_i, z_j; y_{ij})$, where $\ell(z_i, z_j; y_{ij})$ quantifies the log-likelihood of the observed shift.
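As a rough illustration of this setup, the sketch below generates modulo-$m$ difference observations and scores a candidate labeling. It assumes a random corruption model in which each observation is replaced by a uniform draw with some probability; under that assumption the log-likelihood is an increasing affine function of the number of consistent pairs, so counting agreements stands in for the objective. Labels run over $0,\dots,m-1$ for convenience, and both function names are hypothetical.

```python
import numpy as np

def observe_differences(x, m, pairs, corruption, rng):
    """y_ij = (x_i - x_j) mod m, replaced by a uniform draw with prob. `corruption`."""
    y = {}
    for i, j in pairs:
        if rng.random() < corruption:
            y[(i, j)] = int(rng.integers(m))
        else:
            y[(i, j)] = int((x[i] - x[j]) % m)
    return y

def agreement_score(z, y, m):
    """Number of observed pairs whose difference is consistent with labeling z."""
    return sum(int((z[i] - z[j]) % m == y_ij) for (i, j), y_ij in y.items())

rng = np.random.default_rng(1)
n, m = 6, 5
x = rng.integers(m, size=n)                              # ground-truth labels
pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
y = observe_differences(x, m, pairs, corruption=0.0, rng=rng)
score_truth = agreement_score(x, y, m)
score_shifted = agreement_score((x + 3) % m, y, m)       # globally shifted labeling
```

Because only differences are observed, a globally shifted labeling scores exactly as well as the truth, so recovery is only meaningful up to a global offset.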

In balanced clustering tasks, FPT manifests concretely in "shifting" pairwise similarity matrices adaptively to achieve regularization and symmetry, with the shift $N$ 3 determined by node-level similarity statistics (Chehreghani, 2021). This results in a new shifted matrix $N$ 4 satisfying $N$ 5 for each $N$ 6, and reframes the clustering objective as a function of these pairwise shifts.

2. Mathematical Architecture and Algorithmic Components

In EEG pretraining, the FPT workflow consists of patch tokenization, masked positional embedding perturbation, transformer-based encoding of patches, generation of all masked patch pairs, and regression of normalized temporal shifts via a cross-attention decoder. Specifically, the encoder outputs of the masked patches are concatenated into pairwise feature vectors, projected, and combined via cross-attention to produce shift predictions $\hat{\theta}_{jk}$. The loss is the mean squared error $\mathcal{L} = \frac{1}{|\mathcal{P}|} \sum_{(j,k) \in \mathcal{P}} (\hat{\theta}_{jk} - \theta_{jk})^2$ over the set $\mathcal{P}$ of masked pairs (Sandino et al., 14 Nov 2025).
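The pairing and loss steps can be sketched in a few lines of NumPy. This is only a schematic: random vectors stand in for the transformer encoder outputs, and an untrained linear readout replaces the cross-attention decoder described above, so only the pair construction and the mean squared error are representative. All names and sizes are hypothetical.

```python
import numpy as np

def pairwise_features(z):
    """Concatenate encoder outputs for every ordered pair (j, k) with j != k."""
    n = z.shape[0]
    j, k = np.where(~np.eye(n, dtype=bool))
    return np.concatenate([z[j], z[k]], axis=1), j, k

def mse_shift_loss(pred, theta, j, k):
    """Mean squared error between predicted and true normalized shifts."""
    return float(np.mean((pred - theta[j, k]) ** 2))

rng = np.random.default_rng(0)
n, d = 5, 16
z = rng.normal(size=(n, d))                  # stand-in for encoder outputs
t = np.sort(rng.uniform(0.0, 30.0, size=n))  # window timestamps in a 30 s clip
theta = (t[:, None] - t[None, :]) / 30.0     # regression targets

feats, j, k = pairwise_features(z)
w = 0.01 * rng.normal(size=2 * d)            # untrained linear readout (not the real decoder)
pred = np.tanh(feats @ w)                    # keep predictions inside [-1, 1]
loss = mse_shift_loss(pred, theta, j, k)
```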

In modulo-shifted alignment, the Projected Power Method (PPM) involves block-lifting the discrete labels into high-dimensional indicators, initializing using a best-rank-$m$ SVD-based spectral approximation, and refining via projected power iterations $z^{(t+1)} = \mathcal{P}_{\Delta^n}(\mu_t L z^{(t)})$, where $L$ is the block matrix of pairwise log-likelihoods, $\Delta$ is the simplex (the projection $\mathcal{P}_{\Delta^n}$ acts on each of the $n$ blocks), and $\mu_t > 0$ is a scaling factor. Under mild conditions the iterates converge to the unique global optimum (Chen et al., 2016).
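A simplified sketch of this procedure follows. Two assumptions are worth stating loudly: the block matrix uses 0/1 consistency indicators as a stand-in for log-likelihoods (reasonable under a random corruption model), and the scaled simplex projection is replaced by per-block argmax rounding, which corresponds to the limit of a large scaling factor. It is meant to show the lift, spectral initialization, and iteration structure, not to reproduce the published algorithm in detail.

```python
import numpy as np

def build_L(n, m, y):
    """Block matrix: block (i, j) marks label pairs (a, b) with (a - b) mod m == y_ij."""
    L = np.zeros((n * m, n * m))
    a = np.arange(m)
    for (i, j), d in y.items():
        block = ((a[:, None] - a[None, :]) % m == d).astype(float)
        L[i * m:(i + 1) * m, j * m:(j + 1) * m] = block
        L[j * m:(j + 1) * m, i * m:(i + 1) * m] = block.T
    return L

def round_blocks(v, n, m):
    """Replace each length-m block by the indicator of its largest entry."""
    labels = v.reshape(n, m).argmax(axis=1)
    z = np.zeros((n, m))
    z[np.arange(n), labels] = 1.0
    return z.ravel(), labels

def projected_power(L, n, m, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    # Spectral initialization: rank-m approximation of the symmetric matrix L.
    vals, vecs = np.linalg.eigh(L)
    top = np.argsort(np.abs(vals))[-m:]
    L_hat = (vecs[:, top] * vals[top]) @ vecs[:, top].T
    z, labels = round_blocks(L_hat[:, rng.integers(n * m)], n, m)
    # Power iterations with hard rounding in place of the simplex projection.
    for _ in range(iters):
        z_new, labels = round_blocks(L @ z, n, m)
        if np.array_equal(z_new, z):
            break
        z = z_new
    return labels

rng = np.random.default_rng(0)
n, m = 12, 4
x = rng.integers(m, size=n)                  # ground truth, labels in 0..m-1
y = {}
for i in range(n):
    for j in range(i + 1, n):
        d = int((x[i] - x[j]) % m)
        if rng.random() < 0.1:               # 10% random corruption (illustrative)
            d = int(rng.integers(m))
        y[(i, j)] = d

labels = projected_power(build_L(n, m, y), n, m)
offsets = (labels - x) % m   # a single repeated value means recovery up to a global shift
```

In the noise-free, fully observed case the lifted matrix has exactly $m$ dominant eigenvectors, one per global shift of the true labeling, which is why a single column of the rank-$m$ approximation already rounds to a valid solution.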

For clustering, the local search algorithm proceeds by greedy reassignment of cluster labels to individual data points, using incremental updates on the shifted similarity matrix to evaluate the improvement in the cost function, iterating until a local minimum is achieved (Chehreghani, 2021).
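The greedy reassignment loop can be sketched as below. The shift used here (subtracting the mean off-diagonal similarity) is an illustrative choice and not necessarily the adaptive shift of Chehreghani (2021). For clarity, the sketch recomputes the point-to-cluster similarity sums from scratch instead of maintaining them incrementally. Function names and the `init` parameter are hypothetical.

```python
import numpy as np

def shift_similarities(S):
    """Illustrative shift: subtract the mean off-diagonal similarity, zero the diagonal."""
    n = S.shape[0]
    off = ~np.eye(n, dtype=bool)
    shifted = S - S[off].mean()
    np.fill_diagonal(shifted, 0.0)
    return shifted

def local_search(S_shift, k, init=None, seed=0, max_sweeps=100):
    """Greedily move single points to the cluster with the largest shifted similarity."""
    rng = np.random.default_rng(seed)
    n = S_shift.shape[0]
    labels = rng.integers(k, size=n) if init is None else np.array(init, dtype=int)
    for _ in range(max_sweeps):
        moved = False
        for i in range(n):
            # Sum of shifted similarities between point i and each cluster.
            gain = np.array([S_shift[i, labels == c].sum() for c in range(k)])
            best = int(gain.argmax())
            if gain[best] > gain[labels[i]] + 1e-12:   # strict improvement only
                labels[i] = best
                moved = True
        if not moved:        # no single move improves the cost: local optimum
            break
    return labels

# Two groups of three points: similarity 1 within a group, 0 across groups.
S = np.kron(np.eye(2), np.ones((3, 3)))
labels = local_search(shift_similarities(S), k=2, init=[0, 1, 0, 1, 0, 1])
```

Each accepted move strictly increases the total within-cluster shifted similarity (equivalently, decreases the cut), so the loop cannot cycle. Poor starting points can still end in weak local optima, for example all points in one cluster, which is why restarts from several initializations are a sensible default.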

3. Applications in Representation Learning, Alignment, and Clustering

FPT’s principal application in the time series domain has been EEG self-supervised representation learning, where it outperforms masked autoencoders and masked position prediction (e.g., MP3) by encouraging the encoding of long-range temporal dependencies. For instance, in sleep stage decoding with minimal labeled data (only 10 subjects), PARS-pretrained models achieved the highest Cohen’s κ among the compared methods, and matched or surpassed competing paradigms (MAE, MP3, DropPos, random initialization) in balanced accuracy and κ across larger datasets and transfer settings (Sandino et al., 14 Nov 2025).

In combinatorial settings, PPM-based frequency pretraining has been applied to discrete joint alignment problems such as 3D shape alignment and multi-view graph matching. Empirically, it exhibits phase transition behavior matching the theoretical recovery thresholds, and it delivers substantial computational speedups over SDP-based relaxations, e.g., aligning 50 ShapeNet chairs in 2.4 seconds (versus 900 seconds for SDP) and reducing graph matching errors from 13% to ∼3% (Chen et al., 2016).

In clustering, the adaptively shifted min-cut objective driven by pairwise frequency pretraining wins or ties for cluster agreement metrics (Adjusted Mutual Information, Rand Index, V-measure) on 75–80% of UCI datasets versus standard and advanced baselines, and is robust to negative similarities without thresholding. This unifies balanced cut heuristics with correlation clustering in a data-driven, parameter-free manner (Chehreghani, 2021).

4. Theoretical Guarantees and Convergence Analysis

FPT regimes exhibit strong theoretical performance guarantees. In joint alignment:

Exact recovery is achieved if $x_i \in \mathbb{R}^M$ 6 (KL-separation) in random corruption models, with geometric contraction of misclassification rate per iteration, and $x_i \in \mathbb{R}^M$ 7 algorithmic complexity to reach global optimum. Below this threshold, no algorithm (even in principle) can recover true assignments (Chen et al., 2016).
Computational complexity per iteration is $x_i \in \mathbb{R}^M$ 8 due to FFT-based multiplication, and spectral initialization is dominated by $x_i \in \mathbb{R}^M$ 9 per block power step.
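One way an FFT can enter the per-iteration cost is sketched below, under the assumption that each block of the lifted matrix depends only on the label difference $(a - b) \bmod m$, i.e., that it is circulant. In that case a block-vector product is a circular convolution and can be computed in $O(m \log m)$ instead of $O(m^2)$. The helper names are hypothetical.

```python
import numpy as np

def circulant_matvec_direct(c, z):
    """Multiply by the circulant matrix C[a, b] = c[(a - b) mod m] explicitly."""
    m = len(c)
    a = np.arange(m)
    C = c[(a[:, None] - a[None, :]) % m]
    return C @ z

def circulant_matvec_fft(c, z):
    """Same product via the convolution theorem, using O(m log m) operations."""
    return np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(z)))

rng = np.random.default_rng(0)
m = 16
c = rng.normal(size=m)   # stand-in for log-likelihoods indexed by the shift value
z = rng.random(size=m)   # stand-in for one block of the current iterate
direct = circulant_matvec_direct(c, z)
fast = circulant_matvec_fft(c, z)
```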

For pairwise shift-based clustering, the local search optimization achieves an $O(1/t)$ decrease in duality gap, comparable to Frank–Wolfe with exact line search, whereas generic nonconvex methods provide only $O(1/\sqrt{t})$. Convergence to a local optimum is deterministic due to monotonic improvement at each step (Chehreghani, 2021).

5. Protocols, Hyperparameters, and Empirical Results

In EEG pretraining, the following protocol is instantiated:

Window (patch) size: $t_i$ 2 s, $t_i$ 3 windows sampled uniformly per 30s clip.
Position-mask ratio: $0.8$ (32 masked / 8 unmasked per clip).
AdamW optimizer with learning rate $t_i$ 5, weight decay $t_i$ 6, batch size 512.
Training for 1000 epochs with linear warm-up (100 epochs) and cosine annealing.
Data augmentations: random crop of the 30s window, per-epoch random single-channel selection; at fine-tuning, channel dropout and restoring standard sinusoidal positional encoding (Sandino et al., 14 Nov 2025).
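The schedule in the protocol above (100 epochs of linear warm-up followed by cosine annealing over 1000 epochs in total) can be written as a small function. The peak learning rate is left as a parameter, and the value used in the example call is purely hypothetical. Annealing down to `min_lr = 0` is likewise an assumption.

```python
import math

def learning_rate(epoch, peak_lr, warmup_epochs=100, total_epochs=1000, min_lr=0.0):
    """Linear warm-up to peak_lr, then cosine annealing down to min_lr."""
    if epoch < warmup_epochs:
        return peak_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Hypothetical peak value, chosen only to make the example concrete.
schedule = [learning_rate(e, peak_lr=1e-3) for e in range(1000)]
```

In a PyTorch training loop the same shape is often obtained by chaining a linear warm-up scheduler with a cosine annealing scheduler. The closed form above is simply easier to inspect.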

Empirical comparison across four EEG tasks (EESM17, TUAB, TUSZ, PhysioNet-MI) confirms that PARS is optimal or near-optimal in balanced accuracy and Cohen’s $\kappa$ in three of four tasks, with pronounced strength in the low-supervision regime.

Task	PARS Balanced Accuracy	Best Baseline
EESM17 (5-class)	$t_i$ 8	$t_i$ 9 (MAE)
TUAB (2-class)	$\theta_{jk} = \frac{t_j - t_k}{T_s} \in [-1,1]$ 0	$\theta_{jk} = \frac{t_j - t_k}{T_s} \in [-1,1]$ 1 (MAE)
TUSZ (2-class)	$\theta_{jk} = \frac{t_j - t_k}{T_s} \in [-1,1]$ 2	$\theta_{jk} = \frac{t_j - t_k}{T_s} \in [-1,1]$ 3 (MAE)
PhysioNet-MI (4-class)	$\theta_{jk} = \frac{t_j - t_k}{T_s} \in [-1,1]$ 4	$\theta_{jk} = \frac{t_j - t_k}{T_s} \in [-1,1]$ 5 (MAE) (Sandino et al., 14 Nov 2025)
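For reference, the two metrics reported in these comparisons can be computed from a confusion matrix as sketched below. The toy labels are made up for illustration and are unrelated to the reported results. The sketch assumes every class occurs at least once among the true labels.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """cm[a, b] counts samples with true class a and predicted class b."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (np.asarray(y_true), np.asarray(y_pred)), 1)
    return cm

def balanced_accuracy(cm):
    """Mean per-class recall; assumes each class appears in the true labels."""
    recall = np.diag(cm) / cm.sum(axis=1)
    return float(recall.mean())

def cohen_kappa(cm):
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e)."""
    total = cm.sum()
    p_o = np.trace(cm) / total
    p_e = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / total ** 2
    return float((p_o - p_e) / (1.0 - p_e))

y_true = [0, 0, 0, 0, 1, 1]   # toy labels
y_pred = [0, 0, 0, 1, 1, 1]
cm = confusion_matrix(y_true, y_pred, n_classes=2)
ba = balanced_accuracy(cm)    # (3/4 + 2/2) / 2 = 0.875
kappa = cohen_kappa(cm)       # (5/6 - 1/2) / (1 - 1/2) = 2/3
```

Balanced accuracy averages recall over classes, so it is not inflated by a dominant class, while κ discounts the agreement expected from the label marginals alone. Both properties matter for imbalanced tasks such as sleep staging.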

6. Limitations, Assumptions, and Implementation Notes

FPT methods demand careful handling of symmetry and negative similarities. Input similarity matrices must be symmetric; shifting may yield negative values, which the cost function accommodates directly without post-processing (Chehreghani, 2021). Finding a global optimum is computationally infeasible in general (the clustering problem is NP-hard), but local minima can be reached efficiently, and restarts can mitigate sensitivity to initialization.

A plausible implication is that FPT’s global structure prediction bias offers natural robustness to unbalanced or incomplete data. However, some FPT applications are limited by the requirement of predefined window or patch size and position masking ratios, and may not generalize directly to all sequence modalities without adaptation or validation of shift statistics.

7. Synthesis and Relationships to Broader Methodologies

Frequency pretraining synthesizes two paradigms: explicit temporal or categorical relation prediction (as in self-supervised signal representation, joint alignment, and clustering-by-shift) and spectral, block-wise architectural lifting (enabling global recovery or balanced partitions). In clustering, it unifies min-cut, ratio-cut, and correlation clustering under a regularization-by-shift framework. In alignment tasks, spectral methods and projected power iterations enable provable and efficient global recovery in high-noise regimes. In sequence pretraining, regression to pairwise relative shifts enforces representations capturing high-level compositionality.

These advances illustrate FPT’s centrality in problems where the relational geometry, whether in time, label, or similarity space, is both the problem and the supervision. The methodology’s information-theoretic sharpness and empirical efficacy across self-supervised learning, combinatorial optimization, and clustering make it a robust tool for endowing models with an intrinsic notion of global structure (Sandino et al., 14 Nov 2025, Chen et al., 2016, Chehreghani, 2021).