WirelessJEPA: Self-Supervised Wireless Model
- WirelessJEPA is a self-supervised wireless foundation model that leverages masked latent prediction to learn general-purpose spatio-temporal representations from raw multi-antenna IQ data.
- It introduces a structured 2D antenna–time representation with novel masking strategies to create inductive biases for diverse wireless tasks.
- Empirical evaluations demonstrate that WirelessJEPA outperforms contrastive and reconstruction-based paradigms in tasks such as modulation classification and angle-of-arrival estimation.
WirelessJEPA is a self-supervised wireless foundation model based on the Joint Embedding Predictive Architecture (JEPA), designed to learn general-purpose spatio-temporal representations from raw multi-antenna in-phase/quadrature (IQ) data. Unlike prior contrastive or reconstruction-based paradigms, WirelessJEPA leverages masked latent prediction to construct transferable embeddings, enabling robust performance across diverse downstream wireless tasks. The model introduces a structured 2D antenna–time input representation, novel mask geometries for inductive bias, and joint encoder–predictor training, forming the basis for a new class of adaptable wireless foundation models (Chu et al., 28 Jan 2026, Chaaya et al., 2024).
1. Core Principles and Architectural Overview
WirelessJEPA adapts the JEPA framework to the wireless context by focusing on latent feature prediction rather than reconstruction of raw IQ samples or contrastive instance discrimination. The central aim is to infer the latent codes of masked spatial–temporal regions of a structured 2D input, using only partial observations of the signal. Specifically, three neural components are used:
- A student encoder, $f_\theta$, processes masked 2D antenna–time inputs;
- A lightweight predictor, $g_\phi$, built from a stack of separable convolutions, predicts latent codes at the masked positions;
- A momentum teacher encoder, $f_{\bar\theta}$, with parameters $\bar\theta$ updated by an exponential moving average of $\theta$, provides stable target representations.
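The learning objective minimizes the L2 distance between the predicted and teacher latent codes over masked regions:
$$\mathcal{L} = \frac{1}{|\mathcal{M}|} \sum_{i \in \mathcal{M}} \lVert \hat{z}_i - z_i \rVert_2^2$$
where $\mathcal{M}$ indexes the masked locations, $\hat{z}_i$ are predicted latents, and $z_i$ are the target latents (Chu et al., 28 Jan 2026).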
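As a minimal sketch of the masked latent objective, the snippet below computes the mean squared L2 distance between predicted and teacher latents, restricted to masked grid positions. The array layout `(H, W, D)`, the function name `masked_latent_l2`, and the averaging over masked positions are illustrative assumptions, not details confirmed by the source.

```python
import numpy as np

def masked_latent_l2(pred, target, mask):
    """Mean squared L2 distance between latents at masked positions only.

    pred, target: arrays of shape (H, W, D), latent grids (assumed layout).
    mask: boolean array of shape (H, W), True where the input was occluded.
    """
    if not mask.any():
        raise ValueError("mask selects no positions")
    diff = pred[mask] - target[mask]              # (num_masked, D)
    return float(np.mean(np.sum(diff ** 2, axis=-1)))

rng = np.random.default_rng(0)
target = rng.normal(size=(4, 4, 8))               # stand-in teacher latents
pred = target.copy()
mask = np.zeros((4, 4), dtype=bool)
mask[1, :] = True                                 # occlude one row of the grid

pred[0] += 1.0                                    # error at an unmasked row
loss_unmasked_error = masked_latent_l2(pred, target, mask)

pred[1] += 0.5                                    # error at the masked row
loss_masked_error = masked_latent_l2(pred, target, mask)
```

Errors outside the mask leave the loss unchanged, so the gradient signal comes only from positions the student never observed.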
In the related domain of latent wireless dynamics modeling, a variant called Wireless-JEPA (“W-JEPA”) uses high-dimensional OFDM Channel State Information (CSI) as input. W-JEPA learns both a low-dimensional “pseudo-location” embedding via a multilayer perceptron (MLP) encoder and autoregressively models the temporal evolution of these embeddings conditioned on user velocities using a GRU (Chaaya et al., 2024). This architecture forms a self-supervised “world model” capable of simulating channel dynamics in latent space.
2. Antenna–Time Representation and Masking Strategies
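WirelessJEPA operates on raw IQ inputs $X \in \mathbb{R}^{2 \times N \times T}$, where the two channels hold the in-phase and quadrature components, $N$ is the number of antennas, and $T$ is the temporal window length. The antenna axis is up-sampled via nearest-neighbor interpolation to form a square grid $\tilde{X}$:
$$\tilde{X} = \mathrm{Upsample}_{\mathrm{NN}}(X) \in \mathbb{R}^{2 \times T \times T}$$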
This 2D grid structure enables the application of block-wise spatial–temporal mask geometries, allowing flexible selection of information to occlude.
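The nearest-neighbor up-sampling of the antenna axis can be sketched as follows. The function name and the example sizes (4 antennas, 8 time samples) are hypothetical and chosen only for illustration; the source does not specify the interpolation code.

```python
import numpy as np

def upsample_antenna_axis(x, size):
    """Nearest-neighbor up-sampling of the antenna axis.

    x: array of shape (2, N, T) holding I and Q channels for N antennas.
    Returns an array of shape (2, size, T).
    """
    n_ant = x.shape[1]
    # Each output row r copies source antenna floor(r * N / size).
    src = (np.arange(size) * n_ant) // size
    return x[:, src, :]

# Hypothetical sizes: N = 4 antennas, T = 8 time samples.
x = np.arange(2 * 4 * 8, dtype=np.float32).reshape(2, 4, 8)
grid = upsample_antenna_axis(x, size=x.shape[2])   # square (2, 8, 8) grid
```

Each antenna row is repeated rather than blended, so no new signal values are introduced by the resizing.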
Four primary masking strategies are utilized, each introducing different inductive biases:
- Random masks: Scattered non-overlapping blocks, focusing on general context inference.
- Antenna masks: Occlude entire antenna rows of blocks, emphasizing spatial inference across antennas.
- Time masks: Occlude entire temporal columns of blocks, stressing temporal continuity and prediction.
- Multi-block masks: Group blocks into spatio-temporal regions, balancing spatial and temporal bias.
A binary mask $M$ identifies occluded positions ($M_{ij} = 1$) for loss computation. These geometries directly determine which neighborhood relationships the encoder must infer (Chu et al., 28 Jan 2026).
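The four mask geometries can be illustrated with a small generator over a block-level grid. This is a sketch under assumptions: the grid size, block counts, region sizes, and function names are hypothetical, axis 0 is taken as the antenna axis and axis 1 as time, and multi-block regions are allowed to overlap for simplicity.

```python
import numpy as np

def antenna_mask(g, n_rows, rng):
    """Occlude whole antenna rows of a g x g block grid."""
    m = np.zeros((g, g), dtype=bool)
    m[rng.choice(g, size=n_rows, replace=False), :] = True
    return m

def time_mask(g, n_cols, rng):
    """Occlude whole temporal columns of a g x g block grid."""
    m = np.zeros((g, g), dtype=bool)
    m[:, rng.choice(g, size=n_cols, replace=False)] = True
    return m

def random_mask(g, n_blocks, rng):
    """Occlude scattered, non-overlapping individual blocks."""
    m = np.zeros(g * g, dtype=bool)
    m[rng.choice(g * g, size=n_blocks, replace=False)] = True
    return m.reshape(g, g)

def multi_block_mask(g, n_regions, height, width, rng):
    """Occlude a few rectangular spatio-temporal regions (may overlap)."""
    m = np.zeros((g, g), dtype=bool)
    for _ in range(n_regions):
        r = rng.integers(0, g - height + 1)
        c = rng.integers(0, g - width + 1)
        m[r:r + height, c:c + width] = True
    return m

rng = np.random.default_rng(0)
G = 8  # hypothetical number of blocks per axis
masks = {
    "antenna": antenna_mask(G, n_rows=2, rng=rng),
    "time": time_mask(G, n_cols=2, rng=rng),
    "random": random_mask(G, n_blocks=16, rng=rng),
    "multi_block": multi_block_mask(G, n_regions=3, height=3, width=3, rng=rng),
}
for name, m in masks.items():
    print(name, f"masked fraction = {m.mean():.2f}")
```

In each mask, `True` marks an occluded block, matching the convention that the loss is computed only where the mask is set.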
3. Model Backbones, Training Paradigm, and Loss
The encoder backbone is based on ShuffleNetV2-0.5, accepting the 2-channel input grid. Sparse convolutional operations are implemented by zeroing activations at masked positions after each layer, and inserting a learned mask token at those indices in the latent grid. The predictor consists of three depthwise separable convolution layers with BatchNorm and ReLU, outputting predicted latent codes only at masked grid positions.
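The masking mechanics described above (zeroing activations at masked positions after each layer, then placing a learned mask token in the latent grid) can be sketched with a toy layer. The 3x3 box filter stands in for a real convolution, and the sketch assumes layers that keep the spatial resolution; a strided backbone such as ShuffleNetV2 would also need the mask down-sampled at each stage. All names here are hypothetical.

```python
import numpy as np

def box_filter3(x):
    """Toy 3x3 averaging layer on a (C, H, W) array, zero padded."""
    p = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    h, w = x.shape[1:]
    return sum(p[:, i:i + h, j:j + w] for i in range(3) for j in range(3)) / 9.0

def masked_forward(x, layers, visible):
    """Apply layers, re-zeroing masked positions after each one.

    Without the re-zeroing, each layer would spread activations from
    visible neighbours into the occluded positions.
    """
    for layer in layers:
        x = layer(x) * visible[None]
    return x

def insert_mask_token(latent, visible, token):
    """Replace latent vectors at masked positions with a shared token."""
    return np.where(visible[None], latent, token[:, None, None])

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 8, 8))            # 2-channel grid (I and Q)
visible = np.ones((8, 8), dtype=bool)
visible[:, 3:5] = False                   # occlude two temporal columns

features = masked_forward(x * visible[None], [box_filter3, box_filter3], visible)
mask_token = rng.normal(size=2)           # stand-in for a learned parameter
latent = insert_mask_token(features, visible, mask_token)
```

The predictor then receives a grid in which every masked position carries the same token, so anything it predicts there must come from the surrounding context.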
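The teacher encoder $f_{\bar\theta}$ is architecturally identical to the student encoder $f_\theta$, runs densely on unmasked inputs, and has its weights updated by exponential moving average:
$$\bar\theta \leftarrow \tau\,\bar\theta + (1 - \tau)\,\theta$$
where $\tau \in [0, 1)$ is the momentum coefficient.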
The L2 objective (as described above) restricts error computation to masked positions, ensuring the encoder–predictor pair can infer hidden information from context alone (Chu et al., 28 Jan 2026).
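The exponential-moving-average teacher update can be sketched as below. Parameters are held in plain dictionaries for illustration, and the momentum value `tau=0.9` is a hypothetical choice; the source does not state the value used.

```python
import numpy as np

def ema_update(teacher, student, tau):
    """In-place EMA step: teacher <- tau * teacher + (1 - tau) * student."""
    for name, w in student.items():
        teacher[name] = tau * teacher[name] + (1.0 - tau) * w
    return teacher

student = {"conv.weight": np.ones((3, 3)), "conv.bias": np.zeros(3)}
teacher = {k: np.zeros_like(v) for k, v in student.items()}

for _ in range(3):
    ema_update(teacher, student, tau=0.9)   # hypothetical momentum value
```

Because the teacher receives no gradients and moves slowly toward the student, its targets change smoothly between training steps.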
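In contrast, W-JEPA for CSI uses a five-layer MLP encoder with ReLU activations, projecting each CSI sample to a low-dimensional latent $z_t$, and a GRU-based predictor, trained via mean-squared error loss in the latent space over a horizon $H$:
$$\mathcal{L} = \frac{1}{H} \sum_{k=1}^{H} \lVert \hat{z}_{t+k} - \bar{z}_{t+k} \rVert_2^2$$
where $\hat{z}_{t+k}$ are the predicted latents and $\bar{z}_{t+k}$ are target latents produced by an EMA target encoder (Chaaya et al., 2024).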
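A rough sketch of the W-JEPA latent rollout follows. It makes several assumptions that the source does not confirm: the GRU hidden state is taken to be the latent itself, the velocity is a 2D input at each step, and the layer widths, latent size, and horizon are hypothetical. The weights are random, so the code only illustrates the data flow and the horizon-averaged loss, not a trained model.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def mlp_encode(csi, weights):
    """MLP with ReLU between layers and a linear final projection."""
    h = csi
    for i, (W, b) in enumerate(weights):
        h = h @ W + b
        if i < len(weights) - 1:
            h = np.maximum(h, 0.0)
    return h

def gru_step(h, v, p):
    """One GRU step: hidden state h (latent), input v (velocity)."""
    z = sigmoid(v @ p["Wz"] + h @ p["Uz"] + p["bz"])      # update gate
    r = sigmoid(v @ p["Wr"] + h @ p["Ur"] + p["br"])      # reset gate
    n = np.tanh(v @ p["Wn"] + (r * h) @ p["Un"] + p["bn"])
    return (1.0 - z) * n + z * h

def rollout_loss(csi_seq, vel_seq, enc_w, tgt_w, gru_p):
    """Mean squared latent error over the prediction horizon.

    csi_seq: (H + 1, D_csi) real-valued CSI features; vel_seq: (H, 2).
    """
    horizon = vel_seq.shape[0]
    h = mlp_encode(csi_seq[0], enc_w)            # latent of the first sample
    targets = mlp_encode(csi_seq[1:], tgt_w)     # targets from EMA encoder
    total = 0.0
    for k in range(horizon):
        h = gru_step(h, vel_seq[k], gru_p)       # autoregressive prediction
        total += np.sum((h - targets[k]) ** 2)
    return float(total / horizon)

rng = np.random.default_rng(0)
D_CSI, D_LAT, HORIZON = 32, 2, 5                 # hypothetical sizes
sizes = [D_CSI, 64, 32, 16, 8, D_LAT]            # five layers, widths assumed
enc_w = [(rng.normal(scale=0.1, size=(a, b)), np.zeros(b))
         for a, b in zip(sizes[:-1], sizes[1:])]
tgt_w = [(W.copy(), b.copy()) for W, b in enc_w]  # EMA copy at initialization
gru_p = {}
for gate in "zrn":
    gru_p["W" + gate] = rng.normal(scale=0.1, size=(2, D_LAT))
    gru_p["U" + gate] = rng.normal(scale=0.1, size=(D_LAT, D_LAT))
    gru_p["b" + gate] = np.zeros(D_LAT)

csi_seq = rng.normal(size=(HORIZON + 1, D_CSI))
vel_seq = rng.normal(size=(HORIZON, 2))
loss = rollout_loss(csi_seq, vel_seq, enc_w, tgt_w, gru_p)
```

Only the first CSI sample is encoded by the online encoder; every later latent is predicted from the previous prediction and the velocity, which is what allows simulation of channel dynamics in latent space.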
4. Channel Charting and Latent Dynamics
WirelessJEPA's predictive latent masking approach enables the learning of structured, general-purpose representations suitable for downstream classification and regression tasks. The pseudo-location embeddings in W-JEPA are designed to preserve spatial topology (channel charting), such that embeddings of similar CSI estimates are mapped to nearby points in latent space, reflecting user geometry. This is achieved implicitly by requiring the encoder to produce locally and globally predictable embeddings under realistic user velocities (Chaaya et al., 2024).
Pre-training the encoder with domain-specific charting losses (e.g., an angle–delay profile-based distance or a channel-geodesic distance) can further improve metrics such as continuity (CT), trustworthiness (TW), Kruskal stress (KS), and Rajski distance (RD), sharpening both local and global geometric fidelity of learned charts.
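Two of the chart-quality metrics named above, trustworthiness and continuity, can be computed from neighbor ranks as sketched below. This follows the standard rank-based definitions from the dimensionality-reduction literature rather than code from the source; the data are synthetic and the neighborhood size `k=5` is a hypothetical choice. The normalization is valid for `k < n / 2`.

```python
import numpy as np

def neighbor_ranks(points):
    """ranks[i, j] = rank of j among neighbors of i (self = 0, nearest = 1)."""
    d = np.linalg.norm(points[:, None] - points[None], axis=-1)
    np.fill_diagonal(d, -1.0)                  # force each point to rank itself first
    order = np.argsort(d, axis=1, kind="stable")
    n = len(points)
    ranks = np.empty_like(order)
    ranks[np.arange(n)[:, None], order] = np.arange(n)[None, :]
    return ranks

def trustworthiness(high, low, k):
    """Penalize points that are k-neighbors in the chart but not in the original space."""
    n = len(high)
    r_high, r_low = neighbor_ranks(high), neighbor_ranks(low)
    intruders = (r_low >= 1) & (r_low <= k) & (r_high > k)
    penalty = np.sum((r_high - k)[intruders])
    return 1.0 - 2.0 / (n * k * (2 * n - 3 * k - 1)) * penalty

def continuity(high, low, k):
    """Penalize original-space k-neighbors that are pushed apart in the chart."""
    return trustworthiness(low, high, k)

rng = np.random.default_rng(0)
high = rng.normal(size=(40, 6))                # synthetic high-dimensional features
chart_scaled = 2.0 * high                      # rank-preserving map
chart_random = rng.normal(size=(40, 2))        # chart unrelated to the data

tw_scaled = trustworthiness(high, chart_scaled, k=5)
ct_scaled = continuity(high, chart_scaled, k=5)
tw_random = trustworthiness(high, chart_random, k=5)
```

A map that preserves all neighbor rankings scores 1 on both metrics, while a chart unrelated to the data scores lower.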
5. Empirical Performance and Inductive Bias Analysis
Evaluation shows that the inductive bias introduced by mask shape strongly affects downstream task performance. For modulation classification, WirelessJEPA with time masks achieves 80.8% in-distribution linear-probe accuracy, surpassing the contrastive IQFM baseline (60.5%). For angle-of-arrival (AoA) tasks, antenna masks achieve 40.4%, exceeding time (2.7%) and random (12.6%) masks as well as the contrastive baseline (32.4%). Multi-block masks fall between the two extremes (modulation 73.1%, AoA 5.8%) (Chu et al., 28 Jan 2026).
In out-of-distribution transfer, WirelessJEPA achieves up to 9 percentage points higher accuracy than contrastive baselines on tasks such as RF fingerprinting, modulation recognition (RML2016.10a), and 5G NR interference classification. The learned latent space also appears well structured: k-NN classification matches or slightly trails linear-probe performance on domain-matched tasks.
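The two evaluation protocols mentioned above, k-NN classification and a linear probe on frozen embeddings, can be sketched as follows. The embeddings here are synthetic clusters rather than WirelessJEPA outputs, and a ridge-regularized least-squares classifier stands in for whatever linear probe the authors trained; both are assumptions made for illustration.

```python
import numpy as np

def knn_predict(train_z, train_y, test_z, k):
    """Majority vote among the k most cosine-similar training embeddings."""
    a = train_z / np.linalg.norm(train_z, axis=1, keepdims=True)
    b = test_z / np.linalg.norm(test_z, axis=1, keepdims=True)
    nearest = np.argsort(-(b @ a.T), axis=1)[:, :k]
    votes = train_y[nearest]                              # (n_test, k)
    n_classes = int(train_y.max()) + 1
    counts = np.stack([(votes == c).sum(axis=1) for c in range(n_classes)], axis=1)
    return counts.argmax(axis=1)

def linear_probe_fit(train_z, train_y, ridge=1e-3):
    """Least-squares linear classifier on frozen embeddings (with bias)."""
    n_classes = int(train_y.max()) + 1
    onehot = np.eye(n_classes)[train_y]
    zb = np.hstack([train_z, np.ones((len(train_z), 1))])
    gram = zb.T @ zb + ridge * np.eye(zb.shape[1])
    return np.linalg.solve(gram, zb.T @ onehot)

def linear_probe_predict(W, z):
    zb = np.hstack([z, np.ones((len(z), 1))])
    return (zb @ W).argmax(axis=1)

# Synthetic, well-separated "embeddings" for three classes.
rng = np.random.default_rng(0)
centers = 5.0 * np.eye(3, 16)

def sample(n_per_class):
    y = np.repeat(np.arange(3), n_per_class)
    z = centers[y] + 0.3 * rng.normal(size=(len(y), 16))
    return z, y

train_z, train_y = sample(20)
test_z, test_y = sample(10)

acc_knn = np.mean(knn_predict(train_z, train_y, test_z, k=5) == test_y)
W = linear_probe_fit(train_z, train_y)
acc_lin = np.mean(linear_probe_predict(W, test_z) == test_y)
```

When k-NN accuracy tracks linear-probe accuracy, class structure is already visible in raw embedding distances, which is the sense in which the latent space is described as structured.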
For the CSI-based W-JEPA, region classification accuracy remains nearly constant (87%) over a long prediction horizon, more than double that of a greedy region prediction baseline (43%). GRU and LSTM predictors perform comparably, whereas simple RNNs degrade at longer horizons. Prediction accuracy is robust to substantial velocity noise at moderate horizons, though it drops under extreme velocity biases (Chaaya et al., 2024).
| Method / mask geometry | In-domain modulation | In-domain AoA |
|---|---|---|
| WirelessJEPA, time masks | 80.8% | 2.7% |
| WirelessJEPA, antenna masks | 48.8% | 40.4% |
| WirelessJEPA, multi-block masks | 73.1% | 5.8% |
| WirelessJEPA, random masks | 54.0% | 12.6% |
| Contrastive baseline (IQFM) | 60.5% | 32.4% |
6. Practicality, Generalization, and Extensions
WirelessJEPA models are computationally efficient, with sparse convolutional inference and predictor overheads that fit within modern edge GPU and baseband processor budgets. No task-specific labels are required for foundation pre-training, allowing direct adaptation to new deployment conditions, antenna configurations, or channel environments. W-JEPA's self-supervised loop admits retraining using only unlabeled CSI and auxiliary sensor logs.
Potential extensions include:
- Hybrid dynamic mask geometries that adaptively combine spatial and temporal occlusions;
- Scaling the method to massive MIMO and extended temporal windows;
- Integrating physics-informed priors, such as channel state models, in the loss or prediction process;
- Applying JEPA-style pre-training in federated or distributed learning topologies.
A plausible implication is that such extensions could further boost both robustness and data-efficiency in transfer to novel wireless tasks.
7. Contributions, Open Directions, and Impact
WirelessJEPA establishes masked latent prediction as an effective unified pre-training paradigm for raw multi-antenna wireless signals, achieving robust cross-task generalization with a single encoder. Key contributions include the antenna–time grid for convolutional JEPA adaptation, spatio-temporal mask geometries for explicit inductive bias control, and empirical demonstration of masked latent prediction outperforming leading contrastive baselines across in-domain and transfer tasks (Chu et al., 28 Jan 2026, Chaaya et al., 2024). Future research is anticipated to explore hybrid masking, scaling, and physics-guided variants, as well as large-scale distributed wireless model pre-training. The methodology provides a foundation for constructing generalizable wireless foundation models supporting a broad array of downstream applications.