GesFi: WiFi Gesture Recognition Framework
- GesFi is a WiFi-based gesture recognition framework employing latent domain mining to autonomously discover intrinsic domain factors from CSI data without relying on physical labels.
- It integrates a denoising and visualization pipeline with iterative clustering and adversarial alignment to significantly improve cross-domain gesture recognition accuracy.
- Deployed on commodity WiFi hardware and evaluated on multiple datasets, it improves accuracy by up to 78 percentage points over traditional domain-adaptation methods.
GesFi is a WiFi-based gesture recognition framework that redefines domain generalization paradigms through the introduction of WiFi latent domain mining. Unlike preceding approaches that rely on physical labels such as user location or device orientation, GesFi autonomously discovers intrinsic domain factors responsible for distributional shifts in the Channel State Information (CSI) data, facilitating robust generalization in unseen target environments. The system integrates a denoising and visualization pipeline with iterative clustering and adversarial alignment to learn invariant features, achieving significant improvements in cross-domain recognition accuracy relative to prior methods (Zhang et al., 7 Jan 2026).
1. Principles of WiFi Latent Domain Mining
GesFi fundamentally departs from conventional domain adaptation by eschewing explicit physical-domain labels in favor of latent domains extracted from data statistics. The key premise is that physical labels (e.g., locations, orientations) poorly capture the nuanced factors that induce distribution changes in CSI during wireless gesture sensing. Instead, GesFi employs unsupervised clustering of learned feature representations, augmented with class-wise adversarial learning, to find latent domain factors more tightly coupled to real-world CSI variations. This mitigates two principal pitfalls: classification conflict (merging gesture classes with overlapping distributions) and manifold distortion (misalignment of distant domains distorting true CSI geometry).
2. Data Processing and Standardization Pipeline
The acquisition and preprocessing pipeline in GesFi ensures that raw WiFi CSI is transformed into high-fidelity input images suitable for deep learning:
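- CSI Ratio Denoising: Raw CSI at subcarrier $f$ and time $t$ is modeled as $H(f,t) = e^{-j\theta(t)}\bigl(H_s(f) + H_d(f,t)\bigr)$, with $H_s$ and $H_d$ separating static and dynamic (gesture-induced) contributions. To suppress the oscillator phase noise $\theta(t)$, the CSI of two adjacent antennas $a$ and $b$ is ratioed: $R(f,t) = H_a(f,t) / H_b(f,t)$. Because both antennas share the same oscillator, the phase-noise factor cancels, and $R(f,t)$ approximates a Möbius transform of the true gesture phase, robustly filtering hardware noise.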
- Short-Time Fourier Transform (STFT) for Doppler Analysis: The instantaneous phase is high-pass filtered to remove static multipath. An STFT is then computed to extract Doppler Frequency Shifts (DFS).
- Visualization and Fusion: Heatmaps of the resulting signal representations for each antenna pair are concatenated into multi-channel images, providing a standardized input to a ResNet-18 backbone.
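The denoising and Doppler steps above can be illustrated with a small synthetic sketch. Everything here is an assumption made for illustration and not the paper's implementation: the signal model (a static path plus one rotating dynamic path per antenna, multiplied by a shared random phase), the helper names `simulate_csi`, `csi_ratio`, and `stft_power`, the amplitudes, and the use of mean removal on the complex ratio as a crude stand-in for the high-pass filter.

```python
import cmath
import math
import random

def simulate_csi(n, fs, doppler_hz, static, dyn_amp, phase_noise):
    """One antenna: shared phase noise times (static path + rotating dynamic path)."""
    out = []
    for k in range(n):
        dynamic = dyn_amp * cmath.exp(2j * math.pi * doppler_hz * k / fs)
        out.append(cmath.exp(-1j * phase_noise[k]) * (static + dynamic))
    return out

def csi_ratio(h_a, h_b):
    """Element-wise ratio of two antennas; the shared phase-noise factor cancels."""
    return [a / b for a, b in zip(h_a, h_b)]

def stft_power(x, win, hop):
    """Naive Hann-windowed STFT; returns one power spectrum (list) per frame."""
    hann = [0.5 - 0.5 * math.cos(2 * math.pi * i / (win - 1)) for i in range(win)]
    frames = []
    for start in range(0, len(x) - win + 1, hop):
        seg = [x[start + i] * hann[i] for i in range(win)]
        spec = []
        for m in range(win):
            acc = sum(seg[i] * cmath.exp(-2j * math.pi * m * i / win) for i in range(win))
            spec.append(abs(acc) ** 2)
        frames.append(spec)
    return frames

# Hypothetical setup: 100 Hz packet rate, 2 s of data, 10 Hz Doppler shift.
fs, n, doppler_hz = 100.0, 200, 10.0
random.seed(0)
noise = [random.uniform(-math.pi, math.pi) for _ in range(n)]
zero = [0.0] * n

h_a = simulate_csi(n, fs, doppler_hz, 1.0 + 0.0j, 0.30, noise)
h_b = simulate_csi(n, fs, doppler_hz, 1.5 + 0.0j, 0.05, noise)
ratio = csi_ratio(h_a, h_b)

# Reference ratio computed from the same antennas without any phase noise.
clean = csi_ratio(simulate_csi(n, fs, doppler_hz, 1.0 + 0.0j, 0.30, zero),
                  simulate_csi(n, fs, doppler_hz, 1.5 + 0.0j, 0.05, zero))
max_dev = max(abs(r - c) for r, c in zip(ratio, clean))

# Crude static removal (stand-in for the high-pass filter), then STFT.
mean = sum(ratio) / n
dynamic_part = [r - mean for r in ratio]
win, hop = 50, 25
frames = stft_power(dynamic_part, win, hop)
peak_bin = max(range(win), key=lambda m: frames[0][m])
peak_hz = peak_bin * fs / win

print(f"max deviation from noise-free ratio: {max_dev:.2e}")
print(f"dominant Doppler frequency in first frame: {peak_hz} Hz")
```

In this toy setting the ratio matches its noise-free counterpart to floating-point precision, and the spectrogram peak sits at the simulated Doppler frequency. A practical pipeline would more likely use an FFT-based STFT from a numerical library than the quadratic-time loop shown here.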
3. Latent Domain Discovery and Gesture Semantic Suppression
GesFi discovers latent domains via an iterative two-step scheme:
- Pre-Learning Gesture Discrimination: A feature extractor and a bottleneck layer are trained together with a gesture classifier by minimizing the cross-entropy loss on gesture labels.
- Pseudo-Labeling and Clustering: Domain centroids are initialized from the softmax logits of a domain-classifier head. Samples are assigned domain pseudo-labels by proximity to the centroids in bottleneck space (Euclidean distance), and the centroids are then updated iteratively.
- Class-wise Adversarial Learning: During clustering, a gradient-reversal layer and an adversarial gesture classifier are used. Training the domain-discovery branch while the reversed gradient pushes the features to maximize the adversarial gesture classifier's loss decouples gesture semantics from domain discrimination, preventing semantic conflict.
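The pseudo-labeling loop can be sketched as a nearest-centroid assignment followed by a centroid update. This is a minimal illustration under stated assumptions: the update rule (mean of the assigned samples, as in k-means) is not given in the text above, the initial centroids are hand-picked here instead of being derived from a domain-classifier head, and the adversarial gesture-suppression branch is omitted. The function names and the toy 2-D "bottleneck features" are hypothetical.

```python
import math
import random

def nearest(point, centroids):
    """Index of the centroid closest to `point` in Euclidean distance."""
    return min(range(len(centroids)), key=lambda k: math.dist(point, centroids[k]))

def mine_latent_domains(features, centroids, iters=10):
    """Alternate pseudo-label assignment and mean-based centroid updates."""
    labels = []
    for _ in range(iters):
        labels = [nearest(f, centroids) for f in features]
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [f for f, lab in zip(features, labels) if lab == k]
            if not members:
                new_centroids.append(old)  # leave an empty cluster where it is
                continue
            dim = len(old)
            new_centroids.append(
                [sum(m[d] for m in members) / len(members) for d in range(dim)]
            )
        if new_centroids == centroids:  # assignments are stable, stop early
            break
        centroids = new_centroids
    return labels, centroids

# Toy bottleneck features: two well-separated groups standing in for latent domains.
random.seed(0)
blob_a = [[random.gauss(0.0, 0.5), random.gauss(0.0, 0.5)] for _ in range(20)]
blob_b = [[random.gauss(5.0, 0.5), random.gauss(5.0, 0.5)] for _ in range(20)]
features = blob_a + blob_b

labels, centroids = mine_latent_domains(features, [[1.0, 1.0], [4.0, 4.0]])
print(labels)
print(centroids)
```

The resulting pseudo-labels play the role that physical labels (location, orientation) play in conventional domain-adversarial training.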
4. Adversarial Alignment for Robust Generalization
After clustering, GesFi aligns features across latent domains using adversarial domain discrimination:
- Domain-Adversarial Loss: The feature extractor, bottleneck, and gesture classifier are trained to minimize the gesture classification loss while simultaneously confusing the latent domain discriminator by maximizing its prediction loss, which is computed with rebalanced class weights. A gradient-reversal layer is used to force feature invariance to domain factors.
- Training Strategy: Training alternates between updating the feature extractor/gesture classifier and maximizing domain discrimination, iterating latent mining and alignment for 10–20 epochs until convergence.
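The effect of gradient reversal can be seen in a deliberately tiny scalar model. This is a sketch under assumptions, not the paper's network: the "feature extractor" is a single weight `w`, the "domain discriminator" is a single logistic weight `v`, the domain labels are binary, and the reversal strength `lam` and learning rate `lr` are arbitrary. Class rebalancing is left out. The reversal layer is the identity in the forward pass and multiplies the gradient by `-lam` in the backward pass, so one ordinary descent step on `w` raises the discriminator's loss.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def domain_loss(w, v, data):
    """Mean binary cross-entropy of the discriminator sigmoid(v * z), with z = w * x."""
    total = 0.0
    for x, d in data:
        p = sigmoid(v * w * x)
        total -= d * math.log(p) + (1 - d) * math.log(1 - p)
    return total / len(data)

def grl_gradients(w, v, data, lam):
    """Gradients for one descent step; the feature path is sign-flipped by the reversal layer."""
    grad_w = grad_v = 0.0
    for x, d in data:
        z = w * x                         # feature; the reversal layer is the identity going forward
        p = sigmoid(v * z)
        dloss_dlogit = p - d
        grad_v += dloss_dlogit * z        # discriminator receives the ordinary gradient
        dloss_dz = dloss_dlogit * v
        grad_w += (-lam * dloss_dz) * x   # extractor receives the reversed, scaled gradient
    n = len(data)
    return grad_w / n, grad_v / n

# Toy samples (input, latent-domain label): the two domains differ in the sign of x.
data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]
w, v, lam, lr = 1.0, 1.0, 1.0, 0.5

before = domain_loss(w, v, data)
grad_w, grad_v = grl_gradients(w, v, data, lam)
after_extractor_step = domain_loss(w - lr * grad_w, v, data)
after_discriminator_step = domain_loss(w, v - lr * grad_v, data)

print(f"domain loss before:                   {before:.4f}")
print(f"after extractor step (reversed grad): {after_extractor_step:.4f}")
print(f"after discriminator step:             {after_discriminator_step:.4f}")
```

The extractor step makes the domains harder to tell apart (higher discriminator loss), while the discriminator step does the opposite. In a deep-learning framework the same behavior is usually obtained with a custom autograd function that negates the incoming gradient.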
5. System Deployment and Evaluation
GesFi was implemented using commodity WiFi transceivers in multiple configurations:
- Single-Pair Mode: 1 transmitter, 1 receiver (3 antennas) yield one antenna-pair CSI ratio.
- Multi-Pair Mode: 1 transmitter, 3 receivers (each 3 antennas), producing six antenna pairs.
- Datasets: Widar3.0, ARIL, XRF55, and real-world data (details in following table).
| Dataset | Subjects/Env | Gestures | Total Samples | Hardware Config |
|---|---|---|---|---|
| Widar3.0 | 9 users/3 envs | 6 | 12,750 | Single/Multi |
| ARIL | 1 user/16 pos | 6 | 1,392 | Single |
| XRF55 | 39 sub/4 scenes | 8 | 6,240 | Multi |
| Real-World | 2 users, uncontrolled traffic | 6 | 450 | Multi |
- Training Protocol: ResNet-18 backbone with a preset number of latent domains; Adam optimizer (batch size 32, 50 epochs); pre-learning for 2 epochs, followed by alternating latent mining and adversarial alignment.
6. Benchmarking and Quantitative Performance
GesFi demonstrated substantial advances in cross-domain gesture recognition accuracy:
- Widar3.0 (Multi-Pair): Improved accuracy by up to 78 percentage points (cross-location/environment: 98.82% vs. 20.73% for the baseline).
- Widar3.0 (Single-Pair/ARIL): Outperformed MetaFormer/one-shot and Wi-Learner by up to 18% and 26%, respectively.
- XRF55: Cross-env accuracy 62.15% (vs. WiGRUNT 55.92%), cross-user 67.18% (vs. 63.47%).
- Real-World Generalization: In-domain 62.89% (baseline 54.44%), cross-location 46.00% (baseline 38.89%), cross-orientation 42.89% (baseline 38.00%).
- Training Data Sensitivity: Maintained 90% cross-location accuracy with only 20% of the Widar3.0 data. Increasing source-domain diversity (from 1 to 3 source domains) improved cross-orientation accuracy by +34%.
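As a quick check on how the headline numbers relate, the short script below recomputes the gains from the accuracy pairs quoted in this section, reading each gain as an absolute difference in accuracy (percentage points). That reading is an interpretation; the dictionary keys are descriptive labels chosen here.

```python
# (GesFi accuracy %, comparison accuracy %) as quoted in this section.
results = {
    "Widar3.0 multi-pair, cross-location/environment": (98.82, 20.73),
    "XRF55, cross-environment": (62.15, 55.92),
    "XRF55, cross-user": (67.18, 63.47),
    "Real-world, in-domain": (62.89, 54.44),
    "Real-world, cross-location": (46.00, 38.89),
    "Real-world, cross-orientation": (42.89, 38.00),
}

# Absolute gain in percentage points for each setting.
gains = {name: round(ours - other, 2) for name, (ours, other) in results.items()}

for name, gain in gains.items():
    print(f"{name}: +{gain:.2f} points")
```

The largest gap, on Widar3.0 in multi-pair mode, is the source of the "up to 78" figure; the gaps on XRF55 and the real-world data are in the single digits.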
7. Contributions, Limitations, and Interpretive Insights
GesFi identified and systematically addressed two core limitations of conventional domain-adversarial approaches in WiFi gesture recognition:
- Classification Conflict: Source error increases when adversarial alignment blends gesture classes with overlapping CSI.
- Manifold Distortion: Physical domain-based alignment can misrepresent the geometry of the CSI manifold, impairing transfer performance.
By introducing WiFi latent domain mining, which iteratively clusters representations and erases gesture semantics during domain discovery, GesFi achieves tighter generalization bounds. This suggests that adversarial learning benefits substantially when domain factors are derived directly from CSI statistics rather than imposed through heuristic physical labels.
GesFi's implementation demonstrated consistent outperformance of state-of-the-art domain-adaptation frameworks without access to target-domain data. Its modular design supports both single-pair and multi-pair hardware; pipeline generality was shown across multiple public datasets and uncontrolled real-world settings.
A plausible implication is the broader applicability of latent domain mining strategies in other sensor-based domain generalization tasks, where physical labels are incomplete proxies for underlying distributional shifts (Zhang et al., 7 Jan 2026).