GesFi: WiFi Gesture Recognition Framework

Updated 15 January 2026

GesFi is a WiFi-based gesture recognition framework employing latent domain mining to autonomously discover intrinsic domain factors from CSI data without relying on physical labels.
It integrates a denoising and visualization pipeline with iterative clustering and adversarial alignment to significantly improve cross-domain gesture recognition accuracy.
System deployment on commodity WiFi hardware across multiple datasets demonstrates accuracy gains of up to 78 percentage points over traditional domain-adaptation methods.

GesFi is a WiFi-based gesture recognition framework that approaches domain generalization through WiFi latent domain mining. Unlike preceding approaches that rely on physical labels such as user location or device orientation, GesFi autonomously discovers the intrinsic domain factors responsible for distributional shifts in Channel State Information (CSI) data, which supports generalization to unseen target environments. The system integrates a denoising and visualization pipeline with iterative clustering and adversarial alignment to learn invariant features, achieving significant improvements in cross-domain recognition accuracy relative to prior methods (Zhang et al., 7 Jan 2026).

1. Principles of WiFi Latent Domain Mining

GesFi fundamentally departs from conventional domain adaptation by eschewing explicit physical-domain labels in favor of latent domains extracted from data statistics. The key premise is that physical labels (e.g., locations, orientations) poorly capture the nuanced factors that induce distribution changes in CSI during wireless gesture sensing. Instead, GesFi employs unsupervised clustering of learned feature representations, augmented with class-wise adversarial learning, to find latent domain factors more tightly coupled to real-world CSI variations. This mitigates two principal pitfalls: classification conflict (merging gesture classes with overlapping distributions) and manifold distortion (misalignment of distant domains distorting true CSI geometry).

2. Data Processing and Standardization Pipeline

The acquisition and preprocessing pipeline in GesFi ensures that raw WiFi CSI is transformed into high-fidelity input images suitable for deep learning:

CSI Ratio Denoising: Raw CSI at subcarrier $f$ and time $t$ is modeled as $R = HS + \mathcal{N}$, with $H = H_s + H_d$ separating static and dynamic (gesture-induced) contributions. To suppress oscillator phase noise $\theta_n$, the CSI of two adjacent antennas, $H_1(f,t)$ and $H_2(f,t)$, is divided: $H_q(f,t) = H_1(f,t)/H_2(f,t)$. Because both antennas share the same oscillator, the common phase term cancels in the ratio. For small $\Delta d = d_2 - d_1$, $H_q$ approximates a Möbius transform of the true gesture phase, robustly filtering hardware noise.
Short-Time Fourier Transform (STFT) for Doppler Analysis: The instantaneous phase $\mathcal{P}(f,t) = \angle H_q(f, t)$ is high-pass filtered to remove static multipath. An STFT is then computed to extract Doppler Frequency Shifts (DFS): $DFS(f, \omega) = STFT\{H_q(f, \cdot)\}(t, \omega)$.
Visualization and Fusion: Heatmaps of $\mathcal{P}(f,t)$ and $DFS(f,\omega)$ for each antenna pair are concatenated into multi-channel images (resolution typically $224 \times 224$), providing a standardized input to a ResNet-18 backbone.
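
The denoising and Doppler steps above can be illustrated with a minimal, dependency-free sketch on synthetic data. It is not the GesFi implementation: the sampling rate, Doppler shift, static-path values, window sizes, and the function names `csi_ratio` and `stft_magnitude` are all assumptions chosen for illustration. Mean subtraction stands in for the high-pass filter, and antenna 2 is given only a static path to keep the ratio easy to read.

```python
import cmath
import math
import random

def csi_ratio(h1, h2):
    """Divide antenna 1 CSI by antenna 2 CSI, sample by sample."""
    return [a / b for a, b in zip(h1, h2)]

def stft_magnitude(signal, win_len, hop):
    """Hann-windowed magnitude STFT via a direct DFT (slow but dependency-free)."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / win_len) for n in range(win_len)]
    frames = []
    for start in range(0, len(signal) - win_len + 1, hop):
        seg = [signal[start + n] * window[n] for n in range(win_len)]
        frames.append([
            abs(sum(seg[n] * cmath.exp(-2j * math.pi * k * n / win_len)
                    for n in range(win_len)))
            for k in range(win_len)
        ])
    return frames

# Synthetic single-subcarrier example; every constant below is illustrative.
random.seed(0)
FS = 256          # sampling rate in Hz (assumed)
DOPPLER_HZ = 32   # gesture-induced Doppler shift in Hz (assumed)
N = 256           # number of CSI samples

static_1, static_2 = 1.0 + 0.5j, 0.8 - 0.2j
clean_1 = [static_1 + 0.3 * cmath.exp(2j * math.pi * DOPPLER_HZ * n / FS) for n in range(N)]
clean_2 = [static_2] * N  # simplification: antenna 2 carries only the static path

# Both antennas share one oscillator, so they see the same random phase theta_n(t).
phase_noise = [cmath.exp(1j * random.uniform(-math.pi, math.pi)) for _ in range(N)]
noisy_1 = [h * p for h, p in zip(clean_1, phase_noise)]
noisy_2 = [h * p for h, p in zip(clean_2, phase_noise)]

# The shared phase term cancels in the ratio, up to floating-point error.
ratio = csi_ratio(noisy_1, noisy_2)
max_error = max(abs(r - c) for r, c in zip(ratio, csi_ratio(clean_1, clean_2)))

# Crude static-path removal: subtract the mean (stand-in for the high-pass filter).
mean = sum(ratio) / N
dynamic = [r - mean for r in ratio]

WIN, HOP = 64, 32
spectrogram = stft_magnitude(dynamic, WIN, HOP)
peak_bins = [max(range(WIN), key=frame.__getitem__) for frame in spectrogram]
peak_hz = [k * FS / WIN for k in peak_bins]  # dominant Doppler frequency per frame
```

In this toy setup the ratio of the noisy streams matches the ratio of the clean streams, and each STFT frame peaks at the injected Doppler frequency. Rendering `spectrogram` as a heatmap would give one channel of the kind of image described above.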

3. Latent Domain Discovery and Gesture Semantic Suppression

GesFi discovers $K$ latent domains through a pre-learning stage followed by iterative clustering with class-wise adversarial learning:

Pre-Learning Gesture Discrimination: A feature extractor $h_f$ and bottleneck $h_b^p$ are trained with a cross-entropy gesture classifier $h_c^p$, minimizing $\mathcal{L}_{super} = \mathbb{E}_{(x,y_g)}[-y_g \cdot \log h_c^p(h_b^p(h_f(x)))]$.
Pseudo-Labeling and Clustering: Initial domain centroids $\tilde{\mu}_k$ are derived from the softmax outputs of a domain-classifier head $h_c^l$. Each sample is then assigned a domain pseudo-label $y_d$ by Euclidean proximity in bottleneck space, $y_d = \arg \min_k D(h_b^l(h_f(x)), \mu_k)$, and the centroids $\mu_k$ are updated iteratively.
Class-wise Adversarial Learning: During clustering, a gradient-reversal layer $\mathcal{R}_{\lambda_1}$ and adversarial gesture classifier $h_{adv}^l$ are used. Minimizing $\mathcal{L}_{lad}$ while maximizing the reversed-gradient $\mathcal{L}_{adv}$ decouples gesture semantics from domain discrimination, preventing semantic conflict.
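
The pseudo-labeling rule $y_d = \arg \min_k D(\cdot, \mu_k)$ and the iterative centroid update can be sketched as a plain nearest-centroid loop. This is a minimal sketch under stated assumptions, not the authors' procedure: the 2-D "bottleneck features", the initial centroids, and the function names are hypothetical, the centroids are hard-coded rather than derived from a domain-classifier head, and the adversarial suppression of gesture semantics is omitted.

```python
import math

def assign_domains(features, centroids):
    """Pseudo-label each feature with the index of its nearest centroid (Euclidean)."""
    return [
        min(range(len(centroids)), key=lambda k: math.dist(f, centroids[k]))
        for f in features
    ]

def update_centroids(features, labels, centroids):
    """Recompute each centroid as the mean of its members; keep empty clusters unchanged."""
    updated = []
    for k, old in enumerate(centroids):
        members = [f for f, y in zip(features, labels) if y == k]
        if not members:
            updated.append(old)
            continue
        updated.append(tuple(
            sum(m[d] for m in members) / len(members) for d in range(len(old))
        ))
    return updated

def mine_latent_domains(features, centroids, max_iters=20):
    """Alternate assignment and centroid update until the labels stop changing."""
    labels = assign_domains(features, centroids)
    for _ in range(max_iters):
        centroids = update_centroids(features, labels, centroids)
        new_labels = assign_domains(features, centroids)
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids

# Toy bottleneck features forming K = 3 groups (values are illustrative only).
toy_features = [(0.0, 0.1), (0.2, -0.1), (5.0, 5.1), (5.2, 4.9), (10.0, 0.1), (9.8, -0.1)]
initial_centroids = [(1.0, 1.0), (4.0, 4.0), (8.0, 1.0)]  # hypothetical initialization

domain_labels, final_centroids = mine_latent_domains(toy_features, initial_centroids)
```

In the framework described above, the features would come from $h_b^l(h_f(x))$ and would be re-extracted as the networks train, so the labels can shift between rounds. Here the features are fixed, so the loop settles after one update.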

4. Adversarial Alignment for Robust Generalization

After clustering, GesFi aligns features across latent domains using adversarial domain discrimination:

Domain-Adversarial Loss: The feature extractor $h_f$, bottleneck $h_b^d$, and gesture classifier $h_c^d$ are trained to minimize

$\mathcal{L}_{ges} = \mathbb{E}_{(x,y_g)}[-y_g \cdot \log h_c^d(h_b^d(h_f(x)))]$

while simultaneously confusing the latent domain discriminator $h_{dadv}$ by maximizing its prediction loss with rebalanced class weights $w_d$. Gradient reversal ($\mathcal{R}_{\lambda_2}$) is used to force feature invariance to domain factors.

Training Strategy: Training alternates between updating the feature extractor/gesture classifier and maximizing domain discrimination, iterating latent mining and alignment for 10–20 epochs until convergence.
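
The effect of a gradient-reversal layer can be seen in a scalar toy model with hand-written gradients. The sketch below shows the general technique (identity on the forward pass, gradient scaled by $-\lambda$ on the backward pass), not GesFi's networks: the one-parameter "extractor" `w`, the one-parameter "discriminator" `v`, the binary domain label, and all numeric values are assumptions, and the rebalanced class weights $w_d$ are left out.

```python
import math

class GradientReversal:
    """Identity on the forward pass; multiplies incoming gradients by -lam on the backward pass."""

    def __init__(self, lam):
        self.lam = lam

    def forward(self, x):
        return x

    def backward(self, grad_output):
        return -self.lam * grad_output

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def domain_loss(w, v, x, d):
    """Binary cross-entropy of a toy domain discriminator v applied to the feature w * x."""
    p = sigmoid(v * (w * x))
    return -(d * math.log(p) + (1 - d) * math.log(1 - p))

# Toy setup: feature z = w * x, domain logit = v * z, binary domain label d.
w, v, lam = 0.5, 1.5, 0.1
x, d = 2.0, 1.0
grl = GradientReversal(lam)

z = w * x
p = sigmoid(v * grl.forward(z))
loss_before = domain_loss(w, v, x, d)

dloss_dlogit = p - d                 # derivative of the cross-entropy w.r.t. the logit
grad_v = dloss_dlogit * z            # discriminator gradient, used as-is
grad_z = dloss_dlogit * v            # gradient arriving at the reversal layer
grad_w = grl.backward(grad_z) * x    # extractor receives the sign-flipped, scaled gradient

lr = 0.1
v_new = v - lr * grad_v  # discriminator step: lowers the domain loss
w_new = w - lr * grad_w  # extractor step: raises the domain loss

loss_after_discriminator_step = domain_loss(w, v_new, x, d)
loss_after_extractor_step = domain_loss(w_new, v, x, d)
```

Both parameters are updated by ordinary gradient descent, yet the two steps push the domain loss in opposite directions. That is how a single backward pass can train the discriminator while driving the features toward domain invariance.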

5. System Deployment and Evaluation

GesFi was implemented using commodity WiFi transceivers in multiple configurations:

Single-Pair Mode: 1 transmitter, 1 receiver (3 antennas) yield one antenna-pair CSI ratio.
Multi-Pair Mode: 1 transmitter, 3 receivers (each 3 antennas), producing six antenna pairs.
Datasets: Widar3.0, ARIL, XRF55, and real-world data (details in following table).

Dataset	Subjects / Environments	Gestures	Total Samples	Hardware Mode
Widar3.0	9 users / 3 environments	6	12,750	Single-pair and multi-pair
ARIL	1 user / 16 positions	6	1,392	Single-pair
XRF55	39 subjects / 4 scenes	8	6,240	Multi-pair
Real-World	2 users / uncontrolled traffic	6	450	Multi-pair

Training Protocol: ResNet-18 backbone, $K=3$ latent domains, Adam optimizer (learning rate $2\times 10^{-3}$, batch size 32, 50 epochs), pre-learning for 2 epochs, followed by alternating latent mining/adversarial alignment.
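
For reproduction attempts, the reported hyperparameters could be collected in a single configuration object. The sketch below is one possible layout: the class name `GesFiTrainConfig` and its field names are hypothetical, and only the default values are taken from the training protocol above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GesFiTrainConfig:
    """Hyperparameters as reported in the training protocol; field names are illustrative."""
    backbone: str = "resnet18"
    num_latent_domains: int = 3      # K
    optimizer: str = "adam"
    learning_rate: float = 2e-3
    batch_size: int = 32
    epochs: int = 50
    pre_learning_epochs: int = 2

    def __post_init__(self):
        # Basic sanity checks so that a typo in an override fails early.
        if self.num_latent_domains < 1:
            raise ValueError("num_latent_domains must be at least 1")
        if self.learning_rate <= 0:
            raise ValueError("learning_rate must be positive")
        if not 0 <= self.pre_learning_epochs <= self.epochs:
            raise ValueError("pre_learning_epochs must lie between 0 and epochs")

default_config = GesFiTrainConfig()
ablation_config = GesFiTrainConfig(num_latent_domains=5)  # hypothetical override
```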

6. Benchmarking and Quantitative Performance

GesFi demonstrated substantial advances in cross-domain gesture recognition accuracy:

Widar3.0 (Multi-Pair): Improved cross-location/cross-environment accuracy by up to 78 percentage points (98.82% vs. 20.73% for the baseline).
Widar3.0 (Single-Pair/ARIL): Outperformed MetaFormer/one-shot and Wi-Learner by up to 18% and 26% respectively.
XRF55: Cross-env accuracy 62.15% (vs. WiGRUNT 55.92%), cross-user 67.18% (vs. 63.47%).
Real-World Generalization: In-domain 62.89% (baseline 54.44%), cross-location 46.00% (baseline 38.89%), cross-orientation 42.89% (baseline 38.00%).
Training Data Sensitivity: Maintained more than 90% cross-location accuracy with only 20% of the Widar3.0 data. Increasing the number of source domains from 1 to 3 improved cross-orientation accuracy by 34%.
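
The headline gains are easiest to compare as absolute differences in accuracy, that is, percentage points. The short calculation below uses only the accuracy pairs quoted in this section; the dictionary keys are descriptive labels added for readability.

```python
# (GesFi accuracy %, baseline accuracy %) as quoted in the benchmark list above.
reported = {
    "Widar3.0 multi-pair, cross-domain": (98.82, 20.73),
    "XRF55 cross-environment": (62.15, 55.92),
    "XRF55 cross-user": (67.18, 63.47),
    "Real-world in-domain": (62.89, 54.44),
    "Real-world cross-location": (46.00, 38.89),
    "Real-world cross-orientation": (42.89, 38.00),
}

# Absolute gain in percentage points, rounded to two decimals.
gains_pp = {name: round(ours - base, 2) for name, (ours, base) in reported.items()}

for name, gain in gains_pp.items():
    print(f"{name}: +{gain:.2f} percentage points")
```

Read this way, the 78-point figure is specific to the Widar3.0 multi-pair comparison. The margins on XRF55 and the real-world data are in the single digits.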

7. Contributions, Limitations, and Interpretive Insights

GesFi identified and systematically addressed two core limitations of conventional domain-adversarial approaches in WiFi gesture recognition:

Classification Conflict: Source error increases when adversarial alignment blends gesture classes with overlapping CSI.
Manifold Distortion: Physical domain-based alignment can misrepresent the geometry of the CSI manifold, impairing transfer performance.

By introducing WiFi latent domain mining, which iteratively clusters representations and erases gesture semantics during domain discovery, GesFi achieves tighter generalization bounds. This suggests that adversarial learning benefits substantially when domain factors are derived directly from CSI statistics rather than imposed through heuristic physical labels.

GesFi consistently outperformed state-of-the-art domain-adaptation frameworks without access to target-domain data. Its modular design supports both single-pair and multi-pair hardware, and the generality of the pipeline was shown across multiple public datasets and uncontrolled real-world settings.

A plausible implication is the broader applicability of latent domain mining strategies in other sensor-based domain generalization tasks, where physical labels are incomplete proxies for underlying distributional shifts (Zhang et al., 7 Jan 2026).


References

Zhang et al. (2026). Beyond Physical Labels: Redefining Domains for Robust WiFi-based Gesture Recognition.
