
SwEYEpinch: Exploring Intuitive, Efficient Text Entry for Extended Reality via Eye and Hand Tracking

Published 3 Apr 2026 in cs.HC | (2604.03520v1)

Abstract: Despite steady progress, text entry in Extended Reality (XR) often remains slower and more effortful than typing on a physical keyboard or touchscreen. We explore a simple idea: use gaze to swipe through a virtual keyboard for the fast, low-effort where and a manual pinch held throughout the swipe for the when, extending and validating it through a series of user studies. We first show that a basic version including a low-latency decoder with spatiotemporal Dynamic Time Warping and fixation filtering outperforms selecting individual keys sequentially, either by finger tapping each or gazing at each while pinching. We then add mid-swipe prediction and in-gesture cancellation, improving words per minute (WPM) without hurting accuracy. We show that this approach is faster and more preferred than previous gaze-swipe approaches, finger tapping with prediction, or hand swiping with the same additions. Furthermore, a seven-day, 30-session study demonstrates sustained learning, with peak performance reaching 64.7 WPM.

Summary

  • The paper presents a hybrid gaze–pinch paradigm that decouples spatial targeting from explicit commitment, yielding state-of-the-art text entry speeds in XR.
  • It employs advanced algorithms such as mid-swipe prediction, DTW-based alignment, and contextual language modeling to reduce latency and enhance candidate matching.
  • User studies demonstrate significant improvements in speed (up to 64.7 WPM), reduced workload, and robust skill transfer compared to traditional XR input methods.

SwEYEpinch: A Hybrid Gaze–Pinch Paradigm for High-Performance Text Entry in XR

Introduction

The persistent challenge of efficient text entry in extended reality (XR) environments significantly limits the utility and adoption of head-worn displays (HWDs). Classical input paradigms, such as controller-based raycasting, mid-air hand tracking, and external peripherals, offer at best moderate speeds and often place undue physical or situational constraints on users. Recent research has identified the potential of leveraging eye tracking for rapid, low-effort spatial targeting, but gaze-activated techniques alone are typically bottlenecked by dwell-time latency or unreliable activation, culminating in a fundamental speed cap and high error susceptibility.

This work introduces the SwEYEpinch paradigm (2604.03520), which explicitly divides input into a rapid "where" (gaze for continuous, low-effort target tracing across keys) and a deliberate "when" (a small, explicit manual pinch to delimit and confirm word entry). This hybridization is complemented by algorithmic advances—including mid-swipe prediction and low-friction error correction—enabling state-of-the-art speed and usability on commodity XR hardware.

System Overview and Methods

Interaction Design

SwEYEpinch operationalizes the principle of decoupling target selection from commitment, instantiated as follows:

  • Gaze-based Word Pathing: Users trace intended words across a virtual QWERTY keyboard using gaze. No per-character dwell or confirmation is necessary.
  • Pinch Delimiter: Users initiate a pinch gesture before starting their word-swipe and release it to signal end-of-word. This explicit gesture disambiguates word boundaries without introducing dwell delays or Midas-touch errors.
  • Mid-Swipe Prediction: Candidates are presented during swipes (not only post-commit), via a highly optimized multi-stage decoder. This enables early commits and reduces unnecessary gaze traversal ("verification glances").
  • Mid-Swipe Deletion & Deletion Peek: Error correction is supported natively: users can abort ongoing swipes mid-gesture, and a contextual preview window reduces gaze shift for deletions.
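The pinch-delimited interaction above can be captured by a small state holder. This is a hypothetical sketch only: the event names `on_pinch_start`, `on_gaze_sample`, and `on_pinch_release` are invented for illustration, not the paper's API.

```python
from dataclasses import dataclass, field

@dataclass
class SwipeSession:
    """Gaze samples are collected only while the pinch is held; releasing
    the pinch closes the word and hands the trace to the decoder."""
    active: bool = False
    trace: list = field(default_factory=list)

    def on_pinch_start(self):
        self.active = True
        self.trace = []

    def on_gaze_sample(self, x: float, y: float, t: float):
        # Gaze outside a pinch commits nothing, so there is no Midas touch.
        if self.active:
            self.trace.append((x, y, t))

    def on_pinch_release(self):
        self.active = False
        return self.trace
```

The key property is that the gaze channel is always passive: only the pinch hold turns it into input, which is exactly the "where"/"when" split the design argues for.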

Decoding Pipeline

The decoding backbone, Gaze2Word, applies a multi-step denoising and alignment regime essential for real-time, gaze-based swipe processing:

  1. Fixation Filtering (Velocity-Threshold): High-frequency eye-tracking noise is attenuated through identification of fixations (e.g., using I-VT).
  2. Density-Based Spatial Clustering (DBSCAN): Further condensation of fixation points to cluster centers reduces computational load (e.g., reduction from 320 to ~15 points per swipe).
  3. Spatiotemporal Dynamic Time Warping (DTW): Candidate words are aligned using a distance function incorporating both spatial trajectory and temporal progression, improving discrimination and robustness to speed/length variation.
  4. Contextual LLM (n-gram fusion): Language priors are adaptively fused with path-matching for candidate ranking, modulated to emphasize priors during underspecified (short or noisy) swipes.
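Steps 3–4 can be sketched as follows. This is a minimal pure-Python illustration, not the paper's Gaze2Word implementation: the spatiotemporal weight `lam`, the fusion weights `alpha`/`beta`, and any key coordinates used with it are invented for illustration. Each point is an (x, y, progress) triple, with progress normalized to [0, 1].

```python
import math

def st_dist(p, q, lam=0.5):
    """Point cost combining spatial offset with normalized temporal progress."""
    return math.hypot(p[0] - q[0], p[1] - q[1]) + lam * abs(p[2] - q[2])

def dtw(trace, template, lam=0.5):
    """Classic O(n*m) dynamic-programming DTW over (x, y, progress) points."""
    n, m = len(trace), len(template)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = st_dist(trace[i - 1], template[j - 1], lam)
            D[i][j] = c + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m] / (n + m)  # length-normalized alignment cost

def rank(trace, templates, lm_logprob, alpha=1.0, beta=4.0):
    """Fuse path-match cost with a language-model prior (weights are guesses)."""
    scored = [(alpha * lm_logprob.get(w, -10.0) - beta * dtw(trace, t), w)
              for w, t in templates.items()]
    return [w for _, w in sorted(scored, reverse=True)]
```

Re-running `rank` on the partial trace while the pinch is still held is what would drive a mid-swipe candidate preview; raising `alpha` relative to `beta` mimics the paper's emphasis on language priors for short or noisy swipes.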

The combined approach enables real-time (<3 ms) prediction cycles on commodity hardware—a requirement for perceptually-synchronous mid-swipe feedback.
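The pruning that makes this latency budget plausible can be illustrated with a toy I-VT pass. This is a sketch only; the velocity threshold is arbitrary, and the simple run-averaging stands in for the paper's separate I-VT and DBSCAN stages.

```python
import math

def ivt_fixations(samples, v_thresh=30.0):
    """Toy I-VT: a sample whose point-to-point velocity falls below v_thresh
    is treated as part of a fixation; consecutive fixation samples are
    averaged into one center, and saccade samples are discarded."""
    fixations, cur = [], []
    for (x, y, t), (x2, y2, t2) in zip(samples, samples[1:]):
        v = math.hypot(x2 - x, y2 - y) / max(t2 - t, 1e-6)
        if v < v_thresh:
            cur.append((x, y))
        elif cur:  # saccade detected: close out the current fixation run
            fixations.append(tuple(sum(c) / len(cur) for c in zip(*cur)))
            cur = []
    if cur:
        fixations.append(tuple(sum(c) / len(cur) for c in zip(*cur)))
    return fixations
```

Even this crude version collapses a raw swipe of many samples into a handful of fixation centers before clustering and DTW, in the spirit of the reported reduction from ~320 points to ~15.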

Empirical Evaluation

User Study 1: Baseline Comparison

A five-session, within-subjects study (n=40) compared SwEYEpinch-Basic (no mid-swipe preview) to prevalent XR baselines: Finger-Tap and Gaze&Pinch (per-character, pinch-confirmed gaze taps).

  • Results:
    • Speed: By session 5, SwEYEpinch-Basic outperformed baselines (22.5 WPM vs. 18.0/10.0 WPM for Finger-Tap/Gaze&Pinch).
    • Error Rate: SwEYEpinch-Basic had a higher TER (5.2%) than the baselines (3.5%/2.9%), attributable to the lack of mid-swipe feedback.
    • User Preference: SwEYEpinch-Basic sits at the Pareto frontier of speed vs. preference, with lower workload than Gaze&Pinch.
  • This outcome validates the division of labor—using gaze for fast targeting and pinch for reliable commitment—while exposing the need for online candidate preview to mitigate error.
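For reference, the headline metrics above can be computed as follows. This sketch uses the conventional text-entry definitions (five characters per word for WPM; a Levenshtein-based uncorrected error rate), which may differ in detail from the TER the study reports.

```python
def wpm(transcribed: str, seconds: float) -> float:
    """Standard text-entry WPM: one word = five characters; the first
    character is untimed, hence len - 1."""
    return (len(transcribed) - 1) / seconds * 60.0 / 5.0

def levenshtein(a: str, b: str) -> int:
    """Edit distance via the usual two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def uncorrected_error_rate(presented: str, transcribed: str) -> float:
    """Fraction of the longer string that would need editing to match."""
    return levenshtein(presented, transcribed) / max(len(presented), len(transcribed))
```

For example, transcribing an 11-character phrase in 10 seconds yields 12.0 WPM under this definition.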

User Study 2: Mid-Swipe Prediction and Delimiter Analysis

A three-session study (n=21) compared SwEYEpinch (with mid-swipe prediction and in-gesture cancellation) against SwEYEpinch-Basic and XR-native gaze-only swipe methods (SkiMR, GlanceWriter XR):

  • Results:
    • Mid-swipe feedback: SwEYEpinch provides a statistically significant gain in WPM over SwEYEpinch-Basic in every session (e.g., 22.3 vs. 16.6 WPM by S3).
    • Delimiter type: Pinch-delimited swipes are consistently faster (and more preferred) than purely gaze-delimited methods, without an error penalty (TER: 6.5–8.1% for pinch, 16.7% for SkiMR).
    • Decoder Efficiency: SwEYEpinch's decoder achieves a higher top-1 candidate match rate than prefix-trie approaches (87.5% vs. 81.6%), enabling earlier commits and shorter traces.

User Study 3: Strong Baseline Benchmarking and Skill Transfer

A more robust, three-session study (n=41) pits SwEYEpinch against production-realistic baselines: Finger-Tap with word prediction/completion, and Hand-Swipe (controller-free, mid-air hand gesturing).

  • Results:
    • Peak Speed: SwEYEpinch achieves 26.2 WPM, significantly outpacing both baselines (Finger-Tap w/ Pred.: ~17 WPM; Hand-Swipe: ~20 WPM).
    • Learning Rate: SwEYEpinch exhibits the highest gains per session (5.1 WPM/session for novices), attributed to reduction in "verification/correction" overhead.
    • Preference/Workload: SwEYEpinch consistently resides on the speed/preference Pareto frontier, and is at least as low-effort as alternatives (NASA TLX).
    • Skill Transfer: Participants with prior SwEYEpinch experience exhibit an early advantage, suggesting strong transfer between variants with preserved "where/when" mapping.

User Study 4: Longitudinal Learning (30 sessions/7 days)

A week-long, 30-session evaluation (n=9) tracks acquisition and plateauing of SwEYEpinch skill.

  • Results:
    • Sustained Learning: All participants show continuous improvement throughout, with per-user learning rates from 0.77 to 1.65 WPM/session.
    • Attainable Speed: Five of nine users reach median speeds >54 WPM; three exceed 60 WPM, matching or exceeding typical desktop typing rates for non-developers.
    • Learning Curve: Experts refine and consolidate, while XR novices accelerate more steeply late in the week, indicating only moderate dependence on XR familiarity.

Mechanistic and Algorithmic Insights

The key mechanism underlying SwEYEpinch's performance is the combination of continuous, high-tolerance gaze pathing with explicit, low-friction manual delimiting. The introduction of mid-swipe candidate preview enables users to truncate swipe traces as soon as their intended word surfaces, rather than full-word tracing, driving efficiency (higher characters per swipe point, shorter average paths). Error handling is also more fluid, with instant cancellation reducing commit errors, and the deletion peek window minimizing gaze travel.

Algorithmically, the pruning of raw gaze data and introduction of temporal alignment in DTW are critical: they reduce latency to the real-time regime and raise decoding accuracy compared to direct applications of spatial-only DTW or Fréchet distances as seen in prior work.

Implications and Future Directions

Practical Implications

SwEYEpinch is, to the authors' knowledge, the first surface-free hybrid XR text entry system to reach speeds comparable to desktop keyboard typing for a substantial fraction of users, while requiring no external input devices beyond typical recent-generation HWD sensors (hand and eye tracking). It is silent, surface-free, and does not rely on voice—a major advantage in shared, mobile, or privacy-sensitive AR/VR scenarios.

Theoretical Implications

This work empirically establishes the benefit of decoupling spatial and temporal aspects of command specification in human–computer interaction, specifically in the context of high-dimensional, noisy input channels typical in XR. The results indicate that even a minimal explicit delimiter (small pinch) dramatically shifts the speed–effort–accuracy operating envelope for gaze-based systems.

Contradictory or Notable Claims

  • SwEYEpinch achieves speeds (up to 64.7 WPM) that overlap with ordinary users' desktop keyboard rates in a longitudinal setting, which many previous works considered infeasible for XR-native, surface-free text entry.
  • Accuracy does not reach full parity with prediction-enhanced finger-tap in some settings, yet error rates converge with sustained use, and the in-situ friction of correction remains minimal for SwEYEpinch.
  • Transfer of skill is contingent on the mapping of spatial (where) and temporal (when) cues remaining stable across techniques.

Limitations

Open questions remain regarding long-duration use (potential eye fatigue), inclusivity across the full spectrum of user physiologies, and integration with free-form composition. The codebase and a comprehensive, user-level gaze-swipe XR dataset are released, facilitating community benchmarking and future work.

Conclusion

SwEYEpinch demonstrates that a hybrid gaze–hand interface, partitioning rapid target selection ("where") from explicit commitment ("when") and complemented by real-time mid-swipe feedback and low-effort error recovery, can fundamentally alter the landscape of XR text entry. The paradigm supports high-speed, low-fatigue, and learnable operation without additional hardware. Empirical evidence across four user studies shows consistently superior speed–effort trade-offs, sustained learning curves paralleling real-world typing, and a strong user preference profile. The design pattern—hybridization, live candidate presentation, optimized decoding, and minimal explicit gestures—should inform future XR user interface research and deployment.

Reference:

"SwEYEpinch: Exploring Intuitive, Efficient Text Entry for Extended Reality via Eye and Hand Tracking" (2604.03520)
