Gaze-Only Student Model for Skill Assessment
- The gaze-only student model is a neural architecture that predicts human skill solely from gaze data, as demonstrated by the SkillSight-S framework.
- It integrates a token-based transformer encoder with specialized tokens for classification, action recognition, and knowledge distillation to capture spatiotemporal gaze features.
- The framework achieves a 73× energy reduction compared to video-based methods while maintaining competitive accuracy across diverse tasks.
A gaze-only student model denotes a neural architecture designed to predict human skill level from gaze data alone, as exemplified by the SkillSight-S framework developed for energy-efficient, first-person skill assessment. This approach leverages rich spatiotemporal representations of gaze signals to infer skill and action context, enabling real-time deployment on wearable devices while drastically reducing computational and power demands relative to video-centric models (Wu et al., 24 Nov 2025).
1. Input Representation and Normalization
SkillSight-S processes raw gaze signals extracted from egocentric recordings captured with smart glasses. For each clip, sampled at $2$ FPS over approximately 8 seconds, the per-frame gaze input comprises:
- 3D fixation point: Intersection of left and right eye rays in world coordinates.
- 3D gaze direction: Unit vector in camera-centric coordinates.
- 2D gaze projection: Normalized projection onto the egocentric RGB image plane.
- Gaze depth: Euclidean distance from the head to the fixation point.
- Glasses pose: Quaternion (rotation) and translation, representing device orientation.
Normalization proceeds by subtracting the mean 3D fixation and rotating horizontally so that the initial gaze ray has zero yaw. All vectors, translations, and rotations are expressed relative to the first frame, spatial projections are scaled to a normalized image range, and depth is standardized in meters. The normalized input at each time step is encoded as a fixed-dimensional feature vector.
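A minimal sketch of this normalization step, assuming NumPy arrays of shape (T, ·) per clip; the helper `yaw_rotation`, the axis convention (y-up, z-forward), and the concatenation order are illustrative assumptions, and quaternion handling for the glasses rotation is omitted for brevity.

```python
import numpy as np

def yaw_rotation(direction):
    # Rotation about the vertical axis that zeroes the yaw of `direction`
    # (hypothetical helper; assumes a y-up, z-forward convention).
    yaw = np.arctan2(direction[0], direction[2])
    c, s = np.cos(-yaw), np.sin(-yaw)
    return np.array([[c, 0.0, s],
                     [0.0, 1.0, 0.0],
                     [-s, 0.0, c]])

def normalize_clip(fix3d, gaze_dir, proj2d, depth, trans):
    """Normalize one clip of gaze features (T frames): subtract the mean 3D fixation,
    rotate so the first gaze ray has zero yaw, express translations relative to the
    first frame, and keep depth in meters."""
    R = yaw_rotation(gaze_dir[0])
    fix3d = (fix3d - fix3d.mean(axis=0)) @ R.T      # centered, yaw-aligned 3D fixations
    gaze_dir = gaze_dir @ R.T                       # yaw-aligned gaze directions
    trans = (trans - trans[0]) @ R.T                # glasses translation relative to frame 0
    feats = np.concatenate([fix3d, gaze_dir, proj2d, depth[:, None], trans], axis=1)
    return feats.astype(np.float32)                 # (T, D) per-frame feature vectors
```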
2. Transformer-Based Student Architecture
SkillSight-S utilizes a token-based transformer encoder that processes both temporal gaze vectors and auxiliary tokens. The input sequence consists of three special tokens and the normalized gaze vectors:
- Skill token: used for skill-level classification.
- Distillation token: used for knowledge transfer from the teacher.
- Subtask token: used for action recognition.
- Gaze tokens: the sequence of normalized gaze vectors.
Each gaze vector is embedded via a learned linear projection into the model latent space of dimension $768$. The sequence length equals the number of gaze frames plus the three special tokens. The transformer encoder stacks multi-head self-attention and feedforward MLP blocks with GELU activations. Output heads deliver:
- Skill head: predicted skill label (from the skill token).
- Action head: predicted subtask/action (from the subtask token).
- Distillation head: student feature used for feature matching against the teacher (from the distillation token).
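A minimal PyTorch sketch of the token-based student described above. The hidden size of 768 and the 12 attention heads come from the hyperparameter table; the layer count, module names, and the omission of positional encodings are simplifying assumptions.

```python
import torch
import torch.nn as nn

class GazeOnlyStudent(nn.Module):
    """Sketch of a SkillSight-S-style student: three learned special tokens,
    a linear gaze embedding, a transformer encoder, and three output heads."""
    def __init__(self, gaze_dim, n_skills, n_actions, d_model=768,
                 n_layers=12, n_heads=12, teacher_dim=768):
        super().__init__()
        self.embed = nn.Linear(gaze_dim, d_model)                  # per-frame gaze embedding
        self.special = nn.Parameter(torch.randn(3, d_model) * 0.02)  # [skill, distill, action] tokens
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=4 * d_model,
                                           activation="gelu", batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.skill_head = nn.Linear(d_model, n_skills)             # skill classification
        self.action_head = nn.Linear(d_model, n_actions)           # subtask/action classification
        self.distill_head = nn.Linear(d_model, teacher_dim)        # projection for feature matching

    def forward(self, gaze):                                       # gaze: (B, T, gaze_dim)
        B = gaze.size(0)
        tokens = self.special.unsqueeze(0).expand(B, -1, -1)       # (B, 3, d_model)
        x = torch.cat([tokens, self.embed(gaze)], dim=1)           # prepend special tokens
        x = self.encoder(x)
        return (self.skill_head(x[:, 0]),                          # skill logits
                self.action_head(x[:, 2]),                         # action logits
                self.distill_head(x[:, 1]))                        # student feature for distillation
```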
3. Knowledge Distillation Strategy
SkillSight-S is trained using knowledge distillation, wherein the student model absorbs representations from a frozen, multimodal teacher that is stronger at skill prediction. The teacher exposes a fused feature combining video+gaze, crop-sequence, and gaze-trajectory representations; dedicated linear projections on the teacher and student sides align the respective embeddings.
The loss function comprises three terms:
- Skill classification (cross-entropy): $\mathcal{L}_{\text{skill}} = \mathrm{CE}(\hat{y}_{\text{skill}}, y_{\text{skill}})$.
- Action/subtask classification (cross-entropy): $\mathcal{L}_{\text{action}} = \mathrm{CE}(\hat{y}_{\text{action}}, y_{\text{action}})$.
- Distillation (L1 feature matching): $\mathcal{L}_{\text{distill}} = \lVert z_s - z_t \rVert_1$, where $z_s$ and $z_t$ are the projected student and teacher features.
The overall objective is the weighted sum $\mathcal{L} = \lambda_{\text{skill}}\mathcal{L}_{\text{skill}} + \lambda_{\text{action}}\mathcal{L}_{\text{action}} + \lambda_{\text{distill}}\mathcal{L}_{\text{distill}}$, with weights determined on a held-out validation set. No additional weight decay or temperature scaling is reported beyond standard AdamW regularization.
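A sketch of how the three-term objective could be assembled in PyTorch; the loss-weight arguments are placeholders, since the paper tunes these on a validation set.

```python
import torch
import torch.nn.functional as F

def skillsight_s_loss(skill_logits, action_logits, z_student,
                      skill_labels, action_labels, z_teacher,
                      w_skill=1.0, w_action=1.0, w_distill=1.0):
    """Weighted sum of skill CE, auxiliary action CE, and L1 feature-matching distillation.
    The weights are hypothetical placeholders, not the paper's tuned values."""
    l_skill = F.cross_entropy(skill_logits, skill_labels)
    l_action = F.cross_entropy(action_logits, action_labels)
    l_distill = F.l1_loss(z_student, z_teacher.detach())  # teacher is frozen
    return w_skill * l_skill + w_action * l_action + w_distill * l_distill
```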
4. Training Protocol and Hyperparameters
Relevant hyperparameters are summarized below:
| Component | Value | Details |
|---|---|---|
| Clip sampling | $2$ FPS | ≈8 seconds per clip |
| Hidden size | $768$ | per TimeSformer/ViT |
| Attention heads | 12 | multi-head self-attention |
| Teacher optimizer | SGD | batch = 8, 15 epochs |
| Student optimizer | AdamW | batch = 32, 10 epochs |
| Pretrained components | TimeSformer (EgoVLPv2), DINOv2 | f_V, f_I, f_T, f_m, f_s, f_g |
The auxiliary action-classification loss ensures the student learns action context via the subtask token. The transformer configuration inherits design choices from the TimeSformer and ViT architectures.
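A compact sketch of the student optimization loop under the tabled hyperparameters (AdamW, batch size 32, 10 epochs); the learning rate, the dataset interface, the unit loss weights, and the assumption of precomputed teacher features are all illustrative choices rather than reported settings.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader

def train_student(student, train_set, epochs=10, batch_size=32, lr=1e-4):
    # AdamW, batch 32, 10 epochs per the table; lr=1e-4 is a placeholder assumption.
    loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
    opt = torch.optim.AdamW(student.parameters(), lr=lr)
    student.train()
    for _ in range(epochs):
        for gaze, y_skill, y_action, z_teacher in loader:
            skill_logits, action_logits, z_student = student(gaze)
            loss = (F.cross_entropy(skill_logits, y_skill)       # skill CE
                    + F.cross_entropy(action_logits, y_action)   # auxiliary action CE
                    + F.l1_loss(z_student, z_teacher))           # distillation (unit weights)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return student
```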
5. Quantitative Evaluation and Ablation Analysis
SkillSight-S is benchmarked on Ego-Exo4D (covering cooking, music, soccer, basketball, bouldering) and MS Badminton datasets. Accuracy metrics for skill classification are as follows (SkillSight-T: teacher, SkillSight-S: student):
| Task | SkillSight-T | SkillSight-S |
|---|---|---|
| Cooking | 58.5% | 47.2% |
| Music | 50.0% | 52.8% |
| Sports (avg) | 55.2% | 51.9% |
| Soccer (expert-novice) | 73.3% | 72.6% |
Ablation results (supplementary Table 5) show incremental improvements:
| Configuration | Accuracy | Change |
|---|---|---|
| Base gaze-only (no distill/action) | 37.0% | — |
| + Distillation only | 40.0% | +3.0 pts |
| + Action recognition only | 40.7% | +3.7 pts |
| + Both (full model) | 44.4% | +7.4 pts |
Both the distillation token and the auxiliary action classification contribute materially to final skill-level accuracy.
6. Power Consumption and Efficiency
SkillSight-S furnishes an approximately 73× reduction in energy consumption relative to full video-based models, as validated via a PyTorch-based FLOP and memory profiler using the energy model

$$E = N_{\text{MAC}} \, e_{\text{MAC}} + N_{\text{byte}} \, e_{\text{byte}} + P_{\text{sensor}} \, T$$

Where:
- $N_{\text{MAC}}$ = MACs (multiply-accumulate operations) per inference pass ($1$ MAC counted as $2$ FLOPs)
- $N_{\text{byte}}$ = bytes transferred
- $P_{\text{sensor}}$ = sensor power ($35$ mW RGB camera, $7.8$ mW eye tracker)
- $e_{\text{MAC}}$ (pJ/MAC) and $e_{\text{byte}}$ (pJ/byte) = per-operation energy costs; $T$ = clip duration in seconds
Empirical measurements yield a far higher power draw for the video-only TimeSformer versus $9.5$ mW for SkillSight-S, corresponding to roughly a 73× energy reduction via gaze-only inference.
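A short sketch of the energy model above; the per-MAC and per-byte energy constants and the operation counts in the example are made-up placeholder values, since the text does not recover the calibrated numbers.

```python
def inference_energy_mj(n_mac, n_byte, sensor_power_mw, clip_seconds,
                        e_mac_pj=1.0, e_byte_pj=10.0):
    """Energy per clip in millijoules: compute (MACs) + memory traffic (bytes) + sensor draw.
    e_mac_pj and e_byte_pj are placeholder pJ costs, not the paper's calibrated values."""
    compute_mj = n_mac * e_mac_pj * 1e-9           # pJ -> mJ
    memory_mj = n_byte * e_byte_pj * 1e-9          # pJ -> mJ
    sensor_mj = sensor_power_mw * clip_seconds     # mW * s = mJ
    return compute_mj + memory_mj + sensor_mj

# Illustrative comparison only: a gaze-only student (7.8 mW eye tracker) vs. a
# video model (35 mW RGB camera) over an 8-second clip; operation counts are invented.
gaze_mj = inference_energy_mj(n_mac=5e7, n_byte=2e6, sensor_power_mw=7.8, clip_seconds=8.0)
video_mj = inference_energy_mj(n_mac=2e11, n_byte=5e8, sensor_power_mw=35.0, clip_seconds=8.0)
print(f"gaze-only: {gaze_mj:.1f} mJ, video: {video_mj:.1f} mJ")
```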
7. Implications and Applications
SkillSight-S realizes real-time skill assessment solely from gaze captured by wearable smart glasses, circumventing energy- and computation-intensive video processing. This framework establishes, for the first time, that gaze trajectory encodes sufficient information for accurate skill and subtask prediction across domains such as cooking, music, and sports. A plausible implication is the enabling of scalable, unobtrusive AI-assisted feedback for skill acquisition in naturalistic settings. The hybrid training with joint gaze-video teachers and distillation promotes transfer and generalization to new tasks without compromising power constraints (Wu et al., 24 Nov 2025).