
Gaze-Only Student Model for Skill Assessment

Updated 27 November 2025
  • The gaze-only student model is a neural architecture that predicts human skill solely from gaze data, as demonstrated by the SkillSight-S framework.
  • It integrates a token-based transformer encoder with specialized tokens for classification, action recognition, and knowledge distillation to capture spatiotemporal gaze features.
  • The framework achieves a 73× energy reduction compared to video-based methods while maintaining competitive accuracy across diverse tasks.

A gaze-only student model denotes a neural architecture designed to predict human skill level from gaze data alone, as exemplified by the SkillSight-S framework developed for energy-efficient, first-person skill assessment. This approach leverages rich spatiotemporal representations of gaze signals to infer skill and action context, enabling real-time deployment on wearable devices while drastically reducing computational and power demands relative to video-centric models (Wu et al., 24 Nov 2025).

1. Input Representation and Normalization

SkillSight-S processes raw gaze signals $G_i$ extracted from egocentric recordings captured with smart glasses. For each clip of $T = 16$ frames sampled at 2 FPS (approximately 8 seconds), the gaze input comprises:

  • 3D fixation point: Intersection of left and right eye rays in world coordinates.
  • 3D gaze direction: Unit vector in camera-centric coordinates.
  • 2D gaze projection $g_{2d} \in [0,1]^2$: Normalized projection onto the egocentric RGB image plane.
  • Gaze depth $d$: Euclidean distance from head to fixation.
  • Glasses pose: Quaternion (rotation) and $(x, y, z)$ translation, representing device orientation.

Normalization proceeds by subtracting the mean 3D fixation and rotating horizontally such that the initial gaze ray yields zero yaw. All vectors, translations, and rotations are expressed relative to the first frame, spatial projections are scaled to $[0,1]$, and depth is standardized in meters. The normalized input at each time step $t$ is encoded as a $D_g \approx 16$-dimensional feature vector.
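
As a concrete illustration, the sketch below assembles the per-frame feature vector and applies the described normalization, assuming NumPy arrays for the raw streams. The function and argument names are hypothetical, and re-expressing the quaternions relative to the first frame is omitted for brevity; this is not the authors' implementation.

```python
import numpy as np

def yaw_rotation(yaw):
    """Rotation about the vertical axis that removes the given yaw angle."""
    c, s = np.cos(-yaw), np.sin(-yaw)
    return np.array([[c, 0.0, s],
                     [0.0, 1.0, 0.0],
                     [-s, 0.0, c]])

def build_gaze_features(fix3d, gaze_dir, gaze_2d, depth, quat, trans):
    """Return a (T, 16) feature matrix for one T-frame clip.

    fix3d:    (T, 3) 3D fixation points in world coordinates
    gaze_dir: (T, 3) unit gaze directions (camera-centric)
    gaze_2d:  (T, 2) projections onto the RGB image, scaled to [0, 1]
    depth:    (T, 1) head-to-fixation distance in meters
    quat:     (T, 4) glasses orientation quaternions
    trans:    (T, 3) glasses translations
    """
    # Subtract the mean 3D fixation over the clip.
    fix3d = fix3d - fix3d.mean(axis=0, keepdims=True)

    # Rotate horizontally so the first frame's gaze ray has zero yaw.
    R = yaw_rotation(np.arctan2(gaze_dir[0, 0], gaze_dir[0, 2]))
    fix3d, gaze_dir = fix3d @ R.T, gaze_dir @ R.T

    # Express translations relative to the first frame.
    trans = (trans - trans[0]) @ R.T

    # Concatenate into the ~16-dimensional per-frame vector (3+3+2+1+4+3).
    return np.concatenate([fix3d, gaze_dir, gaze_2d, depth, quat, trans], axis=1)
```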

2. Transformer-Based Student Architecture

SkillSight-S utilizes a token-based transformer encoder $f_s$ that processes both temporal gaze vectors and auxiliary tokens. The input sequence consists of three special tokens and the $T$ normalized gaze vectors:

  • $t_{cls} \in \mathbb{R}^d$: Skill-level classification token.
  • $t_{dis} \in \mathbb{R}^d$: Distillation token for knowledge transfer.
  • $t_{act} \in \mathbb{R}^d$: Subtask/action recognition token.
  • $G = [g^1, g^2, \ldots, g^T]$: Sequential gaze vectors.

Each gaze vector $g^t \in \mathbb{R}^{D_g}$ is embedded via a learned linear projection into the model latent space $\mathbb{R}^d$ with $d = 768$. The sequence length is $T+3$ tokens. The transformer network comprises $L = 4$ layers of multi-head self-attention and feedforward MLP blocks with GELU activations. The output heads deliver the following (see the sketch after this list):

  • $\hat{S} = \text{classifier}(t_{cls})$: Predicted skill label.
  • $\hat{a} = \text{classifier}(t_{act})$: Predicted subtask/action.
  • $\hat{e}_s = \text{projection}(t_{dis})$: Student feature used for feature matching.
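
A minimal PyTorch sketch of such a student encoder is given below. The hyperparameters follow the paper ($d = 768$, $L = 4$ layers, 12 heads), but the head output sizes, the absence of positional embeddings, and all layer names are simplifying assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class GazeStudent(nn.Module):
    """Token-based transformer over gaze vectors (simplified sketch)."""

    def __init__(self, d_gaze=16, d_model=768, n_layers=4, n_heads=12,
                 n_skill=3, n_action=20):   # n_skill / n_action are placeholders
        super().__init__()
        self.embed = nn.Linear(d_gaze, d_model)              # per-frame gaze embedding
        self.tokens = nn.Parameter(torch.zeros(3, d_model))  # [t_cls, t_dis, t_act]
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model,
            activation="gelu", batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.skill_head = nn.Linear(d_model, n_skill)        # S_hat from t_cls
        self.action_head = nn.Linear(d_model, n_action)      # a_hat from t_act
        self.distill_proj = nn.Linear(d_model, d_model)      # e_s_hat from t_dis

    def forward(self, gaze):                                  # gaze: (B, T, 16)
        tok = self.tokens.unsqueeze(0).expand(gaze.size(0), -1, -1)
        h = self.encoder(torch.cat([tok, self.embed(gaze)], dim=1))  # (B, T+3, d)
        t_cls, t_dis, t_act = h[:, 0], h[:, 1], h[:, 2]
        return self.skill_head(t_cls), self.action_head(t_act), self.distill_proj(t_dis)
```

For example, `GazeStudent()(torch.randn(2, 16, 16))` returns skill logits, action logits, and the distillation feature for a batch of two clips.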

3. Knowledge Distillation Strategy

SkillSight-S is trained with knowledge distillation, wherein the student absorbs representations from a frozen multimodal teacher that is stronger at skill prediction. The teacher exposes a feature $e_T = [e_v; e_c; e_g] \in \mathbb{R}^{3 \times 768}$, combining video+gaze ($e_v$), crop-sequence ($e_c$), and gaze-trajectory ($e_g$) embeddings. Dedicated projections $f_t$ (teacher) and $f_p$ (student) align the respective embeddings.

The loss function comprises three terms:

  • Skill classification (cross-entropy): $L_{CE} = -\sum_{k=1}^{K} y_k \log p_k(\hat{S})$.
  • Action/subtask classification (cross-entropy): $L_{act} = -\sum_{j=1}^{J} a_j \log q_j(\hat{a})$.
  • Distillation (L1 feature matching): $L_{distill} = \| f_p(\hat{e}_s) - f_t(e_T) \|_1$.

The overall objective is $L_{student} = L_{CE} + \lambda_{act} L_{act} + \lambda_{dis} L_{distill}$, with weights $\lambda_{act}$ and $\lambda_{dis}$ determined on a held-out validation set. No additional weight decay or temperature scaling is reported beyond standard AdamW regularization.
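
Under the stated definitions, the combined objective can be written as in the sketch below; the default weights and the flattening of the teacher feature $e_T$ before projection are assumptions for illustration.

```python
import torch.nn.functional as F

def student_loss(skill_logits, action_logits, e_s, e_T,
                 skill_labels, action_labels, f_p, f_t,
                 lambda_act=1.0, lambda_dis=1.0):   # weights tuned on validation data
    """L_student = L_CE + lambda_act * L_act + lambda_dis * L_distill."""
    l_ce = F.cross_entropy(skill_logits, skill_labels)     # skill classification
    l_act = F.cross_entropy(action_logits, action_labels)  # subtask classification
    # e_T is produced by the frozen teacher; f_p / f_t are the alignment projections.
    l_dis = F.l1_loss(f_p(e_s), f_t(e_T.flatten(1)))
    return l_ce + lambda_act * l_act + lambda_dis * l_dis
```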

4. Training Protocol and Hyperparameters

Relevant hyperparameters are summarized below:

| Component | Value | Details |
|---|---|---|
| Sequence length | $T = 16$ (2 FPS) | ≈8 seconds per clip |
| Hidden size ($d$) | 768 | Per TimeSformer/ViT |
| Transformer layers | $L = 4$ | 12 attention heads |
| Teacher optimizer | SGD, lr $= 5 \times 10^{-3}$ | batch 8, 15 epochs |
| Student optimizer | AdamW, lr $= 1 \times 10^{-4}$ | batch 32, 10 epochs |
| Pretrained components | TimeSformer (EgoVLPv2), DINOv2 | $f_V$, $f_I$, $f_T$, $f_m$, $f_s$, $f_g$ |

Auxiliary loss $L_{act}$ ensures the student learns action context via the subtask token. The transformer configuration inherits design choices from the TimeSformer and ViT architectures.
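
A student-side training setup matching the tabulated hyperparameters might look as follows; the data loader, label tensors, projection dimensions, and the decision to train $f_p$/$f_t$ jointly with the student are assumptions, and the loop is only indicative.

```python
import torch
import torch.nn as nn

# Student-side setup per the table above (AdamW, lr 1e-4, batch 32, 10 epochs).
student = GazeStudent()                         # sketch from Section 2
f_p = nn.Linear(768, 768)                       # student-side alignment projection (assumed shape)
f_t = nn.Linear(3 * 768, 768)                   # teacher-side alignment projection (assumed shape)
params = list(student.parameters()) + list(f_p.parameters()) + list(f_t.parameters())
optimizer = torch.optim.AdamW(params, lr=1e-4)

for epoch in range(10):
    for gaze, skill_y, action_y, e_T in train_loader:   # assumed loader of 16-frame gaze clips
        skill_logits, action_logits, e_s = student(gaze)
        loss = student_loss(skill_logits, action_logits, e_s, e_T,
                            skill_y, action_y, f_p, f_t)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```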

5. Quantitative Evaluation and Ablation Analysis

SkillSight-S is benchmarked on Ego-Exo4D (covering cooking, music, soccer, basketball, bouldering) and MS Badminton datasets. Accuracy metrics for skill classification are as follows (SkillSight-T: teacher, SkillSight-S: student):

| Task | SkillSight-T | SkillSight-S |
|---|---|---|
| Cooking | 58.5% | 47.2% |
| Music | 50.0% | 52.8% |
| Sports (avg) | 55.2% | 51.9% |
| Soccer (expert-novice) | 73.3% | 72.6% |

Ablation results from supplementary Table 5 show incremental improvements:

| Configuration | Accuracy | Change |
|---|---|---|
| Base gaze-only (no distill/action) | 37.0% | baseline |
| + Distillation only | 40.0% | +3.0 pts |
| + Action recognition only | 40.7% | +3.7 pts |
| + Both (full model) | 44.4% | +7.4 pts |

Both distillation token and auxiliary action classification contribute materially to final skill-level accuracy.

6. Power Consumption and Efficiency

SkillSight-S delivers an approximately $73\times$ reduction in energy consumption relative to full video-based models, as validated with a PyTorch-based FLOP and memory profiler:

$P = \alpha \frac{N}{\Delta t} + \beta \frac{B}{\Delta t} + \sum_m \gamma_m \delta_m$

Where:

  • $N$ = MACs (multiply-accumulate operations) per inference pass (1 MAC = 2 FLOPs)
  • $B$ = bytes transferred
  • $\gamma_m$ = sensor power (35 mW RGB camera, 7.8 mW eye tracker)
  • $\alpha = 4.6$ pJ/MAC, $\beta = 80$ pJ/byte, $\Delta t = 0.5$ seconds (per clip)

Empirical measurements yield $P \approx 697.5$ mW for TimeSformer (video-only) versus 9.5 mW for SkillSight-S, i.e., a $73\times$ energy reduction via gaze-only inference.
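
A minimal sketch of the power model follows, using the reported constants and interpreting $\delta_m$ as a sensor duty cycle; the MAC and byte counts in the example call are hypothetical stand-ins for values the paper obtains from its profiler.

```python
ALPHA = 4.6e-12                     # joules per MAC
BETA = 80e-12                       # joules per byte transferred
DT = 0.5                            # seconds per inference (per clip)
SENSOR_POWER_W = {"rgb_camera": 35e-3, "eye_tracker": 7.8e-3}

def inference_power(macs, bytes_moved, duty_cycles):
    """Average power in watts: alpha*N/dt + beta*B/dt + sum_m gamma_m * delta_m."""
    compute = ALPHA * macs / DT
    memory = BETA * bytes_moved / DT
    sensing = sum(SENSOR_POWER_W[m] * d for m, d in duty_cycles.items())
    return compute + memory + sensing

# Hypothetical gaze-only budget: only the eye tracker is active.
print(inference_power(macs=1e8, bytes_moved=5e6, duty_cycles={"eye_tracker": 1.0}))
```

With small MAC and byte counts like these, the compute and memory terms fall to the milliwatt range and the eye tracker's 7.8 mW draw dominates the budget, which is the qualitative point of gaze-only inference.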

7. Implications and Applications

SkillSight-S realizes real-time skill assessment solely from gaze captured by wearable smart glasses, circumventing energy- and computation-intensive video processing. The framework establishes, for the first time, that gaze trajectories encode sufficient information for accurate skill and subtask prediction across domains such as cooking, music, and sports. A plausible implication is that scalable, unobtrusive AI-assisted feedback for skill acquisition becomes feasible in naturalistic settings. Hybrid training with a joint gaze-video teacher and distillation promotes transfer and generalization to new tasks without violating the power constraints of wearable hardware (Wu et al., 24 Nov 2025).
