
Gaze-Only Student Model for Skill Assessment

Updated 27 November 2025
  • The gaze-only student model is a neural architecture that predicts human skill solely from gaze data, as demonstrated by the SkillSight-S framework.
  • It integrates a token-based transformer encoder with specialized tokens for classification, action recognition, and knowledge distillation to capture spatiotemporal gaze features.
  • The framework achieves a 73× energy reduction compared to video-based methods while maintaining competitive accuracy across diverse tasks.

A gaze-only student model denotes a neural architecture designed to predict human skill level from gaze data alone, as exemplified by the SkillSight-S framework developed for energy-efficient, first-person skill assessment. This approach leverages rich spatiotemporal representations of gaze signals to infer skill and action context, enabling real-time deployment on wearable devices while drastically reducing computational and power demands relative to video-centric models (Wu et al., 24 Nov 2025).

1. Input Representation and Normalization

SkillSight-S processes raw gaze signals $G_i$ extracted from egocentric recordings captured with smart glasses. For each clip of $T = 16$ frames sampled at 2 FPS (approximately 8 seconds), the gaze input comprises:

  • 3D fixation point: Intersection of left and right eye rays in world coordinates.
  • 3D gaze direction: Unit vector in camera-centric coordinates.
  • 2D gaze projection $g_{2d} \in [0,1]^2$: Normalized projection onto the egocentric RGB image plane.
  • Gaze depth $d$: Euclidean distance from head to fixation.
  • Glasses pose: Quaternion (rotation) and $(x, y, z)$ translation, representing device orientation.

Normalization proceeds by subtracting the mean 3D fixation and rotating horizontally such that the initial gaze ray yields zero yaw. All vectors, translations, and rotations are expressed relative to the first frame, spatial projections are scaled to $[0,1]$, and depth is standardized in meters. The normalized input at each time step $t$ is encoded as a $D_g \approx 16$-dimensional feature vector.
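
As a concrete illustration, the sketch below assembles the per-frame feature vector and applies the described normalization, assuming NumPy arrays for the raw streams. The function and argument names are hypothetical, and re-expressing the quaternions relative to the first frame is omitted for brevity; this is not the authors' implementation.

```python
import numpy as np

def yaw_rotation(yaw):
    """Rotation about the vertical axis that removes the given yaw angle."""
    c, s = np.cos(-yaw), np.sin(-yaw)
    return np.array([[c, 0.0, s],
                     [0.0, 1.0, 0.0],
                     [-s, 0.0, c]])

def build_gaze_features(fix3d, gaze_dir, gaze_2d, depth, quat, trans):
    """Return a (T, 16) feature matrix for one T-frame clip.

    fix3d:    (T, 3) 3D fixation points in world coordinates
    gaze_dir: (T, 3) unit gaze directions (camera-centric)
    gaze_2d:  (T, 2) projections onto the RGB image, scaled to [0, 1]
    depth:    (T, 1) head-to-fixation distance in meters
    quat:     (T, 4) glasses orientation quaternions
    trans:    (T, 3) glasses translations
    """
    # Subtract the mean 3D fixation over the clip.
    fix3d = fix3d - fix3d.mean(axis=0, keepdims=True)

    # Rotate horizontally so the first frame's gaze ray has zero yaw.
    R = yaw_rotation(np.arctan2(gaze_dir[0, 0], gaze_dir[0, 2]))
    fix3d, gaze_dir = fix3d @ R.T, gaze_dir @ R.T

    # Express translations relative to the first frame.
    trans = (trans - trans[0]) @ R.T

    # Concatenate into the ~16-dimensional per-frame vector (3+3+2+1+4+3).
    return np.concatenate([fix3d, gaze_dir, gaze_2d, depth, quat, trans], axis=1)
```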

2. Transformer-Based Student Architecture

SkillSight-S utilizes a token-based transformer encoder $f_s$ that processes both temporal gaze vectors and auxiliary tokens. The input sequence consists of three special tokens and the $T$ normalized gaze vectors:

  • $t_{cls} \in \mathbb{R}^d$: Skill-level classification token.
  • $t_{dis} \in \mathbb{R}^d$: Distillation token for knowledge transfer.
  • $t_{act} \in \mathbb{R}^d$: Subtask/action recognition token.
  • $G = [g^1, g^2, \ldots, g^T]$: Sequential gaze vectors.

Each gaze vector $g^t \in \mathbb{R}^{D_g}$ is embedded via a learned linear projection into the model latent space $\mathbb{R}^d$ with $d = 768$. The sequence length is $T+3$ tokens. The transformer network comprises $L = 4$ layers of multi-head self-attention and feedforward MLP blocks with GELU activations. The output heads deliver the following (see the sketch after this list):

  • $\hat{S} = \text{classifier}(t_{cls})$: Predicted skill label.
  • $\hat{a} = \text{classifier}(t_{act})$: Predicted subtask/action.
  • $\hat{e}_s = \text{projection}(t_{dis})$: Student feature used for feature matching.
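
A minimal PyTorch sketch of such a student encoder is given below. The hyperparameters follow the paper ($d = 768$, $L = 4$ layers, 12 heads), but the head output sizes, the absence of positional embeddings, and all layer names are simplifying assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class GazeStudent(nn.Module):
    """Token-based transformer over gaze vectors (simplified sketch)."""

    def __init__(self, d_gaze=16, d_model=768, n_layers=4, n_heads=12,
                 n_skill=3, n_action=20):   # n_skill / n_action are placeholders
        super().__init__()
        self.embed = nn.Linear(d_gaze, d_model)              # per-frame gaze embedding
        self.tokens = nn.Parameter(torch.zeros(3, d_model))  # [t_cls, t_dis, t_act]
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model,
            activation="gelu", batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.skill_head = nn.Linear(d_model, n_skill)        # S_hat from t_cls
        self.action_head = nn.Linear(d_model, n_action)      # a_hat from t_act
        self.distill_proj = nn.Linear(d_model, d_model)      # e_s_hat from t_dis

    def forward(self, gaze):                                  # gaze: (B, T, 16)
        tok = self.tokens.unsqueeze(0).expand(gaze.size(0), -1, -1)
        h = self.encoder(torch.cat([tok, self.embed(gaze)], dim=1))  # (B, T+3, d)
        t_cls, t_dis, t_act = h[:, 0], h[:, 1], h[:, 2]
        return self.skill_head(t_cls), self.action_head(t_act), self.distill_proj(t_dis)
```

For example, `GazeStudent()(torch.randn(2, 16, 16))` returns skill logits, action logits, and the distillation feature for a batch of two clips.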

3. Knowledge Distillation Strategy

SkillSight-S is trained with knowledge distillation, wherein the student absorbs representations from a frozen multimodal teacher that is stronger at skill prediction. The teacher exposes a feature $e_T = [e_v; e_c; e_g] \in \mathbb{R}^{3 \times 768}$, combining video+gaze ($e_v$), crop-sequence ($e_c$), and gaze-trajectory ($e_g$) embeddings. Dedicated projections $f_t$ (teacher) and $f_p$ (student) align the respective embeddings.

The loss function comprises three terms:

  • Skill classification (cross-entropy): $L_{CE} = -\sum_{k=1}^{K} y_k \log p_k(\hat{S})$.
  • Action/subtask classification (cross-entropy): $L_{act} = -\sum_{j=1}^{J} a_j \log q_j(\hat{a})$.
  • Distillation (L1 feature matching): $L_{distill} = \| f_p(\hat{e}_s) - f_t(e_T) \|_1$.

The overall objective is $L_{student} = L_{CE} + \lambda_{act} L_{act} + \lambda_{dis} L_{distill}$, with weights $\lambda_{act}$ and $\lambda_{dis}$ determined on a held-out validation set. No additional weight decay or temperature scaling is reported beyond standard AdamW regularization.
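
Under the stated definitions, the combined objective can be written as in the sketch below; the default weights and the flattening of the teacher feature $e_T$ before projection are assumptions for illustration.

```python
import torch.nn.functional as F

def student_loss(skill_logits, action_logits, e_s, e_T,
                 skill_labels, action_labels, f_p, f_t,
                 lambda_act=1.0, lambda_dis=1.0):   # weights tuned on validation data
    """L_student = L_CE + lambda_act * L_act + lambda_dis * L_distill."""
    l_ce = F.cross_entropy(skill_logits, skill_labels)     # skill classification
    l_act = F.cross_entropy(action_logits, action_labels)  # subtask classification
    # e_T is produced by the frozen teacher; f_p / f_t are the alignment projections.
    l_dis = F.l1_loss(f_p(e_s), f_t(e_T.flatten(1)))
    return l_ce + lambda_act * l_act + lambda_dis * l_dis
```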

4. Training Protocol and Hyperparameters

Relevant hyperparameters are summarized below:

| Component | Value | Details |
|---|---|---|
| Sequence length | $T = 16$ (2 FPS) | ≈8 seconds per clip |
| Hidden size ($d$) | 768 | Per TimeSformer/ViT |
| Transformer layers | $L = 4$ | 12 attention heads |
| Teacher optimizer | SGD, lr $= 5 \times 10^{-3}$ | batch 8, 15 epochs |
| Student optimizer | AdamW, lr $= 1 \times 10^{-4}$ | batch 32, 10 epochs |
| Pretrained components | TimeSformer (EgoVLPv2), DINOv2 | $f_V$, $f_I$, $f_T$, $f_m$, $f_s$, $f_g$ |

Auxiliary loss $L_{act}$ ensures the student learns action context via the subtask token. The transformer configuration inherits design choices from the TimeSformer and ViT architectures.
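
A student-side training setup matching the tabulated hyperparameters might look as follows; the data loader, label tensors, projection dimensions, and the decision to train $f_p$/$f_t$ jointly with the student are assumptions, and the loop is only indicative.

```python
import torch
import torch.nn as nn

# Student-side setup per the table above (AdamW, lr 1e-4, batch 32, 10 epochs).
student = GazeStudent()                         # sketch from Section 2
f_p = nn.Linear(768, 768)                       # student-side alignment projection (assumed shape)
f_t = nn.Linear(3 * 768, 768)                   # teacher-side alignment projection (assumed shape)
params = list(student.parameters()) + list(f_p.parameters()) + list(f_t.parameters())
optimizer = torch.optim.AdamW(params, lr=1e-4)

for epoch in range(10):
    for gaze, skill_y, action_y, e_T in train_loader:   # assumed loader of 16-frame gaze clips
        skill_logits, action_logits, e_s = student(gaze)
        loss = student_loss(skill_logits, action_logits, e_s, e_T,
                            skill_y, action_y, f_p, f_t)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```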

5. Quantitative Evaluation and Ablation Analysis

SkillSight-S is benchmarked on Ego-Exo4D (covering cooking, music, soccer, basketball, bouldering) and MS Badminton datasets. Accuracy metrics for skill classification are as follows (SkillSight-T: teacher, SkillSight-S: student):

| Task | SkillSight-T | SkillSight-S |
|---|---|---|
| Cooking | 58.5% | 47.2% |
| Music | 50.0% | 52.8% |
| Sports (avg) | 55.2% | 51.9% |
| Soccer (expert-novice) | 73.3% | 72.6% |

Ablation results from supplementary Table 5 show incremental improvements:

| Configuration | Accuracy | Change |
|---|---|---|
| Base gaze-only (no distill/action) | 37.0% | baseline |
| + Distillation only | 40.0% | +3.0 pts |
| + Action recognition only | 40.7% | +3.7 pts |
| + Both (full model) | 44.4% | +7.4 pts |

Both distillation token and auxiliary action classification contribute materially to final skill-level accuracy.

6. Power Consumption and Efficiency

SkillSight-S delivers an approximately $73\times$ reduction in energy consumption relative to full video-based models, as validated with a PyTorch-based FLOP and memory profiler:

$P = \alpha \frac{N}{\Delta t} + \beta \frac{B}{\Delta t} + \sum_m \gamma_m \delta_m$

Where:

  • $N$ = MACs (multiply-accumulate operations) per inference pass (1 MAC = 2 FLOPs)
  • $B$ = bytes transferred
  • $\gamma_m$ = sensor power (35 mW RGB camera, 7.8 mW eye tracker)
  • $\alpha = 4.6$ pJ/MAC, $\beta = 80$ pJ/byte, $\Delta t = 0.5$ seconds (per clip)

Empirical measurements yield $P \approx 697.5$ mW for TimeSformer (video-only) versus 9.5 mW for SkillSight-S, i.e., a $73\times$ energy reduction via gaze-only inference.
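
A minimal sketch of the power model follows, using the reported constants and interpreting $\delta_m$ as a sensor duty cycle; the MAC and byte counts in the example call are hypothetical stand-ins for values the paper obtains from its profiler.

```python
ALPHA = 4.6e-12                     # joules per MAC
BETA = 80e-12                       # joules per byte transferred
DT = 0.5                            # seconds per inference (per clip)
SENSOR_POWER_W = {"rgb_camera": 35e-3, "eye_tracker": 7.8e-3}

def inference_power(macs, bytes_moved, duty_cycles):
    """Average power in watts: alpha*N/dt + beta*B/dt + sum_m gamma_m * delta_m."""
    compute = ALPHA * macs / DT
    memory = BETA * bytes_moved / DT
    sensing = sum(SENSOR_POWER_W[m] * d for m, d in duty_cycles.items())
    return compute + memory + sensing

# Hypothetical gaze-only budget: only the eye tracker is active.
print(inference_power(macs=1e8, bytes_moved=5e6, duty_cycles={"eye_tracker": 1.0}))
```

With small MAC and byte counts like these, the compute and memory terms fall to the milliwatt range and the eye tracker's 7.8 mW draw dominates the budget, which is the qualitative point of gaze-only inference.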

7. Implications and Applications

SkillSight-S realizes real-time skill assessment solely from gaze captured by wearable smart glasses, circumventing energy- and computation-intensive video processing. The framework establishes, for the first time, that gaze trajectories encode sufficient information for accurate skill and subtask prediction across domains such as cooking, music, and sports. A plausible implication is that scalable, unobtrusive AI-assisted feedback for skill acquisition becomes feasible in naturalistic settings. Hybrid training with a joint gaze-video teacher and distillation promotes transfer and generalization to new tasks without violating the power constraints of wearable hardware (Wu et al., 24 Nov 2025).
