Interpersonal-Calibrating Eye-gaze Encoder (ICE)

Updated 3 May 2026
  • ICE is a computational framework that estimates a conversational partner’s eye gaze by dynamically clustering gaze vectors from standard RGB video without special hardware or calibration.
  • It discretizes gaze into a 3×3 grid centered on the region of primary visual engagement, achieving an F1 score of 0.846 against an infrared eye tracker and a correlation of 0.37 with expert eye-contact ratings.
  • ICE demonstrates practical applications in deception detection and social skill assessment, offering actionable insights into nonverbal communication.

The Interpersonal-Calibrating Eye-gaze Encoder (ICE) is a computational framework designed to estimate interpersonal eye gaze (that is, gaze directed at a conversational partner) directly from conventional RGB video without requiring specialized hardware, camera calibration, or prior physical layout information. ICE leverages the empirical observation that, during dyadic conversations, participants allocate most of their gaze toward one another’s faces. The key technical innovation is a dynamic clustering approach that discovers the partner’s location in the subject’s gaze-space purely from video-derived gaze data. ICE discretizes gaze into a 3×3 interpersonal grid referenced to the empirically discovered region of primary visual engagement (RVE), thus operationalizing the question “does the subject look at the partner?” for each frame. The framework has been validated against both objective (infrared eye-tracking, $F_1 = 0.846$) and subjective (expert eye-contact ratings, $r = 0.37$) ground truth, and has yielded behavioral insights in domains such as deception detection and social skill assessment (Tran et al., 2019).

1. Motivation and Conceptual Architecture

Traditional video-based gaze tracking yields gaze vectors relative to the camera’s coordinate frame. For applications that require interpersonal gaze, such as quantifying eye contact, conversion to partner-relative coordinates typically demands either (a) instrumentation with infrared trackers or (b) knowledge of the physical setup (e.g., camera and screen location, seating arrangement). ICE eliminates these requirements by positing that the densest cluster of a participant’s gaze vectors over a session corresponds to their conversational partner’s face. ICE processes conventional videos, requires no manual annotation or calibration, and outputs, for every frame, a label indicating which of nine discrete interpersonal regions the gaze falls into, with the RVE defined as region 5 (center). The high-level pipeline comprises four stages:

  1. Face and eye detection with raw gaze angle extraction.
  2. Assembly of a temporally ordered series of gaze points in angle space.
  3. Density-based dynamic clustering to discover the RVE.
  4. Re-normalization of gaze into a standardized 3×3 interpersonal region grid.

2. Video-to-Gaze Pipeline

The ICE processing pipeline algorithmically transforms RGB video into interpersonal gaze labels by the following sequence:

  • Face and Eye Detection & Gaze Extraction: For each video frame $t$, OpenFace 2.0 is employed to detect facial landmarks and estimate the gaze vector $(g_x(t), g_y(t))$, which encodes gaze angles (in radians) relative to the camera. Frames with OpenFace confidence $< 0.9$ are discarded for reliability.
  • Downsampling and Smoothing: To mitigate noise inherent in frame-level gaze estimation, the temporal sequence $\{(g_x(t), g_y(t))\}$ is downsampled (e.g., to 3 fps) by majority vote over blocks, yielding a more robust set of gaze points.
  • Assembly of Gaze Point Cloud: The resulting collection of $N$ two-dimensional gaze angles, $P = \{p_i = (g_x(i), g_y(i)) : i = 1, \ldots, N\}$, constitutes the input “cloud” for clustering.

This processed dataset forms the basis for unsupervised identification of gaze regions corresponding to the conversational partner.
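The preprocessing steps above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `preprocess_gaze`, the `(confidence, gx, gy)` tuple layout, and the block size of 5 (15 fps down to 3 fps) are assumptions. Because the text does not say how a majority vote is taken over continuous angles, the sketch substitutes a per-block median.

```python
from statistics import median

def preprocess_gaze(frames, min_conf=0.9, block=5):
    """Filter low-confidence frames and downsample by fixed-size blocks.

    frames: list of (confidence, gx, gy) tuples in temporal order (assumed layout).
    Returns one (gx, gy) gaze point per block that has at least one reliable frame.
    """
    points = []
    for start in range(0, len(frames), block):
        # Keep only frames whose tracker confidence reaches the threshold.
        kept = [(gx, gy) for conf, gx, gy in frames[start:start + block]
                if conf >= min_conf]
        if not kept:
            continue  # the whole block was unreliable, so it contributes no point
        # Assumption: a block median stands in for the per-block vote.
        points.append((median(p[0] for p in kept), median(p[1] for p in kept)))
    return points

# Hypothetical input: 15 frames, the middle block entirely low-confidence.
frames = ([(0.95, 0.1, 0.0)] * 3 + [(0.5, 9.0, 9.0)] * 2
          + [(0.2, 1.0, 1.0)] * 5 + [(0.99, 0.3, -0.2)] * 5)
cloud = preprocess_gaze(frames)
```

The returned list plays the role of the point cloud $P$ passed to the clustering stage.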

3. Dynamic Clustering Algorithm: Mathematical Formulation and Pseudocode

The core of ICE is a dynamic, unsupervised clustering routine based on DBSCAN (Density-Based Spatial Clustering of Applications with Noise) [Ester et al. 1996]. This approach models the distribution of gaze points in the angular plane and automatically finds dense regions without a priori knowledge. The algorithm parameters are dynamically determined:

  • Neighborhood radius ($\epsilon$): Searched over $[0, 1]$ radians in fixed decrements.
  • Minimum points ($\mathrm{MinPts}$): Set to 1% of the total number of frames.

Clusters are identified using the standard DBSCAN definitions:

  • $\epsilon$-Neighborhood: $N_\epsilon(p) = \{q \in P : \lVert p - q \rVert \le \epsilon\}$.
  • Core Point: $p$ is a core point iff $|N_\epsilon(p)| \ge \mathrm{MinPts}$.
  • Clusters $C_1, \ldots, C_k$ are extracted; all other points are labeled as noise.

Parameter Selection: The optimal $\epsilon$ is selected such that the number of clusters is at least two, the largest and second-largest clusters satisfy a size criterion, and the cluster taken as the RVE is not noise. If no $\epsilon$ meets these conditions, a fallback value is used.
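As a rough illustration of the clustering and parameter search, the sketch below implements a minimal DBSCAN in pure Python and sweeps a caller-supplied, decreasing list of radii. Several parts are assumptions rather than details taken from ICE: the names `dbscan` and `find_rve_cluster`, accepting the first radius that yields at least two clusters, and the optional `size_ok` predicate, which is a hypothetical stand-in for the size criterion on the two largest clusters.

```python
import math

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (its eps-neighborhood, including i)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: returns one label per point (0, 1, ... for clusters, -1 for noise)."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = -1           # provisionally noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point is a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = region_query(points, j, eps)
            if len(nbrs) >= min_pts:  # j is a core point, so keep expanding from it
                queue.extend(nbrs)
    return labels

def find_rve_cluster(points, eps_values, min_frac=0.01, size_ok=None):
    """Try each radius in order; return (eps, points of the largest cluster) or None."""
    min_pts = max(1, math.ceil(min_frac * len(points)))
    for eps in eps_values:
        labels = dbscan(points, eps, min_pts)
        sizes = {}
        for lab in labels:
            if lab != -1:
                sizes[lab] = sizes.get(lab, 0) + 1
        if len(sizes) < 2:
            continue                 # the text requires at least two clusters
        ranked = sorted(sizes, key=sizes.get, reverse=True)
        # size_ok is a hypothetical hook for the size criterion on the top two clusters.
        if size_ok is None or size_ok(sizes[ranked[0]], sizes[ranked[1]]):
            return eps, [p for p, lab in zip(points, labels) if lab == ranked[0]]
    return None                      # caller applies whatever fallback it prefers

# Hypothetical toy cloud: a dense blob near the origin and a smaller one elsewhere.
blob_a = [(0.0, 0.0), (0.01, 0.0), (0.0, 0.01), (0.01, 0.01), (0.02, 0.0), (0.0, 0.02)]
blob_b = [(0.5, 0.5), (0.51, 0.5), (0.5, 0.51)]
result = find_rve_cluster(blob_a + blob_b, eps_values=[1.0, 0.5, 0.1])
```

In this toy case the radius 1.0 merges both blobs into one cluster and is rejected, so the search moves on to the next smaller radius.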

After calibration, each gaze frame receives an interpersonal region label in $\{1, \ldots, 9\}$ indicating gaze as left/right/center/up/down and diagonals relative to the empirically estimated RVE.
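The relabeling into nine regions might look like the following sketch. It assumes that the RVE is bounded by the axis-aligned bounding box of the discovered cluster, that regions are numbered row by row from 1 (first row, first column) to 9 with 5 at the center, and that smaller `gx` and `gy` values map to the first column and row. None of these conventions is confirmed by the text beyond region 5 being the center, and the function names are illustrative.

```python
def rve_bounds(cluster_points):
    """Axis-aligned bounding box (x_min, x_max, y_min, y_max) of the RVE cluster."""
    xs = [p[0] for p in cluster_points]
    ys = [p[1] for p in cluster_points]
    return min(xs), max(xs), min(ys), max(ys)

def region_label(point, bounds):
    """Map a gaze point to a region 1..9; region 5 means the gaze is inside the RVE."""
    x_min, x_max, y_min, y_max = bounds
    gx, gy = point
    # Column 0/1/2: before, inside, or beyond the RVE's horizontal extent.
    col = 0 if gx < x_min else (2 if gx > x_max else 1)
    # Row 0/1/2: before, inside, or beyond the RVE's vertical extent.
    row = 0 if gy < y_min else (2 if gy > y_max else 1)
    return 3 * row + col + 1

# Hypothetical RVE cluster spanning [-0.1, 0.1] on both axes.
bounds = rve_bounds([(-0.1, -0.1), (0.1, 0.1), (0.0, 0.05)])
labels = [region_label(p, bounds) for p in [(0.0, 0.0), (-0.5, -0.5), (0.5, 0.5), (0.0, 0.5)]]
```

The binary question "is gaze in the RVE?" then reduces to checking whether a frame's label equals 5.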

4. Validation Metrics

ICE is quantitatively validated against ground truth using two principal metrics:

  • Video Chat (Objective): Precision, Recall, and $F_1$-score between ICE's RVE-detection and an infrared gaze tracker (Gazepoint GP3). The core task is binary: “Is gaze in RVE?”.
  • Face-to-Face (Subjective): Pearson correlation $r$ between ICE's estimated proportion of RVE-directed gaze and human expert ratings of eye contact (six-point scale).

The following table summarizes these metrics:

Context        Sample Size ($n$)   Metric        Value
Video Chat     8                   $F_1$         0.846
Face-to-Face   166                 Pearson $r$   0.37

The high $F_1$ and the significant correlation coefficient support both the objective and the subjective validity of ICE across conversational settings.
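For reference, both metrics can be computed with standard formulas. The sketch below uses made-up toy inputs, and the function names are illustrative; it does not reproduce the study's data.

```python
import math

def precision_recall_f1(predicted, actual):
    """Binary precision, recall, and F1, where truthy values mean 'gaze is in the RVE'."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy example: ICE's per-frame RVE decision vs. a reference tracker's decision.
ice_in_rve = [1, 1, 0, 1, 0]
tracker_in_rve = [1, 0, 0, 1, 1]
precision, recall, f1 = precision_recall_f1(ice_in_rve, tracker_in_rve)
```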

5. Experimental Setup and Results

Video Chat Validation:

Dyadic conversations were recorded (15 fps) while participants were simultaneously tracked with a Gazepoint GP3 IR tracker (60 Hz). Data streams were temporally aligned via cross-correlation. ICE region assignments (3 fps, downsampled) were compared framewise to the IR tracker results. Both mean accuracy and mean $F_1$ were reported; the mean $F_1$ was 0.846.
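Alignment by cross-correlation can be illustrated with a brute-force lag search. The sketch assumes both streams have already been resampled to a common rate and reduced to one scalar per sample; the text does not specify which signal was correlated, so `best_lag` and its inputs are hypothetical.

```python
def best_lag(a, b, max_lag):
    """Return the shift s (in samples) that maximizes the correlation of a[i] with b[i + s]."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    best, best_score = 0, float("-inf")
    for s in range(-max_lag, max_lag + 1):
        score = 0.0
        for i in range(len(a)):
            j = i + s
            if 0 <= j < len(b):
                # Accumulate the mean-centered product over the overlapping samples only.
                score += (a[i] - mean_a) * (b[j] - mean_b)
        if score > best_score:
            best, best_score = s, score
    return best

# Toy signals: b is a copy of a delayed by two samples.
a = [0, 0, 1, 3, 1, 0, 0, 0]
b = [0, 0, 0, 0, 1, 3, 1, 0]
lag = best_lag(a, b, max_lag=3)
```

Once the lag is known, one stream is shifted by that many samples before the framewise comparison.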

Face-to-Face Validation:

A dataset of 170 speed-dating conversations (4 minutes each, third-person camera) was annotated by experts on a 1–6 eye contact scale. For each participant, ICE computed the fraction of frames labeled as RVE. Across participants, Pearson’s $r = 0.37$ was found between this RVE fraction and mean expert ratings, and a further correlation was reported between the mean RVE fraction and rating level.

6. Behavioral and Applied Insights

ICE’s framewise gaze labels enable robust quantification of gaze behavior in complex affective communication contexts:

Deception Detection:

  • Dataset: 87 dyads (47 bluffers, 38 truth-tellers) in incentivized interrogations.
  • Features: Normalized frequencies of gaze in each of the 9 regions.
  • Statistical Test: Truth-tellers exhibited a significantly greater frequency of downward gaze during questioning (significant after correction; effect size reported as Cohen's $d$).
  • Predictive Modeling: Logistic regression using (a) affective features (Affdex, 8 emotions + valence + engagement), (b) ICE region frequencies, and (c) both; affective features alone yielded 52–55% accuracy, ICE alone 64.3%, and the combination 66.0% (log-loss 0.677).
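The region-frequency features could be derived from the per-frame labels roughly as follows. This is a sketch with an illustrative function name and toy labels; the ordering of the returned vector (region 1 first) is an assumption.

```python
from collections import Counter

def region_frequencies(labels):
    """Normalized frequency of each region 1..9 over a sequence of per-frame labels."""
    if not labels:
        return [0.0] * 9  # no labeled frames, so every frequency is zero
    counts = Counter(labels)
    total = len(labels)
    return [counts.get(region, 0) / total for region in range(1, 10)]

# Toy sequence of eight frame labels (5 = RVE).
features = region_frequencies([5, 5, 5, 8, 2, 5, 8, 5])
rve_fraction = features[4]  # share of frames in region 5
```

The same vector yields the fraction of RVE-directed gaze as its fifth entry, so one pass over the labels serves both the deception features and the eye-contact measure.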

Speed Dating Skill Assessment:

  • Dataset: 170 face-to-face videos rated on conversational skill and eye contact.
  • Features: Video means of 17 facial Action Units (OpenFace) and the ICE RVE fraction.
  • Model: LASSO regression, five-fold CV.
  • Results: For Conversational Skill, mean squared error (MSE) was reduced from 1.307 (AU only) to 1.268 (with ICE); for Eye Contact, from 1.756 to 1.717. The ICE RVE fraction was the strongest predictor of overall conversational skill and the second strongest of expert-rated eye contact.

7. Significance and Implications

ICE provides a hardware- and annotation-free approach for extracting interpersonal gaze from arbitrary video, producing discrete, partner-relative gaze features that correlate with both ground truth sensors and expert human judgment. Its utility is demonstrated both in controlled settings (objective eye-tracker validation) and complex, naturalistic interactions (conversational skill, deception detection). ICE’s findings—such as the association between downward gaze and truth-telling or the superior predictive power of interpersonal gaze over facial expressions—reveal new directions in social and affective computing methodology (Tran et al., 2019). A plausible implication is that ICE could generalize to other multi-party and cross-cultural communication contexts where camera geometry and ground truth are unavailable or impractical.
