Interpersonal-Calibrating Eye-gaze Encoder (ICE)
- ICE is a computational framework that estimates a conversational partner’s eye gaze by dynamically clustering gaze vectors from standard RGB video without special hardware or calibration.
- It discretizes gaze into a 3×3 grid anchored on the region of primary visual engagement, achieving an F1 score of 0.846 against an infrared eye tracker and a significant correlation with expert eye-contact ratings.
- ICE demonstrates practical applications in deception detection and social skill assessment, offering actionable insights into nonverbal communication.
The Interpersonal-Calibrating Eye-gaze Encoder (ICE) is a computational framework designed to estimate interpersonal eye gaze (that is, gaze directed at a conversational partner) directly from conventional RGB video without requiring specialized hardware, camera calibration, or prior physical layout information. ICE leverages the empirical observation that, during dyadic conversations, participants allocate most of their gaze toward one another’s faces. The key technical innovation is a dynamic clustering approach that discovers the partner’s location in the subject’s gaze-space purely from video-derived gaze data. ICE discretizes gaze into a 3×3 interpersonal grid referenced to the empirically discovered region of primary visual engagement (RVE), thus operationalizing “does-subject-look-at-partner?” for each frame. The framework has been validated against both objective (infrared eye-tracking, $F_1 = 0.846$) and subjective (expert eye-contact ratings, $r = 0.37$) ground truth, and has demonstrated novel behavioral insights in domains such as deception detection and social skill assessment (Tran et al., 2019).
1. Motivation and Conceptual Architecture
Traditional video-based gaze tracking yields gaze vectors relative to the camera’s coordinate frame. For applications requiring the estimation of interpersonal gaze, such as quantifying eye contact, conversion to partner-relative coordinates typically demands either (a) instrumentation with infrared trackers or (b) knowledge of the physical setup (e.g., camera and screen location, seating arrangement). ICE eliminates these requirements by positing that the densest cluster of a participant’s gaze vectors over a session corresponds to their conversational partner’s face. ICE processes conventional videos, requires no manual annotation or calibration, and outputs, for every frame, a label indicating which of nine discrete interpersonal regions the gaze falls into, with the RVE defined as region 5 (center). The high-level pipeline comprises four stages:
- Face and eye detection with raw gaze angle extraction.
- Assembly of a temporally ordered series of gaze points in angle space.
- Density-based dynamic clustering to discover the RVE.
- Re-normalization of gaze into a standardized 3×3 interpersonal region grid.
2. Video-to-Gaze Pipeline
The ICE processing pipeline algorithmically transforms RGB video into interpersonal gaze labels by the following sequence:
- Face and Eye Detection & Gaze Extraction: For each video frame $t$, OpenFace 2.0 is employed to detect facial landmarks and estimate the gaze vector as $g_t = (\theta_t^x, \theta_t^y)$, which encodes horizontal and vertical eye-in-head angles (in radians) relative to the camera. Frames whose OpenFace confidence falls below a threshold are discarded for reliability.
- Downsampling and Smoothing: To mitigate noise inherent in frame-level gaze estimation, the temporal sequence is downsampled (e.g., to 3 fps) by majority vote over blocks, yielding a more robust set of gaze points.
- Assembly of Gaze Point Cloud: The resulting collection of two-dimensional gaze angles, $G = \{g_1, \dots, g_N\}$, constitutes the input “cloud” for clustering.
This processed dataset forms the basis for unsupervised identification of gaze regions corresponding to the conversational partner.
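The extraction and filtering steps above can be illustrated with a minimal sketch. It assumes OpenFace's CSV output with its `confidence`, `gaze_angle_x`, and `gaze_angle_y` columns; the function name `load_gaze_points` and the threshold `min_confidence=0.9` are hypothetical choices for illustration, not values taken from ICE. The majority-vote downsampling step is not shown.

```python
import csv
import io

def load_gaze_points(csv_text, min_confidence=0.9):
    """Return (gaze_angle_x, gaze_angle_y) pairs in radians.

    min_confidence is a hypothetical threshold, not the one used by ICE.
    """
    # skipinitialspace handles the space that follows each comma in the file.
    reader = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
    points = []
    for row in reader:
        if float(row["confidence"]) < min_confidence:
            continue  # drop frames with an unreliable face/gaze estimate
        points.append((float(row["gaze_angle_x"]), float(row["gaze_angle_y"])))
    return points

# Tiny illustrative input with only the columns this sketch needs.
sample = """frame, confidence, gaze_angle_x, gaze_angle_y
1, 0.98, 0.05, -0.10
2, 0.40, 0.90, 0.70
3, 0.95, 0.04, -0.12
"""
cloud = load_gaze_points(sample)
```

Here the second frame is dropped because its confidence is below the assumed threshold, leaving a two-point cloud.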
3. Dynamic Clustering Algorithm: Mathematical Formulation and Pseudocode
The core of ICE is a dynamic, unsupervised clustering routine based on DBSCAN (Density-Based Spatial Clustering of Applications with Noise) [Ester et al. 1996]. This approach models the distribution of gaze points in the angular plane and automatically finds dense regions without a priori knowledge. The algorithm parameters are dynamically determined:
- Neighborhood radius ($\varepsilon$): Searched over a range of values (in radians) in fixed decrements.
- Minimum points ($\mathrm{MinPts}$): Set to $0.01N$, i.e., 1% of the total number of frames $N$.
Clusters are identified using the standard DBSCAN definitions:
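- $\varepsilon$-Neighborhood: $N_\varepsilon(p) = \{\, q : \lVert p - q \rVert \le \varepsilon \,\}$, taken over all gaze points $q$.
- Core Point: $p$ is a core point iff $|N_\varepsilon(p)| \ge \mathrm{MinPts}$.
- Clusters $C_1, C_2, \dots$ are extracted; others are labeled as noise.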
Parameter Selection: The optimal $\varepsilon$ is selected such that the number of clusters is at least two, the largest and second-largest clusters satisfy a prescribed relation, and a required non-noise condition holds. If no $\varepsilon$ meets these conditions, a fallback value is used.
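To make the definitions concrete, the following is a minimal pure-Python sketch of standard DBSCAN together with a simplified radius search. It is an illustration under stated assumptions rather than the ICE implementation: `select_eps`, `eps_candidates`, and `fallback_eps` are hypothetical names and values, only the at-least-two-clusters condition is checked, and the number of points in the cloud stands in for the total number of frames.

```python
import math

NOISE = -1

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (its eps-neighborhood)."""
    px, py = points[i]
    return [j for j, (qx, qy) in enumerate(points)
            if math.hypot(px - qx, py - qy) <= eps]

def dbscan(points, eps, min_pts):
    """Standard DBSCAN; returns one label per point (NOISE or 0, 1, ...)."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:          # not a core point
            labels[i] = NOISE
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:            # border point claimed by this cluster
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:   # j is a core point too: keep expanding
                queue.extend(j_neighbors)
    return labels

def select_eps(points, eps_candidates, fallback_eps):
    """Try radii from large to small; keep the first that yields >= 2 clusters."""
    min_pts = max(1, round(0.01 * len(points)))  # 1% of the points, per the text
    for eps in sorted(eps_candidates, reverse=True):
        labels = dbscan(points, eps, min_pts)
        if len({label for label in labels if label != NOISE}) >= 2:
            return eps, labels
    return fallback_eps, dbscan(points, fallback_eps, min_pts)

# Two synthetic, well-separated gaze blobs (illustrative data only).
demo_points = ([(0.001 * i, 0.0) for i in range(120)]
               + [(1.0 + 0.001 * i, 1.0) for i in range(80)])
demo_eps, demo_labels = select_eps(demo_points, [2.0, 0.5, 0.1], fallback_eps=0.05)
```

With the largest candidate radius the two blobs merge into one cluster, so the search moves on to the next radius, at which they separate.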
Pseudocode for ICE Calibration:
Each gaze frame receives an interpersonal region label $r \in \{1, \dots, 9\}$ indicating gaze as left/right/center/up/down and diagonals relative to the empirically estimated RVE.
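The relabeling step can be sketched as follows. Several details here are assumptions made for illustration and are not confirmed by the text: the RVE is approximated by the bounding box of the most populated cluster, regions are numbered row-major from the top-left (so that the center is region 5), and the vertical angle is assumed to increase downward. The helper names `rve_box` and `region_label` are hypothetical.

```python
from collections import Counter

NOISE = -1  # label assumed for unclustered points

def rve_box(points, labels):
    """Bounding box (x_min, x_max, y_min, y_max) of the most populated cluster."""
    counts = Counter(label for label in labels if label != NOISE)
    biggest = counts.most_common(1)[0][0]
    xs = [p[0] for p, label in zip(points, labels) if label == biggest]
    ys = [p[1] for p, label in zip(points, labels) if label == biggest]
    return min(xs), max(xs), min(ys), max(ys)

def region_label(point, box):
    """Map a gaze point to 1..9, row-major, with 5 meaning 'inside the RVE'."""
    x, y = point
    x_min, x_max, y_min, y_max = box
    col = 0 if x < x_min else 2 if x > x_max else 1
    row = 0 if y < y_min else 2 if y > y_max else 1  # assumes y grows downward
    return 3 * row + col + 1

# Illustrative data: three points in the main cluster, two outliers.
points = [(0.00, 0.00), (0.10, 0.10), (0.05, 0.02), (-0.60, 0.05), (0.05, 0.70)]
labels = [0, 0, 0, NOISE, NOISE]
box = rve_box(points, labels)
regions = [region_label(p, box) for p in points]
```

Under these assumptions, a point left of the box at the same height maps to region 4 and a point directly below it maps to region 8.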
4. Validation Metrics
ICE is quantitatively validated against ground truth using two principal metrics:
- Video Chat (Objective): Precision, Recall, and $F_1$-score between ICE's RVE-detection and an infrared gaze tracker (Gazepoint GP3). The core task is binary: “Is gaze in RVE?”.
- Face-to-Face (Subjective): Pearson correlation $r$ between ICE's estimated proportion of RVE-directed gaze and human expert ratings of eye contact (six-point scale).
The following table summarizes these metrics:
| Context | Sample Size ($N$) | Metric | Value |
|---|---|---|---|
| Video Chat | 8 | $F_1$ | 0.846 |
| Face-to-Face | 166 | Pearson $r$ | 0.37 |
The high $F_1$ and the statistically significant correlation coefficient support both the objective and subjective validity of ICE across video-chat and face-to-face settings.
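For reference, both metrics can be computed from their textbook definitions. The sketch below uses made-up inputs purely to show the arithmetic; the function names are illustrative and none of the numbers come from the ICE experiments.

```python
import math

def precision_recall_f1(predicted, actual):
    """Binary metrics for 'is gaze in RVE?' given per-frame booleans."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)

# Made-up per-frame decisions: ICE output vs. tracker ground truth.
ice_in_rve = [True, True, False, True, False]
tracker_in_rve = [True, False, False, True, True]
precision, recall, f1 = precision_recall_f1(ice_in_rve, tracker_in_rve)

# Made-up per-participant RVE fractions vs. expert ratings.
r = pearson_r([0.2, 0.5, 0.7, 0.9], [2.0, 3.0, 5.0, 5.5])
```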
5. Experimental Setup and Results
Video Chat Validation:
Dyadic conversations were recorded (15 fps) while a Gazepoint GP3 IR tracker (60 Hz) simultaneously recorded participants' gaze. Data streams were temporally aligned via cross-correlation. ICE region assignments (3 fps, downsampled) were compared framewise to the IR tracker results and summarized by mean accuracy and mean $F_1$; the mean $F_1$ was 0.846 ($N = 8$).
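Alignment by cross-correlation can be sketched generically as a search for the delay that maximizes the overlap between two mean-centered signals. The sketch assumes both streams have already been resampled to a common rate and reduced to one-dimensional signals; which signals ICE actually correlates is not stated here, and `best_delay` and `max_delay` are hypothetical names.

```python
def best_delay(a, b, max_delay):
    """Return d in [-max_delay, max_delay] maximizing sum_i a[i] * b[i + d].

    A positive result means b is delayed by d samples relative to a.
    """
    def centered(seq):
        mean = sum(seq) / len(seq)
        return [value - mean for value in seq]

    a_c, b_c = centered(a), centered(b)
    best_d, best_score = 0, float("-inf")
    for d in range(-max_delay, max_delay + 1):
        # Only sum over indices where both signals are defined.
        score = sum(a_c[i] * b_c[i + d]
                    for i in range(len(a_c)) if 0 <= i + d < len(b_c))
        if score > best_score:
            best_d, best_score = d, score
    return best_d

# Illustrative signals: the tracker trace is the video trace shifted by 2 samples.
video_signal = [0, 0, 1, 3, 1, 0, 0, 0]
tracker_signal = [0, 0, 0, 0, 1, 3, 1, 0]
delay = best_delay(video_signal, tracker_signal, max_delay=3)
```

The score is left unnormalized for brevity, which slightly favors delays with larger overlap; a normalized variant may be preferable for short recordings.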
Face-to-Face Validation:
A dataset of 170 speed-dating conversations (4 minutes each, third-person camera) was annotated by experts on a 1–6 eye contact scale. ICE computed each participant’s RVE fraction (the fraction of frames labeled as RVE). Across participants, the RVE fraction correlated with mean expert ratings at Pearson’s $r = 0.37$, and a further correlation was observed between the mean RVE fraction and rating level.
6. Behavioral and Applied Insights
ICE’s framewise gaze labels enable robust quantification of gaze behavior in complex affective communication contexts:
Deception Detection:
- Dataset: 87 dyads (47 bluffers, 38 truth-tellers) in incentivized interrogations.
- Features: Normalized frequencies of gaze in each of the 9 regions.
- Statistical Test: Truth-tellers exhibited significantly greater frequency of downward gaze during questioning (significant after correction, with effect size measured by Cohen's $d$).
- Predictive Modeling: Logistic regression using (a) affective features (Affdex, 8 emotions + valence + engagement), (b) ICE region frequencies, and (c) both. Affective features alone yielded 52–55% accuracy, ICE alone 64.3%, and the combination 66.0% (log-loss 0.677).
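The region-frequency features can be sketched as a nine-element vector of label proportions. The function name `region_frequencies` and the sample labels are illustrative only; the output would then serve as input to a classifier such as the logistic regression mentioned above.

```python
from collections import Counter

def region_frequencies(labels):
    """Fraction of frames spent in each of regions 1..9 (sums to 1)."""
    counts = Counter(labels)
    total = len(labels)
    return [counts.get(region, 0) / total for region in range(1, 10)]

# Made-up per-frame region labels for one participant.
frame_labels = [5, 5, 5, 8, 2, 5, 8, 5, 5, 4]
features = region_frequencies(frame_labels)
```

In this example, index 4 of `features` holds the share of frames in region 5 (the RVE) and index 7 the share in region 8.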
Speed Dating Skill Assessment:
- Dataset: 170 face-to-face videos rated on conversational skill and eye contact.
- Features: Video means of 17 facial Action Units (OpenFace) and the ICE gaze feature.
- Model: LASSO regression, five-fold CV.
- Results: For Conversational Skill, mean squared error (MSE) was reduced from 1.307 (AU only) to 1.268 (with ICE); for Eye Contact, from 1.756 to 1.717. The ICE feature was the strongest predictor for overall conversational skill, and second strongest for expert-rated eye contact.
7. Significance and Implications
ICE provides a hardware- and annotation-free approach for extracting interpersonal gaze from arbitrary video, producing discrete, partner-relative gaze features that correlate with both ground truth sensors and expert human judgment. Its utility is demonstrated both in controlled settings (objective eye-tracker validation) and complex, naturalistic interactions (conversational skill, deception detection). ICE’s findings—such as the association between downward gaze and truth-telling or the superior predictive power of interpersonal gaze over facial expressions—reveal new directions in social and affective computing methodology (Tran et al., 2019). A plausible implication is that ICE could generalize to other multi-party and cross-cultural communication contexts where camera geometry and ground truth are unavailable or impractical.