
EgoSpanLift: 3D Visual Span Forecasting

Updated 30 November 2025
  • EgoSpanLift is a spatio-temporal forecasting method that predicts human gaze in 3D using structured voxel representations derived from SLAM keypoints and gaze directions.
  • It leverages multi-scale volumetric encoding by partitioning voxel grids into angular spans, capturing foveal to peripheral vision with precise spatial delineation.
  • The approach achieves state-of-the-art IoU improvements over baseline methods while enabling real-time predictions on commodity GPUs for AR/VR and robotics applications.

EgoSpanLift is a method for spatio-temporal forecasting of human visual span in egocentric 3D environments, transitioning gaze prediction from traditional 2D image coordinates to structured, multi-scale 3D voxel representations. By leveraging inputs from simultaneous localization and mapping (SLAM)-derived keypoints, precise gaze directions, and camera trajectories, EgoSpanLift encodes recent visual behavior into volumetric occupancy grids—partitioned into multiple angular spans—and forecasts where an observer’s gaze and peripheral attention will focus in the future. The method advances anticipatory scene understanding for applications in AR/VR, robotics, and assistive perception, and establishes the first large-scale, real-time benchmark for egocentric 3D visual span prediction (Yun et al., 23 Nov 2025).

1. Formal Problem Statement

EgoSpanLift addresses the task of egocentric 3D visual-span forecasting. The system assumes access to:

  • A video stream and SLAM-recovered semi-dense keypoints $\mathcal{P} = \{ (\mathbf p_i, \sigma_i, t_i) \}$ over time,
  • Camera-to-world poses $\mathcal E = \{\mathbf E_t \in \mathrm{SE}(3)\}$ (extrinsics),
  • Framewise gaze directions $\{ \mathbf g_t \in \mathbb R^3 \}$ in the local camera (“pupil”) frame,
  • A fixed temporal observation window $[t_0 - T_p, t_0]$ prior to the current time $t_0$.

The prediction target is a set of 3D binary occupancy grids $\tilde V^{(\theta)} \in \{0, 1\}^{4 \times R \times R \times R}$, where each channel corresponds to a distinct visual span: head orientation ($\theta = 55^\circ$), near-peripheral ($30^\circ$), central ($8^\circ$), and foveal ($2^\circ$). These occupancy volumes forecast which voxels in a surrounding cube of side $D = 3.2$ m will be within the user’s visual span during a future interval of $T_f$ seconds (2 s for everyday activity, 4 s for skilled tasks).
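As a quick sanity check on the tensor shapes, here is a minimal sketch; the grid resolution follows from the cube side $D = 3.2$ m and the default 20 cm voxels introduced in Section 2:

```python
# Shape check for the prediction target, assuming the defaults stated in the text:
# cube side D = 3.2 m, 20 cm voxels, and four angular spans.
D = 3.2                     # side length of the local cube (metres)
voxel = 0.2                 # default voxel edge (metres)
R = round(D / voxel)        # voxels per axis -> 16
spans_deg = (55, 30, 8, 2)  # orientation, near-peripheral, central, foveal

target_shape = (len(spans_deg), R, R, R)
print(target_shape)         # (4, 16, 16, 16)
```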

2. Geometry Conversion and Volumetric Encoding

The transformation from raw SLAM points and gaze vectors to 3D span volumes consists of three steps:

  1. Keypoint Filtering: At each time $t$, EgoSpanLift selects keypoints from the SLAM stream,

$$\mathcal P_t = \left\{ p_i \in \mathcal P \mid t_i = t,\; \|\mathbf p_i - \mathbf t_t\|_1 < \frac{D}{2},\; \mathcal I_f(p_i; \mathcal P) = 1 \right\},$$

retaining only those inside a local cube centered at the camera and passing a spatial outlier filter $\mathcal I_f$.

  2. Gaze-Compatible Classification: Each $p_i$ is transformed into the local pupil frame, $\mathbf q_i = \mathbf E_t^{-1} \mathbf p_i$. A keypoint is assigned to a span $Q_t^{\theta, \mathbf g_t}$ if

$$\frac{\langle \mathbf q_i, \mathbf g_t \rangle}{\|\mathbf q_i\| \|\mathbf g_t\|} > \cos \theta.$$

Angular thresholds $\theta$ are set to 2°, 8°, 30°, and 55°, delineating foveal through orientation-level spans.

  3. Voxel Occupancy Encoding: Across the interval $[t_b, t_e]$, all keypoints in $Q_t^{\theta}$ are embedded into an $R^3$ Cartesian grid (default 20 cm voxels), via

$$V_{[t_b, t_e]}^{\theta}(i, j, k) = \mathcal I \left( \left| \left\{ p \in \bigcup_t Q_t^\theta : 0 \leq \left(p - \mathbf t_{t_b} + \tfrac{D}{2}\right)\tfrac{R}{D} - (i, j, k) \leq 1 \right\} \right| > 0 \right),$$

applying the indicator function $\mathcal I$ for binary occupancy. By stacking the four visual spans plus a general scene channel, EgoSpanLift forms a 5-channel volume for every timepoint (see the sketch below).
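The three-step conversion above can be condensed into a short NumPy sketch. This is an illustrative implementation under stated assumptions, not the authors' code: the function names are ours, the "local cube" test is realized with a per-axis bound, and $R = 16$ follows from $D = 3.2$ m with 20 cm voxels.

```python
import numpy as np

D = 3.2                              # local cube side (m)
R = 16                               # voxels per axis (D / 0.2 m)
SPANS_DEG = (55.0, 30.0, 8.0, 2.0)   # orientation, peripheral, central, foveal

def filter_keypoints(points_w, cam_center):
    """Step 1: keep world-frame points inside the local cube around the camera
    (per-axis bound; the spatial outlier filter would be applied in addition)."""
    inside = np.all(np.abs(points_w - cam_center) < D / 2, axis=1)
    return points_w[inside]

def classify_spans(points_w, E_t, gaze_cam):
    """Step 2: map points into the pupil frame and test the angular cone
    for each span threshold theta."""
    R_wc, t_wc = E_t[:3, :3], E_t[:3, 3]           # camera-to-world rotation/translation
    q = (points_w - t_wc) @ R_wc                   # world -> camera ("pupil") frame
    cos_sim = (q @ gaze_cam) / (
        np.linalg.norm(q, axis=1) * np.linalg.norm(gaze_cam) + 1e-9)
    return {th: points_w[cos_sim > np.cos(np.deg2rad(th))] for th in SPANS_DEG}

def voxelize(points_w, cube_center):
    """Step 3: binary occupancy on an R^3 grid (20 cm voxels)."""
    vol = np.zeros((R, R, R), dtype=np.uint8)
    idx = np.floor((points_w - cube_center + D / 2) * (R / D)).astype(int)
    idx = idx[np.all((idx >= 0) & (idx < R), axis=1)]
    vol[idx[:, 0], idx[:, 1], idx[:, 2]] = 1
    return vol
```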

3. Spatio-Temporal Network Architecture

EgoSpanLift combines spatio-temporal encoding with volumetric prediction via the following architecture:

  • Input Construction: An input tensor $X \in \mathbb R^{T_p \times 5 \times R \times R \times R}$ encodes the full $T_p$-second window as a sequence of 5-channel volumes.
  • 3D U-Net Encoder: Each $5 \times R^3$ frame is passed through a 3D U-Net encoder path consisting of repeated 3D convolutions (kernel $3^3$), ReLU activations, and max-pooling. Features double in channel count per stage, reaching $C = 128$.
  • Global Embedding: The entire input is average-pooled and projected to $\mathbb R^C$, then concatenated with the temporal sequence as a special “head” token.
  • Unidirectional Transformer: The combined sequence is processed by a transformer encoder with causal masking, implementing multi-head self-attention and layer normalization. The output for the head token, $\tilde e_{\rm head}$, summarizes the temporal context.
  • 3D U-Net Decoder: The embedding $\tilde e_{\rm head}$ is upsampled through transpose convolutions and skip connections with encoder features. The final predicted tensor $\hat Y \in \mathbb R^{4 \times R \times R \times R}$ (after a sigmoid) provides soft occupancy for each future visual span (a condensed code sketch follows this list).
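A condensed PyTorch sketch of this pipeline is given below. It is a simplified stand-in, not the authors' implementation: the encoder and decoder here use two stages instead of a full U-Net, skip connections are omitted, and the global head token is approximated by a mean over the per-frame tokens.

```python
import torch
import torch.nn as nn

class SpanForecaster(nn.Module):
    """Illustrative encoder -> causal transformer -> decoder sketch."""
    def __init__(self, in_ch=5, out_ch=4, width=32, embed_dim=128, grid=16):
        super().__init__()
        # Per-frame 3D encoder (two conv/pool stages; the paper's U-Net is deeper).
        self.encoder = nn.Sequential(
            nn.Conv3d(in_ch, width, 3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),                               # grid -> grid/2
            nn.Conv3d(width, 2 * width, 3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),                               # grid/2 -> grid/4
        )
        feat = 2 * width * (grid // 4) ** 3
        self.to_token = nn.Linear(feat, embed_dim)
        layer = nn.TransformerEncoderLayer(embed_dim, nhead=4, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=2)
        self.from_token = nn.Linear(embed_dim, feat)
        # Decoder: upsample the head embedding back to the full grid (no skips here).
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(2 * width, width, 2, stride=2), nn.ReLU(),
            nn.ConvTranspose3d(width, out_ch, 2, stride=2),
        )
        self.grid, self.width = grid, width

    def forward(self, x):                                  # x: (B, T, 5, R, R, R)
        B, T = x.shape[:2]
        feats = self.encoder(x.flatten(0, 1))              # (B*T, 2w, R/4, R/4, R/4)
        tokens = self.to_token(feats.flatten(1)).view(B, T, -1)
        head = tokens.mean(dim=1, keepdim=True)            # global "head" token
        seq = torch.cat([tokens, head], dim=1)             # head token appended last
        mask = nn.Transformer.generate_square_subsequent_mask(T + 1).to(x.device)
        e_head = self.temporal(seq, mask=mask)[:, -1]      # causal attention
        coarse = self.from_token(e_head).view(
            B, 2 * self.width, *([self.grid // 4] * 3))
        return torch.sigmoid(self.decoder(coarse))         # (B, 4, R, R, R)
```

A forward pass with a dummy batch, `SpanForecaster()(torch.zeros(2, 8, 5, 16, 16, 16))`, returns a `(2, 4, 16, 16, 16)` soft-occupancy tensor, matching the output shape described above.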

4. Training Procedure, Losses, and Inference

EgoSpanLift is supervised with a multi-class Dice loss, an objective particularly suited to sparse target prediction in volume grids: $\mathcal L = 1 - \frac{2 \sum \hat Y \odot Y}{\sum \hat Y + \sum Y + 1}$, where the sums run over the four span channels and all voxels, and the constant $1$ smooths the division. Binary cross-entropy was examined in ablation but reduced foveal-span performance by 12% IoU relative to the Dice loss. The authors indicate that inference on commodity GPUs (12 GB VRAM) achieves a runtime of 71 ms per sample. No details are specified for batch size, learning-rate schedules, or augmentation (Yun et al., 23 Nov 2025).
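Written out, the Dice objective above amounts to a few lines of PyTorch (a minimal sketch; batch handling and any per-channel weighting in the authors' training code are not specified):

```python
import torch

def dice_loss(pred, target, smooth=1.0):
    """Multi-class Dice loss over the four span channels and all voxels,
    matching the formula above (smooth=1 is the constant in the denominator).

    pred:   (B, 4, R, R, R) soft occupancy after the sigmoid
    target: (B, 4, R, R, R) binary ground-truth occupancy
    """
    intersection = (pred * target).sum()
    return 1.0 - (2.0 * intersection) / (pred.sum() + target.sum() + smooth)
```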

5. Benchmark Datasets and Evaluation Protocol

A large-scale, two-part benchmark was constructed for evaluation:

| Dataset | Domain/Source | Window/Stride | N Train / Val / Test | Task |
|---|---|---|---|---|
| FoVS-Aria | Aria Everyday Activities | 2 s / 1 s | 19.3K / 1.9K / 2.1K | Forecast 2 s 3D span |
| FoVS-EgoExo | Ego-Exo4D (skilled tasks) | 4 s / — | 274.7K / 29.6K / 37.0K | Forecast 4 s 3D span |

Evaluation metrics:

  • 3D Intersection-over-Union (IoU) for each span is the primary metric;
  • Voxelwise F1 is also reported;
  • For foveal spans, min/avg/max 3D localization errors (cm);
  • For image-projected results, F1, precision, and recall on the image plane.
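For reference, the per-volume form of the primary metrics could be computed as follows (an illustrative helper; how the benchmark aggregates across spans and samples is defined by the authors):

```python
import numpy as np

def voxel_iou_f1(pred, gt, eps=1e-9):
    """3D IoU and voxelwise F1 for one binary span volume.

    pred, gt: boolean arrays of shape (R, R, R).
    """
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    iou = tp / (tp + fp + fn + eps)
    f1 = 2 * tp / (2 * tp + fp + fn + eps)
    return iou, f1
```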

6. Quantitative Results

EgoSpanLift consistently improves upon prior baselines in both everyday and skilled-activity domains:

FoVS-Aria test:

  • Orientation span IoU: 0.5838 (EgoSpanLift) vs. 0.4959 (EgoChoir baseline)
  • Peripheral 30° IoU: 0.4886 vs. 0.4553
  • Central 8° IoU: 0.3513 vs. 0.2612
  • Foveal 2° IoU: 0.2836 vs. 0.1987
  • Foveal 3D localization error: mean 34.85 cm (ours) vs. 73.47 cm (best 2D baseline)

FoVS-EgoExo test:

  • Orientation IoU: 0.5230 (ours) vs. 0.3287 (EgoChoir)
  • Peripheral: 0.5108 vs. 0.2851
  • Central: 0.4212 vs. 0.1976
  • Foveal: 0.3692 vs. 0.1266

For projected 2D anticipation on FoVS-Aria (foveal span), EgoSpanLift matches the top-performing CSTS model (F1 = 0.515), without any 2D-specific training.

7. Analysis, Limitations, and Future Directions

The multi-span volumetric paradigm adopted by EgoSpanLift models the dynamic coupling between global head orientation, broad peripheral vision, and fine-scale foveal fixations, enhancing the prediction of human visuomotor intent in 3D. Performance degradations of 15–40% IoU/F1 occur if either the span history or the global embedding is omitted. Binary cross-entropy loss impairs recall, especially in the foveal regime. Forecasting accuracy degrades smoothly as the prediction horizon is extended, yet remains above that of all available prior methods.

The method relies on the presence of semi-dense SLAM keypoints; performance may degrade in environments with sparse features or low texture. Voxelization at 20 cm offers efficient memory but limits sub-voxel precision; 10 cm grids can be used for improved accuracy at higher resource cost.
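The resolution trade-off can be made concrete with a back-of-the-envelope calculation (illustrative arithmetic, not a measurement from the paper): halving the voxel edge doubles $R$ and multiplies the voxel count, and hence memory and convolution cost per volume, by eight.

```python
# Voxel count per channel for the two grid resolutions mentioned above.
D = 3.2
for voxel in (0.2, 0.1):
    R = round(D / voxel)
    print(f"{voxel * 100:.0f} cm voxels -> R = {R}, {R ** 3} voxels per channel")
# 20 cm -> R = 16, 4096 voxels per channel
# 10 cm -> R = 32, 32768 voxels per channel (8x the memory per volume)
```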

A plausible implication is that extending EgoSpanLift to leverage multimodal sensing (audio, inertial measurements) or to operate directly on dense 3D meshes/TSDFs may further improve fidelity and robustness. End-to-end learning from raw RGB-D to occupancy spans is suggested as a future direction. The result establishes EgoSpanLift as the first real-time approach for forecasting the spatial locus of human gaze in 3D ambient environments (Yun et al., 23 Nov 2025).
