EgoSpanLift: 3D Visual Span Forecasting
- EgoSpanLift is a spatio-temporal forecasting method that predicts human gaze in 3D using structured voxel representations derived from SLAM keypoints and gaze directions.
- It leverages multi-scale volumetric encoding by partitioning voxel grids into angular spans, capturing foveal to peripheral vision with precise spatial delineation.
- The approach achieves state-of-the-art IoU improvements over baseline methods while enabling real-time predictions on commodity GPUs for AR/VR and robotics applications.
EgoSpanLift is a method for spatio-temporal forecasting of human visual span in egocentric 3D environments, transitioning gaze prediction from traditional 2D image coordinates to structured, multi-scale 3D voxel representations. By leveraging inputs from simultaneous localization and mapping (SLAM)-derived keypoints, precise gaze directions, and camera trajectories, EgoSpanLift encodes recent visual behavior into volumetric occupancy grids—partitioned into multiple angular spans—and forecasts where an observer’s gaze and peripheral attention will focus in the future. The method advances anticipatory scene understanding for applications in AR/VR, robotics, and assistive perception, and establishes the first large-scale, real-time benchmark for egocentric 3D visual span prediction (Yun et al., 23 Nov 2025).
1. Formal Problem Statement
EgoSpanLift addresses the task of egocentric 3D visual-span forecasting. The system assumes access to:
- a video stream and SLAM-recovered semi-dense keypoints over time,
- camera-to-world poses (extrinsics),
- framewise gaze directions in the local camera ("pupil") frame,
- a fixed temporal observation window preceding the current time.
The prediction target is a set of four 3D binary occupancy grids, one per visual span: head orientation (55°), near-peripheral (30°), central (8°), and foveal (2°). These occupancy volumes forecast which voxels in a local cube surrounding the camera will fall within the user's visual span during a future interval (2 s is typical for everyday activity, 4 s for skilled tasks).
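For concreteness, the following minimal sketch lays out the inputs and the prediction target as plain Python containers. The class names and the symbols T (number of observed frames) and N (voxels per cube edge) are illustrative assumptions, not notation taken from the paper.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class ObservationWindow:
    """Observed history over the fixed window preceding the current time."""
    keypoints: list           # per-frame SLAM semi-dense keypoints, each a (K_i, 3) array in world coordinates
    cam_to_world: np.ndarray  # per-frame camera extrinsics, shape (T, 4, 4)
    gaze_dirs: np.ndarray     # per-frame unit gaze directions in the local pupil frame, shape (T, 3)

@dataclass
class SpanTarget:
    """Prediction target: one binary occupancy grid per visual span
    (orientation 55°, near-peripheral 30°, central 8°, foveal 2°),
    aggregated over the future interval; shape (4, N, N, N)."""
    grids: np.ndarray
```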
2. Geometry Conversion and Volumetric Encoding
The transformation from raw SLAM points and gaze vectors to 3D span volumes consists of three steps:
- Keypoint Filtering: At each timestep, EgoSpanLift selects keypoints from the SLAM stream, retaining only those inside a local cube centered at the camera and passing a spatial outlier filter.
- Gaze-Compatible Classification: Each retained keypoint is transformed into the local pupil frame. A keypoint is assigned to span $s$ if the angle between the framewise gaze direction and the keypoint's direction in the pupil frame does not exceed the span's angular threshold $\theta_s$. The thresholds are set to 2°, 8°, 30°, and 55°, delineating the foveal, central, near-peripheral, and orientation-level spans, respectively.
- Voxel Occupancy Encoding: Across the observation interval, all retained keypoints are embedded into a regular Cartesian grid (default 20 cm voxels) by setting a binary indicator for every voxel that contains at least one keypoint of the corresponding span. Stacking the four visual spans plus a general scene channel yields a 5-channel volume for every timepoint; a minimal sketch of this conversion appears after the list.
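The NumPy sketch below illustrates the three-step lift for a single timestep, under simplifying assumptions: the cube half-width is passed in by the caller (the paper's exact cube size is not reproduced here), the additional spatial outlier filter is omitted, and all function and argument names are illustrative.

```python
import numpy as np

# Span half-angles in degrees: foveal, central, near-peripheral, orientation.
SPAN_DEG = np.array([2.0, 8.0, 30.0, 55.0])

def lift_to_span_volumes(points_w, world_to_pupil, gaze_dir, half_width, n_vox):
    """Convert SLAM keypoints + gaze into a 5-channel binary occupancy volume.

    points_w:       (K, 3) keypoints in the world frame
    world_to_pupil: (4, 4) rigid transform into the local pupil frame
    gaze_dir:       (3,) unit gaze direction in the pupil frame
    half_width:     half side-length of the local cube in metres (assumed input)
    n_vox:          voxels per cube edge (e.g. 20 cm voxels)
    """
    # 1) Keypoint filtering: keep points inside the local cube around the camera
    #    (the paper's additional spatial outlier filter is omitted here).
    homog = np.c_[points_w, np.ones(len(points_w))]
    p_local = (world_to_pupil @ homog.T).T[:, :3]
    p_local = p_local[np.all(np.abs(p_local) <= half_width, axis=1)]

    # 2) Gaze-compatible classification: angle between gaze and keypoint direction.
    dirs = p_local / (np.linalg.norm(p_local, axis=1, keepdims=True) + 1e-9)
    ang = np.degrees(np.arccos(np.clip(dirs @ gaze_dir, -1.0, 1.0)))

    # 3) Voxel occupancy encoding: binary indicator per voxel, per span channel,
    #    plus a general scene channel containing every retained keypoint.
    vol = np.zeros((5, n_vox, n_vox, n_vox), dtype=np.uint8)
    idx = np.clip(((p_local + half_width) / (2 * half_width) * n_vox).astype(int), 0, n_vox - 1)
    for c, theta in enumerate(SPAN_DEG):          # channels 0..3: foveal ... orientation
        sel = idx[ang <= theta]
        vol[c, sel[:, 0], sel[:, 1], sel[:, 2]] = 1
    vol[4, idx[:, 0], idx[:, 1], idx[:, 2]] = 1   # channel 4: general scene
    return vol
```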
3. Spatio-Temporal Network Architecture
EgoSpanLift combines spatio-temporal encoding with volumetric prediction via the following architecture:
- Input Construction: An input tensor encodes the full observation window as a sequence of 5-channel volumes.
- 3D U-Net Encoder: Each frame is passed through a 3D U-Net encoder path consisting of repeated 3D convolutions, ReLU activations, and max-pooling; the channel count doubles at each stage.
- Global Embedding: The entire input is average-pooled and projected to an embedding that is concatenated with the temporal sequence as a special "head" token.
- Unidirectional Transformer: The combined sequence is processed by a transformer encoder with causal masking, using multi-head self-attention and layer normalization. The output at the head token summarizes the temporal context.
- 3D U-Net Decoder: The head-token embedding is upsampled through transpose convolutions with skip connections to the encoder features. The final predicted tensor (after a sigmoid) provides soft occupancy for each future visual span. A rough architectural sketch follows this list.
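The PyTorch sketch below shows one plausible arrangement of the described components. Channel widths, network depth, embedding size, and the exact way the temporal summary is fused back into the decoder are illustrative choices, not the paper's reported configuration.

```python
import torch
import torch.nn as nn

class EgoSpanNetSketch(nn.Module):
    """Illustrative encoder-transformer-decoder layout (not the paper's exact configuration)."""

    def __init__(self, in_ch=5, out_ch=4, base=16, d_model=256, n_heads=8, n_layers=4):
        super().__init__()
        def block(ci, co):
            return nn.Sequential(
                nn.Conv3d(ci, co, kernel_size=3, padding=1), nn.ReLU(inplace=True),
                nn.Conv3d(co, co, kernel_size=3, padding=1), nn.ReLU(inplace=True))
        self.enc1, self.enc2 = block(in_ch, base), block(base, 2 * base)
        self.pool = nn.MaxPool3d(2)
        self.bottleneck = block(2 * base, 4 * base)
        self.to_token = nn.Linear(4 * base, d_model)              # per-frame token from pooled features
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, n_layers)    # unidirectional via causal mask
        self.from_token = nn.Linear(d_model, 4 * base)
        self.up2 = nn.ConvTranspose3d(4 * base, 2 * base, kernel_size=2, stride=2)
        self.dec2 = block(4 * base, 2 * base)
        self.up1 = nn.ConvTranspose3d(2 * base, base, kernel_size=2, stride=2)
        self.dec1 = block(2 * base, base)
        self.out = nn.Conv3d(base, out_ch, kernel_size=1)

    def forward(self, x):                                   # x: (B, T, 5, N, N, N)
        B, T = x.shape[:2]
        frames = x.flatten(0, 1)                            # (B*T, 5, N, N, N)
        s1 = self.enc1(frames)                              # skip, full resolution
        s2 = self.enc2(self.pool(s1))                       # skip, 1/2 resolution
        bott = self.bottleneck(self.pool(s2))               # (B*T, 4*base, N/4, N/4, N/4)
        tokens = self.to_token(bott.mean(dim=(2, 3, 4))).view(B, T, -1)
        head = tokens.mean(dim=1, keepdim=True)             # global average-pooled "head" token
        seq = torch.cat([tokens, head], dim=1)              # (B, T+1, d_model)
        causal = torch.triu(torch.full((T + 1, T + 1), float("-inf"), device=x.device), diagonal=1)
        summary = self.temporal(seq, mask=causal)[:, -1]    # head-token output summarises the window
        # Fuse the temporal summary back onto the most recent frame's bottleneck grid.
        last = bott.view(B, T, *bott.shape[1:])[:, -1]
        g = self.from_token(summary)[:, :, None, None, None].expand_as(last)
        d2 = self.dec2(torch.cat([self.up2(g), s2.view(B, T, *s2.shape[1:])[:, -1]], dim=1))
        d1 = self.dec1(torch.cat([self.up1(d2), s1.view(B, T, *s1.shape[1:])[:, -1]], dim=1))
        return torch.sigmoid(self.out(d1))                  # (B, 4, N, N, N) soft span occupancy
```

For example, `EgoSpanNetSketch()(torch.rand(2, 8, 5, 32, 32, 32))` returns a `(2, 4, 32, 32, 32)` tensor of soft occupancies (the grid size must be divisible by 4 for the two pooling stages).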
4. Training Procedure, Losses, and Inference
EgoSpanLift is supervised with a multi-class Dice loss, which is well suited to sparse targets in volumetric grids:

$$\mathcal{L}_{\text{Dice}} = 1 - \frac{2\sum_{c,v} p_{c,v}\, y_{c,v} + 1}{\sum_{c,v} p_{c,v} + \sum_{c,v} y_{c,v} + 1},$$

where $p_{c,v}$ and $y_{c,v}$ are the predicted and ground-truth occupancies, the sums run over the four span channels $c$ and all voxels $v$, and the constant 1 smooths the division. Binary cross-entropy was examined in ablation but reduced foveal-span IoU relative to the Dice loss. The authors report that inference on a commodity GPU (12 GB VRAM) runs in 71 ms per sample. No details are given for batch size, learning-rate schedule, or augmentation (Yun et al., 23 Nov 2025).
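A hedged PyTorch rendering of this loss, following the literal description above (sums taken jointly over the four span channels and all voxels, smoothing constant 1), might look as follows; per-channel weighting variants are also plausible but are not specified in the source.

```python
import torch

def multispan_dice_loss(pred: torch.Tensor, target: torch.Tensor, smooth: float = 1.0) -> torch.Tensor:
    """Dice loss over the four span channels.

    pred:   (B, 4, N, N, N) soft occupancies in [0, 1] (post-sigmoid)
    target: (B, 4, N, N, N) binary ground-truth span occupancies
    """
    inter = (pred * target).sum()          # sum over batch, channels, and voxels
    union = pred.sum() + target.sum()
    return 1.0 - (2.0 * inter + smooth) / (union + smooth)
```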
5. Benchmark Datasets and Evaluation Protocol
A large-scale, two-part benchmark was constructed for evaluation:
| Dataset | Domain/Source | Window/Stride | N Train / Val / Test | Task |
|---|---|---|---|---|
| FoVS-Aria | Aria Everyday Activities | 2 s/1 s | 19.3K/1.9K/2.1K | Forecast 2 s 3D span |
| FoVS-EgoExo | Ego-Exo4D (skilled tasks) | 4 s/— | 274.7K/29.6K/37.0K | Forecast 4 s 3D span |
Evaluation metrics:
- 3D Intersection-over-Union (IoU) for each span is the primary metric;
- Voxelwise F1 is also reported (a sketch of the voxel-level IoU/F1 computation follows this list);
- For foveal spans, min/avg/max 3D localization errors (cm);
- For image-projected results, F1, precision, and recall on the image plane.
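A straightforward NumPy implementation of the voxel-level metrics for a single span channel is sketched below; it assumes binary occupancy grids of identical shape and is not the benchmark's official evaluation code.

```python
import numpy as np

def voxel_iou_f1(pred: np.ndarray, target: np.ndarray):
    """3D IoU and voxelwise F1 between two binary occupancy grids of equal shape."""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    iou = inter / union if union else 0.0
    precision = inter / pred.sum() if pred.sum() else 0.0
    recall = inter / target.sum() if target.sum() else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return iou, f1
```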
6. Quantitative Results
EgoSpanLift consistently improves upon prior baselines in both everyday and skilled-activity domains:
FoVS-Aria test:
- Orientation span IoU: 0.5838 (EgoSpanLift) vs. 0.4959 (EgoChoir baseline)
- Peripheral 30° IoU: 0.4886 vs. 0.4553
- Central 8° IoU: 0.3513 vs. 0.2612
- Foveal 2° IoU: 0.2836 vs. 0.1987
- Foveal 3D localization error: mean 34.85 cm (ours) vs. 73.47 cm (best 2D baseline)
FoVS-EgoExo test:
- Orientation IoU: 0.5230 (ours) vs. 0.3287 (EgoChoir)
- Peripheral: 0.5108 vs. 0.2851
- Central: 0.4212 vs. 0.1976
- Foveal: 0.3692 vs. 0.1266
For projected 2D anticipation on FoVS-Aria (foveal span), EgoSpanLift matches the top-performing CSTS model (F1 = 0.515), without any 2D-specific training.
7. Analysis, Limitations, and Future Directions
The multi-span volumetric paradigm adopted by EgoSpanLift models the dynamic coupling between global head orientation, broad peripheral vision, and fine-scale foveal fixations, improving the prediction of human visuomotor intent in 3D. Ablations show performance drops of 15–40% IoU/F1 when either the span history or the global embedding is omitted. A binary cross-entropy loss impairs recall, especially in the foveal regime. Forecasting accuracy degrades smoothly as the prediction horizon is extended, yet remains above that of all prior methods.
The method relies on the presence of semi-dense SLAM keypoints; performance may degrade in environments with sparse features or low texture. Voxelization at 20 cm offers efficient memory but limits sub-voxel precision; 10 cm grids can be used for improved accuracy at higher resource cost.
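As a back-of-the-envelope illustration of this resolution/memory trade-off (the cube side length used here is an assumed value, not taken from the paper), halving the voxel edge multiplies the voxel count of a fixed-size cube, and hence dense-grid memory, by eight:

```python
def voxels_per_edge(cube_side_m: float, voxel_m: float) -> int:
    # Number of voxels along one edge of a cube of the given side length.
    return round(cube_side_m / voxel_m)

side = 4.0  # assumed cube side in metres, for illustration only
coarse = voxels_per_edge(side, 0.20) ** 3   # 20 cm voxels -> 20**3 = 8,000 voxels
fine = voxels_per_edge(side, 0.10) ** 3     # 10 cm voxels -> 40**3 = 64,000 voxels
print(coarse, fine, fine // coarse)         # memory grows by the same 8x factor
```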
A plausible implication is that extending EgoSpanLift to leverage multimodal sensing (audio, inertial measurements) or to operate directly on dense 3D meshes or TSDFs may further improve fidelity and robustness. End-to-end learning from raw RGB-D to occupancy spans is suggested as a future direction. The result establishes EgoSpanLift as the first real-time approach for forecasting the spatial locus of human gaze in 3D ambient environments (Yun et al., 23 Nov 2025).