EyeNet: Vision & Medical Imaging Models

Updated 4 April 2026

EyeNet is a suite of deep learning models and datasets designed for tasks such as 3D point cloud segmentation, eye region analysis, and retinal disease diagnosis.
The models integrate human-vision principles, dual-contour sampling, multi-task learning, and attention mechanisms to improve data efficiency and accuracy.
Applications of EyeNet span outdoor scene segmentation, AR/VR eye gaze estimation, and clinical retinal screening, offering robust and lightweight solutions.

EyeNet is the name of several distinct deep learning models and datasets in computer vision and medical imaging. The term appears in the literature for: (1) a human-vision-inspired 3D point cloud semantic segmentation network for large-scale outdoor scenes; (2) multi-task neural architectures for eye-gaze estimation and eye-region semantic tasks in head-mounted display (HMD) setups; (3) an attention-based eye-region semantic segmentation network for AR/VR; and (4) a curated retinal disease image dataset for diagnosis. Each variant arises from a different problem and research community; they share only the name “EyeNet”.

1. Human-Vision-Based EyeNet for 3D Point Cloud Segmentation

The 3D EyeNet (Yoo et al., 2023) targets semantic segmentation of large-scale 3D outdoor point clouds. It addresses the limited effective receptive field of a network relative to the area covered by each input batch. Conventional approaches face a trade-off: small crops lack context, while large crops force aggressive downsampling or exceed GPU memory. EyeNet is motivated by human vision, which fuses high-density local (foveal) information with low-density global (peripheral) information.

Multi-Contour Input Representation

Each input is a random crop centered at a point $C_0$. Let $P_c = \{p_c^1,\dots,p_c^N\}$ be the $N$ points nearest to $C_0$; the central radius is then $R = \max_{p\in P_c}\|p - C_0\|$. The crop is split into:

A central region (radius $R$) with $N$ points,
A peripheral region (radius $2R$) with $3N/4$ newly sampled points from the annulus between $R$ and $2R$, plus $N/4$ “messenger” points copied from the central region.

This dual-contour mechanism allows coverage expansion with minimal increase in total unique points: $7N/4$.
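
The sampling scheme above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the function name `dual_contour_crop`, the uniform random choice of messenger points, and the fallback to sampling with replacement when the annulus is sparse are all assumptions made for the example.

```python
import math
import random


def dual_contour_crop(points, center, n, rng):
    """Return index lists (central, peripheral, messengers) and the radius R."""
    if n % 4 != 0 or len(points) < n:
        raise ValueError("need n divisible by 4 and at least n points")
    # Rank all points by Euclidean distance to the crop center C_0.
    dist = [math.dist(p, center) for p in points]
    order = sorted(range(len(points)), key=dist.__getitem__)
    central = order[:n]            # P_c: the N nearest points
    radius = dist[central[-1]]     # R = max distance within P_c
    # Candidate annulus points: outside P_c but within 2R of the center.
    annulus = [i for i in order[n:] if dist[i] <= 2 * radius]
    k_new = 3 * n // 4
    if len(annulus) >= k_new:
        new_points = rng.sample(annulus, k_new)
    else:
        # Assumption: pad by sampling with replacement if the annulus is sparse.
        new_points = rng.choices(annulus, k=k_new)
    # Assumption: messengers are drawn uniformly from the central region.
    messengers = rng.sample(central, n // 4)
    peripheral = new_points + messengers
    return central, peripheral, messengers, radius


# Synthetic, hypothetical point cloud: a flat 20 x 20 patch with small height.
rng = random.Random(0)
cloud = [(rng.uniform(-10, 10), rng.uniform(-10, 10), rng.uniform(0, 1))
         for _ in range(5000)]
central, peripheral, messengers, radius = dual_contour_crop(
    cloud, (0.0, 0.0, 0.5), 64, rng)
unique_points = len(set(central) | set(peripheral))
```

With $N = 64$, each stream receives 64 points, while the number of unique points is $64 + 48 = 112$, which matches $7N/4$.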

Parallel Processing and Connection Blocks

Two RandLA-Net encoder–decoders operate in parallel: one on the central points, one on the peripheral points. At each encoder level, “messenger” features are exchanged via a Connection Block consisting of shared MLPs, self-attentive pooling (SAP), channel enhancement (CE), and redistribution. After the parallel streams, a Feature Merging Block fuses their outputs using squeeze-and-excitation before the final per-point class prediction.

Loss Functions and Optimization

Main supervisory signal: Lovász-Softmax loss, directly optimizing mean IoU.
Local feature extraction: KNN-based local feature aggregation, as in RandLA-Net.
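
For readers unfamiliar with the Lovász-Softmax loss, the following pure-Python sketch shows the idea: per class, prediction errors are sorted in decreasing order and weighted by the discrete gradient of the Jaccard loss. It follows the commonly published formulation of the loss, not the EyeNet training code; the function names and the choice to skip classes absent from the batch are assumptions of this example.

```python
def lovasz_grad(fg_sorted):
    """Gradient of the Lovasz extension of the Jaccard loss.

    fg_sorted holds 0/1 foreground labels, sorted by decreasing error.
    """
    total_fg = sum(fg_sorted)
    grad, prev_jaccard = [], 0.0
    cum_fg = cum_bg = 0
    for fg in fg_sorted:
        cum_fg += fg
        cum_bg += 1 - fg
        intersection = total_fg - cum_fg
        union = total_fg + cum_bg
        jaccard = 1.0 - intersection / union
        grad.append(jaccard - prev_jaccard)
        prev_jaccard = jaccard
    return grad


def lovasz_softmax(probs, labels, num_classes):
    """probs[i][c] is the predicted probability of class c for point i."""
    losses = []
    for c in range(num_classes):
        fg = [1 if y == c else 0 for y in labels]
        if sum(fg) == 0:
            continue  # assumption: skip classes absent from this batch
        errors = [abs(f - p[c]) for f, p in zip(fg, probs)]
        order = sorted(range(len(errors)), key=errors.__getitem__, reverse=True)
        grad = lovasz_grad([fg[i] for i in order])
        losses.append(sum(errors[i] * g for i, g in zip(order, grad)))
    return sum(losses) / len(losses) if losses else 0.0


# Two points, two classes: a perfect and a fully wrong prediction.
perfect = lovasz_softmax([[1.0, 0.0], [0.0, 1.0]], [0, 1], 2)
wrong = lovasz_softmax([[0.0, 1.0], [1.0, 0.0]], [0, 1], 2)
```

A perfect prediction yields a loss of 0 and a fully wrong one a loss of 1, which mirrors the behaviour of 1 − IoU that the loss is a surrogate for.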

Experimental Setup and Results

EyeNet was validated on four large datasets (SensatUrban, Toronto3D, DALES, YUTO). It achieves 62.3% mIoU on SensatUrban (outperforming RandLA++ at 57.1% and KPConv at 57.6%), 81.13% mIoU on Toronto3D, and similar gains on the others. Ablation studies indicate that the multi-contour, parallel design delivers a gain of nearly 9 percentage points (pp) in mIoU over the baseline, and that Connection Blocks add nearly 1 pp more. The method is also data-efficient, needing ~60% fewer unique points for equivalent context than conventionally sized batches.
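
Since mIoU is the headline metric throughout this article, a small reference computation may help. The sketch below averages per-class intersection-over-union over the classes that occur in either the predictions or the labels; benchmark toolkits may treat absent classes differently, so that choice is an assumption here, as are the toy inputs.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union across classes, from flat label lists."""
    ious = []
    for c in range(num_classes):
        tp = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        fp = sum(1 for p, t in zip(pred, target) if p == c and t != c)
        fn = sum(1 for p, t in zip(pred, target) if p != c and t == c)
        if tp + fp + fn == 0:
            continue  # assumption: ignore classes absent from both lists
        ious.append(tp / (tp + fp + fn))
    return sum(ious) / len(ious) if ious else float("nan")


# Toy example with three classes and six points.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
miou = mean_iou(pred, target, 3)  # per-class IoU: 1/3, 2/3, 1/2
```

Here the per-class IoUs are 1/3, 2/3, and 1/2, so the mean is 0.5. A difference of 1 pp corresponds to a change of 0.01 in this value.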

2. EyeNet for Multi-Task Eye Gaze and Semantic User Understanding in HMDs

EyeNet is also the name of a multi-task deep neural architecture for off-axis gaze estimation and semantic eye analysis in VR/MR devices (Wu et al., 2019, Wu et al., 2020). The goal is a unified model for segmentation, keypoint localization, blink detection, expression classification, and gaze estimation under challenging off-axis IR eye imaging.

Model Architecture

Backbone: ResNet-50 with Feature Pyramid Network (FPN), processing normalized grayscale crops.
Segmentation branch: lightweight decoder producing per-pixel semantic labels (sclera, iris, pupil, background).
Localization branch: heat-map decoders for pupil and glints (5 keypoints).
Presence branch: classifies on/off state for each keypoint.
Cornea center regression: model-based supervision using geometric constraints from IR LED-glint geometry.
Blink and expression detection: shallow FC classifiers, some trained per-user.
Gaze mapping: “DeepGazeMapper”—5-layer FC network mapping 3D optical axis to visual axis, calibrated per subject.

Losses and Supervision

A weighted sum of segmentation (cross-entropy), keypoint localization (heatmap log-loss), presence (binary cross-entropy), cornea regression (Huber or geometric reprojection error), blink, and expression losses. Cornea center location is supervised by geometric computation (LED–glint–pupil geometry), not manual annotation.
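
The structure of such a weighted multi-task objective can be illustrated as follows. The per-task loss values and the weights are hypothetical placeholders chosen for the example, since the weights are not given here; only the Huber function follows its standard definition.

```python
def huber(residual, delta=1.0):
    """Standard Huber loss: quadratic near zero, linear for large residuals."""
    a = abs(residual)
    return 0.5 * a * a if a <= delta else delta * (a - 0.5 * delta)


def total_loss(task_losses, weights):
    """Weighted sum of per-task losses; every task must have a weight."""
    missing = set(task_losses) - set(weights)
    if missing:
        raise KeyError(f"no weight for tasks: {sorted(missing)}")
    return sum(weights[name] * value for name, value in task_losses.items())


# Hypothetical per-task loss values for one training step.
task_losses = {
    "segmentation": 0.40,   # cross-entropy
    "keypoints": 0.25,      # heatmap log-loss
    "presence": 0.10,       # binary cross-entropy
    "cornea": huber(1.5),   # Huber loss on a hypothetical residual of 1.5
    "blink": 0.05,
    "expression": 0.20,
}
# Hypothetical weights, not taken from the papers.
weights = {
    "segmentation": 1.0, "keypoints": 1.0, "presence": 0.5,
    "cornea": 2.0, "blink": 0.5, "expression": 0.5,
}
loss = total_loss(task_losses, weights)
```

In practice the weights balance tasks whose losses live on different scales, so they would normally be tuned on validation data.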

Dataset and Evaluation

The MagicEyes dataset (Wu et al., 2020) provides the empirical basis: 587 subjects, with both human-labeled and gaze-labeled images. EyeNet achieves 97.29% pixel-wise semantic segmentation accuracy, a pupil localization error of 0.46 px, and a mean gaze estimation error of 2.99° after fine-tuned mapping. The unified pipeline reduces hand-engineered heuristics, improves robustness under occlusion and off-axis conditions, and supports energy-efficient VR rendering.

3. EyeNet Attention-Based Encoder–Decoder for Eye Region Segmentation

EyeNet (Kansal et al., 2019) is an attention-based, residual encoder–decoder for multiclass (sclera, iris, pupil, background) per-pixel segmentation, principally targeting the OpenEDS challenge. The architecture incorporates:

Modified residual units in both encoder and decoder,
Channel and spatial attention (CBAM) at the bottleneck,
Channel-Squeeze Spatial-Excitation (CS-SE) at each decoding stage,
Multi-scale side-output supervision, with combined cross-entropy and softmax-Dice loss.
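
The combined cross-entropy and soft-Dice objective mentioned in the last item can be sketched as below for a single output scale. The equal weighting of the two terms, the smoothing constant, and the function names are assumptions of this example; with multi-scale supervision, a loss of this form would be evaluated on each side output and the results combined.

```python
import math


def cross_entropy(probs, labels, eps=1e-12):
    """Mean negative log-probability of the true class per pixel."""
    return -sum(math.log(max(p[y], eps)) for p, y in zip(probs, labels)) / len(labels)


def soft_dice_loss(probs, labels, num_classes, eps=1e-6):
    """1 - soft Dice coefficient, averaged over classes."""
    loss = 0.0
    for c in range(num_classes):
        inter = sum(p[c] for p, y in zip(probs, labels) if y == c)
        pred_mass = sum(p[c] for p in probs)
        true_mass = sum(1 for y in labels if y == c)
        loss += 1.0 - (2.0 * inter + eps) / (pred_mass + true_mass + eps)
    return loss / num_classes


def combined_loss(probs, labels, num_classes, dice_weight=1.0):
    # Assumption: the two terms are simply added with a scalar weight.
    return (cross_entropy(probs, labels)
            + dice_weight * soft_dice_loss(probs, labels, num_classes))


# Four pixels, one per class (background, sclera, iris, pupil).
labels = [0, 1, 2, 3]
one_hot = [[1.0 if c == y else 0.0 for c in range(4)] for y in labels]
uniform = [[0.25] * 4 for _ in labels]
loss_perfect = combined_loss(one_hot, labels, 4)
loss_uniform = combined_loss(uniform, labels, 4)
```

The Dice term counteracts class imbalance (the pupil occupies few pixels), while cross-entropy provides well-behaved per-pixel gradients.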

Empirical Metrics

On OpenEDS, EyeNet achieves mIoU = 95.5% (val), 94.9% (test), and mean Dice 0.970. The model outperforms the mSegNet baseline by 4.8 pp mIoU with only 1/9 the parameter count (1.48M).

Ablation Studies

Removal of CBAM or CS-SE modules causes nontrivial mIoU drops (0.9 and 2.5 pp, respectively). Coordinate convolution and multi-scale Dice supervision enhance edge precision and overall segmentation reliability.

4. EyeNet Dataset for Multi-Class Retina Disease Diagnosis

EyeNet (Yang et al., 2018) refers to a curated dataset for 32-class retinal disease diagnosis, comprising color fundus images drawn from the Retina Image Bank. Each image is accompanied by a clinical diagnosis as provided by a board-certified ophthalmologist.

Dataset Structure and Usage

Train/val/test splits: 70/10/20%, with random (not patient-wise) allocation.
Classes cover a broad spectrum (e.g., AMD, disc hemorrhage).
Preprocessing: U-Net segmentation of vasculature for vessel-features, PCA reduction for SVM input.
Classification: Hybrid deep+SVM pipeline, yielding 89.73% overall classification accuracy.
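
A random 70/10/20 split of the kind described above can be sketched as follows. The image identifiers are hypothetical and the helper `random_split` is illustrative only. Because the allocation is random and not patient-wise, images from the same patient may land in different splits.

```python
import random


def random_split(items, fractions=(0.7, 0.1, 0.2), seed=0):
    """Shuffle items and cut them into train/val/test partitions."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = round(fractions[0] * n)
    n_val = round(fractions[1] * n)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder goes to the test split
    return train, val, test


# Hypothetical image identifiers, for illustration only.
image_ids = [f"img_{i:04d}" for i in range(1000)]
train, val, test = random_split(image_ids)
```

A patient-wise variant would shuffle patient identifiers and assign all images of a patient to the same partition, which avoids leakage between training and test data.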

Comparison With Other Datasets

EyeNet differs from DR-only (Kaggle, etc.) or vessel-only (DRIVE, STARE) datasets by spanning 32 pathologies and using clinical labels from practicing clinicians.

5. Related Model: DeepEyeNet

While not itself “EyeNet,” DeepEyeNet (Roy et al., 19 Jan 2025) is a recent, related model: it applies advanced neural and optimization techniques to glaucoma diagnosis from fundus images. Its AGBO-optimized ConvNeXtTiny + U-Net feature extraction pipeline achieves state-of-the-art 95.84% accuracy on the EyePACS-AIROGS-light-V2 dataset.

6. Key Innovations, Limitations, and Outlook

Several common themes emerge across EyeNet models:

Motivations grounded in biomimetic vision (e.g., large-field context via peripheral sampling (Yoo et al., 2023)).
Multi-task learning for joint semantic/regression/classification tasks under challenging imaging conditions (Wu et al., 2019, Wu et al., 2020).
Lightweight, computationally efficient design for deployment (notably in AR/VR, mobile settings) (Kansal et al., 2019).
Integration with, or derived from, clinical workflows (for medical EyeNet datasets).

Limitations of the individual variants include:

Fixed backbone or specific block design (e.g., RandLA-Net in the 3D EyeNet).
Per-subject calibration needs for gaze estimation.
Heterogeneous data restrictions (fundus only, patient mix, etc.).

Future work suggested in these papers includes porting core ideas (dual-contour, attention, optimization) to stronger or more general backbones, developing learned/adaptive sampling for 3D segmentation, extending to multi-disease screening, and adapting models for mobile/real-time use.

7. Bibliographic Table

EyeNet Variant	Core Domain / Task	Key Reference
3D Point Cloud Segmentation	Large-scale outdoor semantic segmentation	(Yoo et al., 2023)
Multi-task Gaze & Eye Understanding	Eye region analysis in HMDs	(Wu et al., 2019, Wu et al., 2020)
Eye Region Segmentation (VR/AR)	Four-way eye semantic segmentation	(Kansal et al., 2019)
Retinal Disease Dataset	32-class clinical fundus classification	(Yang et al., 2018)

References

(Yoo et al., 2023): Human Vision Based 3D Point Cloud Semantic Segmentation of Large-Scale Outdoor Scene
(Wu et al., 2019): EyeNet: A Multi-Task Network for Off-Axis Eye Gaze Estimation and User Understanding
(Wu et al., 2020): MagicEyes: A Large Scale Eye Gaze Estimation Dataset for Mixed Reality
(Kansal et al., 2019): Eyenet: Attention based Convolutional Encoder-Decoder Network for Eye Region Segmentation
(Yang et al., 2018): A Novel Hybrid Machine Learning Model for Auto-Classification of Retinal Diseases
(Roy et al., 19 Jan 2025): DeepEyeNet: Adaptive Genetic Bayesian Algorithm Based Hybrid ConvNeXtTiny Framework For Multi-Feature Glaucoma Eye Diagnosis