Event-based Segmentation Dataset (ESD)

Updated 16 May 2026

ESD is a family of datasets providing instance-level segmentation ground truth using synchronized event streams and APS frames from robotic setups.
It applies systematic degradation protocols—occlusion, motion blur, lighting, scale, and trajectory—to benchmark segmentation robustness under realistic conditions.
The dataset enables cross-modal and zero-shot evaluation, supporting analysis of novel object generalization and fusion of event-based and RGB data.

The Event-based Segmentation Dataset (ESD) is a family of datasets designed to support the development and evaluation of segmentation algorithms for event-based vision sensors. ESD benchmarks exploit the unique capabilities of event cameras—microsecond temporal resolution, high dynamic range, and resilience to motion blur—by providing instance- or semantic-level ground truth aligned with raw asynchronous events. ESDs primarily target scenarios such as robotic manipulation in dynamic environments, but also address broader scenes including autonomous driving, biological tracking, and general dynamic scene understanding. Different ESD variants share a common focus on challenging segmentation tasks under complex conditions like occlusion, fast motion, low light, and clutter, offering a critical platform for advancing neuromorphic vision research (Kachole et al., 2023, Kachole et al., 2023).

1. Dataset Variants and Acquisition Protocols

Most event-based segmentation datasets under the "ESD" designation—particularly as described in (Kachole et al., 2023) and (Kachole et al., 2023)—are built upon DAVIS346 (346×260 px) or similar event+APS setups mounted on robotic manipulators ("eye-in-hand") for robotic grasping. These datasets comprise both RGB/APS frames (typically 40 Hz) with per-pixel instance masks and asynchronous event streams; all visual data is spatially and temporally aligned via common optics and hardware triggering.

ESD-1: Comprises 13,984 training APS frames and 3,202 test frames, featuring 10 known object types (e.g., bottle, box, mouse, pouch, book, platform), with each class roughly uniformly distributed.
ESD-2: Contains 5 object types held out entirely from training, enabling evaluation of cross-domain generalization and zero-shot segmentation.
Environmental variation: Each sequence systematically covers five axes of degradation: clutter/occlusion (2–10 objects per scene), trajectory (linear, rotational, partial-rotational motion), camera speed (0.1–1 m/s), lighting (normal vs. low-light), and scale (camera height 62 cm vs. 82 cm) (Kachole et al., 2023, Kachole et al., 2023).

Data volume, acquisition design, and class splits are consistent across releases, with 17,186 APS frames and 177 event streams as the canonical ESD scale. Some variants, such as (Huang et al., 2023), use stereo event cameras and RGBD for depth-labeled segmentation, but the robotic grasping-focused ESDs center on robotic-arm deployments.

2. Event Data Representation and Preprocessing

ESDs employ raw DVS data as streams of address-event tuples $(x_i, y_i, t_i, p_i)$, where $(x_i, y_i)$ is the pixel location, $p_i \in \{0,1\}$ is the event polarity, and $t_i$ is the microsecond timestamp. Events are typically processed into temporally binned tensors or event frames for network input:

Binned event frames: Events are accumulated between two APS timepoints (e.g., $T=25$ ms), forming 296×296×2 tensors (one channel per polarity). The event frame formula is:

$E_t(x, y, p) = \sum_{\forall e}\mathrm{rect}\left(\frac{t_e}{T}-0.5 - t\right) \delta_{x,x_e}\,\delta_{y,y_e}\,\delta_{p,p_e}$

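where $\delta$ is the Kronecker delta and $\mathrm{rect}(\cdot)$ is the unit rectangle function, which is nonzero only when the normalized timestamp $t_e/T$ falls inside bin $t$, so each event contributes to exactly one bin (Kachole et al., 2023).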

Noise suppression: On-sensor thresholding and additional spatial/temporal filtering reduce spurious events.
Graph construction: In some ESD studies (Kachole et al., 2023), sparse event graphs are built from asynchronous events in 100 ms windows (capped at $N_{\max}=10{,}000$ nodes) and consumed directly by GNN approaches, avoiding rasterization and thereby preserving temporal fidelity.
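The binned event frame formula above can be sketched in a few lines of plain Python. This is a minimal illustration, not the ESD authors' code: the function name `bin_events`, the half-open bin convention, and the native 346×260 sensor size (before any crop or resize to 296×296) are assumptions made for the example.

```python
def bin_events(events, bin_us=25_000, width=346, height=260):
    """Accumulate (x, y, t_us, p) events into per-bin, per-polarity count frames.

    Returns {bin_index: frame}, where frame[p][y][x] is the number of events
    with polarity p at pixel (x, y) inside that bin.
    Assumption: bins are half-open intervals [t*T, (t+1)*T).
    """
    frames = {}
    for x, y, t_us, p in events:
        if not (0 <= x < width and 0 <= y < height) or p not in (0, 1):
            continue  # skip malformed events rather than guess a fix
        b = t_us // bin_us  # rect(t_e/T - 0.5 - t) is nonzero only for this bin
        if b not in frames:
            frames[b] = [[[0] * width for _ in range(height)] for _ in range(2)]
        frames[b][p][y][x] += 1
    return frames


# Four synthetic events: two in bin 0, one in bin 1, one in bin 2.
events = [(10, 20, 1_000, 1), (10, 20, 24_999, 1), (10, 20, 25_000, 0), (5, 5, 60_000, 0)]
frames = bin_events(events)
```

Nested lists keep the sketch dependency-free; an array library would be the natural choice for full-resolution streams.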

APS/RGB frames are center-cropped and resized (e.g., 296×296×3), with pixel-wise masks annotated per frame.
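The event-graph construction mentioned above (100 ms windows, at most 10,000 nodes) might look roughly like the following. Only the window length and the node cap come from the text; the uniform random subsampling, the spatio-temporal edge rule, and the names `build_event_graph`, `radius_px`, and `radius_us` are hypothetical choices for illustration.

```python
import random


def build_event_graph(events, t_start_us, window_us=100_000, max_nodes=10_000,
                      radius_px=3, radius_us=5_000, seed=0):
    """Return (nodes, edges) for events in [t_start_us, t_start_us + window_us).

    nodes: time-sorted list of (x, y, t_us, p) tuples.
    edges: list of (i, j) index pairs with i < j.
    Assumptions: uniform subsampling enforces the node cap, and two events are
    linked when they are close in both pixel distance and time.
    """
    t_end_us = t_start_us + window_us
    nodes = [e for e in events if t_start_us <= e[2] < t_end_us]
    if len(nodes) > max_nodes:
        nodes = random.Random(seed).sample(nodes, max_nodes)
    nodes.sort(key=lambda e: e[2])

    edges = []
    for i, (xi, yi, ti, _) in enumerate(nodes):
        for j in range(i + 1, len(nodes)):
            xj, yj, tj, _ = nodes[j]
            if tj - ti > radius_us:
                break  # nodes are time-sorted, so later ones are even farther away
            if abs(xi - xj) <= radius_px and abs(yi - yj) <= radius_px:
                edges.append((i, j))
    return nodes, edges


sample = [(10, 10, 0, 1), (11, 10, 1_000, 1), (50, 50, 2_000, 0), (10, 11, 200_000, 1)]
nodes, edges = build_event_graph(sample, t_start_us=0)
capped_nodes, _ = build_event_graph(sample, t_start_us=0, max_nodes=2)
```

Because no rasterization takes place, each node keeps its original microsecond timestamp.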

3. Annotation, Ground Truth, and Degradation Protocols

Annotation in ESD is primarily manual: expert annotators create per-pixel instance masks for all objects and background in each APS frame. For event labels, alignment is achieved by time-window association: all events within an interval inherit the label(s) of the aligned APS mask. In some protocols, especially for fine event-wise analysis, events are labeled directly via geometric projection of masks into event camera coordinates (potentially refined via iterative closest point registration as in (Huang et al., 2023)).
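The time-window association described above can be sketched as a lookup from each event to an APS mask. This is an assumption-laden illustration: it assigns each event to the most recent frame at or before its timestamp and presumes events and masks share pixel coordinates. The function name `label_events` and the convention that instance id 0 means background are hypothetical.

```python
from bisect import bisect_right


def label_events(events, frame_times_us, masks):
    """Attach an instance id to each (x, y, t_us, p) event.

    frame_times_us: sorted APS frame timestamps in microseconds.
    masks[k][y][x]: instance id at frame k (0 = background, by assumption).
    Events that precede the first frame are dropped.
    """
    labeled = []
    for x, y, t_us, p in events:
        k = bisect_right(frame_times_us, t_us) - 1  # latest frame at or before t_us
        if k < 0:
            continue
        labeled.append((x, y, t_us, p, masks[k][y][x]))
    return labeled


# Two tiny 2x2 masks taken 25 ms apart.
frame_times_us = [0, 25_000]
masks = [
    [[0, 1], [0, 0]],  # frame 0: instance 1 at pixel (x=1, y=0)
    [[2, 0], [0, 0]],  # frame 1: instance 2 at pixel (x=0, y=0)
]
events = [(1, 0, 100, 1), (0, 0, 30_000, 0), (0, 1, -5, 1)]
labeled = label_events(events, frame_times_us, masks)
```

Assigning events to the nearest frame in time, or to the following frame, would be equally simple variants; the text does not say which convention the dataset uses.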

The primary challenge axes used to define degradation protocols are:

Occlusion (clutter): 2–10 objects per scene, directly modulating the occlusion ratio.
Motion blur (speed): Scenes recorded at 0.1, 0.15, and 0.3 m/s; higher speeds of up to 1 m/s are also present.
Brightness variation: Alternating between normal and low-light regimes.
Trajectory: Linear, rotational, or partial-rotational motion of the camera/arm.
Scale (height): 62 cm vs. 82 cm camera vertical position (Kachole et al., 2023, Kachole et al., 2023).

Controlled protocol design enables systematic evaluation of algorithm robustness across real-world degradations.
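One way to work with these axes is to attach a small condition record to each sequence and filter on it for per-challenge analysis. The sketch below is hypothetical: the `Condition` class, its field names, and the `select` helper are not part of the ESD release, and the released files may encode conditions differently. Only the value ranges are taken from the list above.

```python
from dataclasses import dataclass

TRAJECTORIES = ("linear", "rotational", "partial_rotational")
LIGHTING = ("normal", "low")
HEIGHTS_CM = (62, 82)


@dataclass(frozen=True)
class Condition:
    """Recording condition of one sequence (hypothetical metadata record)."""
    num_objects: int
    speed_m_s: float
    lighting: str
    trajectory: str
    height_cm: int

    def __post_init__(self):
        if not 2 <= self.num_objects <= 10:
            raise ValueError("num_objects must be between 2 and 10")
        if not 0.1 <= self.speed_m_s <= 1.0:
            raise ValueError("speed_m_s must be between 0.1 and 1.0")
        if self.lighting not in LIGHTING:
            raise ValueError(f"lighting must be one of {LIGHTING}")
        if self.trajectory not in TRAJECTORIES:
            raise ValueError(f"trajectory must be one of {TRAJECTORIES}")
        if self.height_cm not in HEIGHTS_CM:
            raise ValueError(f"height_cm must be one of {HEIGHTS_CM}")


def select(conditions, **fixed):
    """Return the conditions that match every fixed axis value."""
    return [c for c in conditions
            if all(getattr(c, name) == value for name, value in fixed.items())]


conditions = [
    Condition(2, 0.1, "normal", "linear", 62),
    Condition(10, 0.3, "low", "rotational", 82),
    Condition(6, 1.0, "low", "partial_rotational", 62),
]
low_light = select(conditions, lighting="low")
```

Validating in `__post_init__` rejects out-of-protocol values early, which helps when condition labels are parsed from file names or logs.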

4. Evaluation Metrics, Task Definitions, and Baseline Results

ESD benchmark tasks include both semantic and instance segmentation under multi-modal input regimes (event-only, APS-only, and fused). The following metrics are reported:

Mean Intersection over Union (mIoU):

$\mathrm{mIoU} = \frac{1}{C}\sum_{c=1}^C \frac{TP_c}{TP_c + FP_c + FN_c}$

where $TP_c$, $FP_c$, and $FN_c$ are the true positives, false positives, and false negatives for class $c$ (Kachole et al., 2023, Kachole et al., 2023).

Pixel-wise accuracy:

$\mathrm{Acc} = \frac{\text{number of correctly classified pixels}}{\text{total number of pixels}}$

Graph-based accuracy: For graph-structured evaluation, accuracy is computed per event node, with assignment ratios aggregated over classes or objects (Kachole et al., 2023).
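The metrics above can be computed from flat sequences of predicted and ground-truth labels, as in the following sketch. The function name `segmentation_scores` is hypothetical, and skipping classes that appear in neither prediction nor ground truth is an assumption (the mIoU formula as written averages over all $C$ classes). The same accuracy computation applies to per-event node labels in the graph setting.

```python
def segmentation_scores(pred, truth, num_classes):
    """Return (mIoU, accuracy) for two equal-length flat label sequences."""
    if len(pred) != len(truth):
        raise ValueError("pred and truth must have the same length")

    tp = [0] * num_classes
    fp = [0] * num_classes
    fn = [0] * num_classes
    for p, t in zip(pred, truth):
        if p == t:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p where it is absent
            fn[t] += 1  # missed the true class t

    # Assumption: classes absent from both pred and truth are left out of the mean.
    ious = [tp[c] / (tp[c] + fp[c] + fn[c])
            for c in range(num_classes)
            if tp[c] + fp[c] + fn[c] > 0]
    miou = sum(ious) / len(ious) if ious else 0.0
    accuracy = sum(tp) / len(truth) if truth else 0.0
    return miou, accuracy


# Four pixels, two classes: one pixel of class 0 is mislabeled as class 1.
miou, accuracy = segmentation_scores(pred=[0, 1, 1, 1], truth=[0, 0, 1, 1], num_classes=2)
```

In this example the per-class IoUs are 1/2 and 2/3, so mIoU is about 0.583 while accuracy is 0.75, which shows how the two metrics diverge when errors concentrate in one class.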

Precision, recall, and F1 are not standard metrics in the original ESD studies, but other event-based segmentation datasets often report IoU, F1, and detection-rate figures (Zhou et al., 2020, Huang et al., 2023).

Method	mIoU (%)	Pixel Accuracy (%)
RFNet [Sun20]	58.4	60.0
ISSAFE [Zhang20]	66.2	65.0
SA-Gate [Pan19]	69.3	76.0
CMX [Liu22]	76.1	86.0
BimodalSegNet	83.2	87.2
GMNN [GraphMixer]	78.3	96.9

Benchmarking includes both known and zero-shot unknown object splits, as well as per-challenge breakdowns (e.g., clutter, lighting, speed). GMNN and BimodalSegNet lead across all degradation axes, with GMNN notably outperforming prior art in event-accuracy and boundary precision (Kachole et al., 2023, Kachole et al., 2023).

5. Data Format, Access, and Licensing

The canonical ESD releases provide full documentation for code, dataset, and evaluation scripts:

Data organization: Separate directories per split (train/test/val), with APS frames (PNG), per-pixel instance masks (PNG), and event streams (binary or numpy for $(x, y, t, p)$ tuples). Auxiliary files may include calibration parameters and time-indexing for alignment.
Download: The ESD dataset and code are available publicly via GitHub (https://github.com/KingstonARC/ESD_Dataset; access subject to paper publication).
License: Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0). Only non-commercial research use is permitted; redistribution and publication require citation of the original ESD papers (Kachole et al., 2023).

6. Significance, Limitations, and Extensions

The ESD family defines the state of the art in event-based segmentation benchmarking for robotic vision. Key advances include:

Systematic degradation challenges: ESD is the first to provide 5-axis controlled degradations (occlusion, lighting, speed, scale, trajectory) for robotic grasping settings (Kachole et al., 2023).
Fine alignment and annotation: ESD provides dense, manually reviewed instance masks for both frames and events, with protocols for transferring ground truth between the two modalities.
Cross-modal and cross-domain evaluation: The known/unknown object split and fusion baselines enable direct tests of zero-shot generalization.
Robust evaluation metrics: Consistent adoption of mIoU and event/graph-accuracy metrics facilitates fair comparisons across methods and input types (Kachole et al., 2023, Kachole et al., 2023).

A plausible implication is that ESD's experimental design—systematic degradations, dual-modality alignment, and zero-shot splits—can serve as a template for next-generation event-based benchmarks in vision-related domains.

Current limitations include restriction to a modest number of object categories, focus on controlled robotic-tabletop settings, and the absence of nighttime or highly unstructured outdoor scenes, which remain open research directions. Other event-based segmentation datasets described in the literature—such as (Zhou et al., 2020, Huang et al., 2023), or synthetic frameworks like SEVD (Aliminati et al., 2024)—extend the ESD methodology to additional domains or sensor configurations, but may lack the same level of frame-event ground-truth alignment and instance-level annotations under systematic degradation protocols.