OV-EIS: Open-Vocabulary Event Instance Segmentation
- OV-EIS is a novel task that segments spatiotemporal event data and assigns language-based open-set labels for flexible semantic scene interpretation.
- The SEAL framework integrates an EventSAM backbone with multimodal fusion and hierarchical semantic guidance to achieve state-of-the-art segmentation and classification metrics.
- Extensions such as SEAL⁺⁺ enable prompt-free, real-time detection and segmentation, reducing computational costs while enhancing performance in robotic applications.
Open-Vocabulary Event Instance Segmentation (OV-EIS) denotes the task of segmenting instances in event camera streams and assigning open-set, language-grounded labels to each predicted mask. In OV-EIS, the system receives spatiotemporal event data from event-based sensors and a set of arbitrary text queries describing potential categories or parts. The objective is to produce segmentation masks delineating individual event instances and to select, for each instance, a semantically consistent label from the potentially unbounded vocabulary of input queries. This is performed without restricting classification to a fixed, pre-defined taxonomy, enabling both semantic and part-level scene understanding at arbitrary levels of granularity (Lee et al., 30 Jan 2026).
1. Formal Task Definition and Notation
Given a stream from an event camera as $\mathcal{E} = \{(x_i, y_i, t_i, p_i)\}_{i=1}^{N}$, where $(x_i, y_i)$ are spatial coordinates, $t_i$ is the timestamp, and $p_i$ represents polarity, events are aggregated into a regular representation $V$ (commonly using spatiotemporal voxel grids, frame reconstructions, or spike encoding). The system is provided with a set of free-form language queries $\mathcal{Q} = \{q_j\}_{j=1}^{J}$ and visual prompts $P$ (points or boxes). The OV-EIS task is to predict a binary mask $M_k$ for each instance and assign it a label $\hat{y}_k \in \mathcal{Q}$ from the open vocabulary of queries.
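The aggregation step can be illustrated with a minimal voxel-grid sketch. It is not the authors' implementation: the function name `events_to_voxel_grid`, the ±1 polarity encoding, and the hard (nearest-bin) temporal assignment are assumptions made for illustration, and coordinates are assumed to lie inside the sensor resolution.

```python
import numpy as np

def events_to_voxel_grid(x, y, t, p, height, width, num_bins=3):
    """Accumulate signed polarities into a (num_bins, height, width) grid.

    Hypothetical helper: assumes polarity is encoded as +1 / -1 and that
    every (x, y) lies inside the sensor resolution.
    """
    x = np.asarray(x, dtype=np.int64)
    y = np.asarray(y, dtype=np.int64)
    t = np.asarray(t, dtype=np.float64)
    p = np.asarray(p, dtype=np.float32)
    grid = np.zeros((num_bins, height, width), dtype=np.float32)
    if t.size == 0:
        return grid
    # Normalise timestamps to [0, 1] over the window, then map to a bin index.
    span = max(t.max() - t.min(), 1e-9)
    t_norm = (t - t.min()) / span
    bins = np.minimum((t_norm * num_bins).astype(np.int64), num_bins - 1)
    # np.add.at performs an unbuffered add, so repeated indices accumulate.
    np.add.at(grid, (bins, y, x), p)
    return grid

# Four toy events inside a 25 ms window on a 2x3 sensor.
grid = events_to_voxel_grid(
    x=[0, 1, 1, 2], y=[0, 0, 0, 1],
    t=[0.0, 0.010, 0.011, 0.025], p=[1, -1, -1, 1],
    height=2, width=3,
)
print(grid.shape)  # (3, 2, 3)
```

Hard binning keeps the sketch short; implementations often spread each event over neighbouring temporal bins with linear weights instead.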
The functional pipeline is as follows:
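- $\{M_k\}_{k=1}^{K} = \mathrm{MaskGen}(V, P)$ (class-agnostic mask generation from the event representation $V$ and the visual prompts $P$)
- $s_{k,j} = \cos(f_k, e_j)$ (similarity scoring between the feature $f_k$ of mask $M_k$ and the embedding $e_j$ of query $q_j$)
- $\hat{y}_k = q_{j^*}$ with $j^* = \arg\max_j s_{k,j}$ (selecting the top-scoring query for each segment)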
The core requirement of OV-EIS is that mask generation is class-agnostic, and label assignment leverages open-vocabulary classification through cross-modal feature alignment.
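The scoring and selection steps can be sketched as follows. The embeddings here are toy 2-D vectors; in practice they would come from the model's mask features and a text encoder. The function name `assign_open_vocab_labels` is hypothetical.

```python
import numpy as np

def assign_open_vocab_labels(mask_embeddings, query_embeddings, queries):
    """Return the best-matching query per mask and the full score matrix.

    mask_embeddings: (K, D) array, one row per predicted mask.
    query_embeddings: (J, D) array, one row per text query.
    """
    m = np.asarray(mask_embeddings, dtype=np.float64)
    q = np.asarray(query_embeddings, dtype=np.float64)
    # L2-normalise rows so the dot product equals cosine similarity.
    m = m / np.maximum(np.linalg.norm(m, axis=1, keepdims=True), 1e-12)
    q = q / np.maximum(np.linalg.norm(q, axis=1, keepdims=True), 1e-12)
    scores = m @ q.T                 # (K, J) cosine similarities
    best = scores.argmax(axis=1)     # index of the top-scoring query per mask
    return [queries[j] for j in best], scores

labels, scores = assign_open_vocab_labels(
    mask_embeddings=[[1.0, 0.0], [0.1, 0.9]],
    query_embeddings=[[0.9, 0.1], [0.0, 1.0]],
    queries=["car", "pedestrian"],
)
print(labels)  # ['car', 'pedestrian']
```

Because the label set is just the list of queries passed at call time, changing the vocabulary requires no retraining of the mask generator, which is the practical meaning of class-agnostic masks with open-vocabulary classification.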
2. The SEAL Framework for OV-EIS
The SEAL ("Semantic-aware Segment Any Events with Language") architecture is designed to unify event-domain instance segmentation and open-vocabulary classification via a parameter-efficient model. The main components are:
- Input Representation: Events are discretized into voxel grids ($V \in \mathbb{R}^{B \times H \times W}$ with $B = 3$ temporal bins), sampled over fixed windows (e.g., 25 ms for DSEC, 15 ms for DDD17). SEAL does not require frame-level reconstruction at inference and operates directly on event voxel grids for computational efficiency.
- Backbone: SEAL uses a ViT-B-based EventSAM backbone ($\mathcal{F}$), pretrained and then frozen, to encode the voxel grid $V$ into a token map $F$. The mask decoder, adapted from SAM, generates class-agnostic binary masks based on visual prompts $P$.
- Multimodal Hierarchical Semantic Guidance (MHSG): During training, hierarchical supervision is derived from paired RGB images and captions generated at three semantic levels: semantic, instance, and part. Visual guidance employs SAM to segment images; CLIP is used to extract visual and text features for alignment.
- Fusion Network: A multimodal fusion network is layered atop the backbone and consists of:
- a backbone feature enhancer (Transformer layers with cross-attention to text anchors),
- a spatial encoding module combining mask tokens and pooled semantic features,
- and a mask feature enhancer using cross-attention masked to the predicted region.
- Classification Head: For each candidate mask, cosine similarity is computed between the mask’s feature embedding and embeddings of user-provided queries, with the maximum determining the mask’s label.
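The idea of cross-attention restricted to the predicted region can be illustrated with a single-head sketch. This is a simplification under stated assumptions, not SEAL's module: there are no learned projections, one query vector stands in for the mask token, and the name `masked_cross_attention` is hypothetical.

```python
import numpy as np

def masked_cross_attention(query, tokens, mask):
    """Attend from one query vector to the tokens selected by a boolean mask.

    query: (D,) vector, tokens: (N, D) token map flattened over space,
    mask: (N,) booleans marking tokens inside the predicted region.
    """
    query = np.asarray(query, dtype=np.float64)
    tokens = np.asarray(tokens, dtype=np.float64)
    mask = np.asarray(mask, dtype=bool)
    if not mask.any():
        raise ValueError("mask selects no tokens")
    logits = tokens @ query / np.sqrt(query.shape[0])
    # Tokens outside the region get -inf so their softmax weight is exactly 0.
    logits = np.where(mask, logits, -np.inf)
    weights = np.exp(logits - logits[mask].max())
    weights /= weights.sum()
    return weights @ tokens

tokens = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
inside = [True, True, False]            # third token lies outside the mask
print(masked_cross_attention([0.0, 0.0], tokens, inside))  # [0.5 0.5]
```

The third token never contributes to the output, however large its values, which is the property that keeps a mask's feature from being contaminated by neighbouring instances.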
3. Training Paradigm and Loss Functions
SEAL employs a two-stage training regime:
- Stage 1: EventSAM pretraining via distillation from SAM image features, using mixed event-image pairs (no human annotations). The backbone is pretrained using cross-entropy and Dice losses.
- Stage 2: Multimodal fusion training (MHSG) aligns event-based mask features to both CLIP-derived visual features and text features across three levels (semantic, instance, part). The objective is to minimize a cosine-similarity alignment loss over these levels, where $\cos(\cdot, \cdot)$ is the cosine similarity, $\lambda$ weights text/visual alignment, and $N_l$ is the number of masks at level $l$.
After stage 1, the EventSAM backbone and decoder are frozen; only fusion modules and projection layers are updated in stage 2 (Lee et al., 30 Jan 2026).
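One plausible form of the stage-2 alignment objective is sketched below. The exact expression is not given in this text, so the convex combination `(1 - lam) * visual + lam * text`, the per-level averaging over the number of masks, and the name `hierarchical_alignment_loss` are all assumptions chosen to match the description (cosine similarity, a text/visual weight, and a per-level mask count).

```python
import numpy as np

def cosine(a, b):
    """Row-wise cosine similarity between two (N, D) arrays."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    num = (a * b).sum(axis=-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1)
    return num / np.maximum(den, 1e-12)

def hierarchical_alignment_loss(levels, lam=0.5):
    """Sum over levels of the mean (1 - cosine) alignment cost.

    levels maps a level name to (mask_feats, visual_feats, text_feats),
    each of shape (N_l, D). lam is an assumed text/visual balance weight.
    """
    total = 0.0
    for mask_f, vis_f, txt_f in levels.values():
        n_l = len(mask_f)
        if n_l == 0:
            continue  # a level without masks contributes nothing
        vis_term = (1.0 - cosine(mask_f, vis_f)).sum()
        txt_term = (1.0 - cosine(mask_f, txt_f)).sum()
        total += ((1.0 - lam) * vis_term + lam * txt_term) / n_l
    return float(total)

loss = hierarchical_alignment_loss({
    "semantic": ([[1.0, 0.0]], [[1.0, 0.0]], [[0.0, 1.0]]),
})
print(loss)  # 0.5
```

In the toy call the mask feature matches its visual target exactly and is orthogonal to its text target, so only the text term contributes.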
4. Benchmark Datasets and Evaluation Protocols
Four OV-EIS benchmarks were curated from the DDD17-Seg and DSEC-Semantic datasets by generating SAM masks and assigning them semantic or part-level labels. Benchmarks differ in class granularity, sequence count, spatial resolution, and segmentation level:
| Benchmark | Classes | Events/Seq | Resolution | Levels |
|---|---|---|---|---|
| DDD17-Ins | 5 | 3,890 | 352×200 | semantic + instance |
| DSEC11-Ins | 7 | 2,809 | 640×440 | semantic + instance |
| DSEC19-Ins | 14 | 2,809 | 640×440 | semantic + instance |
| DSEC-Part | 9 (parts) | 2,809 | 640×440 | part-level only |
The principal metric is Average Precision (AP), computed as the mean of AP@τ over a range of IoU thresholds τ, where AP@τ is the area under the precision-recall curve at threshold τ. Auxiliary reporting includes AP50 and AP25 (AP at IoU thresholds of 0.50 and 0.25).
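The metric can be sketched for a single class as follows. This is a simplified illustration, not the benchmark's evaluation code: matching is greedy in descending score order, the precision-recall area is computed without interpolation, and the threshold list passed to `mean_ap` is supplied by the caller because the exact set is not stated here. All function names are hypothetical.

```python
import numpy as np

def mask_iou(a, b):
    """Intersection over union of two boolean masks of equal shape."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 0.0

def average_precision(pred_masks, scores, gt_masks, iou_thr):
    """AP at one IoU threshold with greedy one-to-one matching."""
    if len(gt_masks) == 0:
        return 0.0
    matched, hits = set(), []
    for i in np.argsort(scores)[::-1]:        # highest score first
        best_iou, best_j = 0.0, -1
        for j, gt in enumerate(gt_masks):
            if j in matched:
                continue
            iou = mask_iou(pred_masks[i], gt)
            if iou > best_iou:
                best_iou, best_j = iou, j
        if best_j >= 0 and best_iou >= iou_thr:
            matched.add(best_j)
            hits.append(1.0)                  # true positive
        else:
            hits.append(0.0)                  # false positive
    hits = np.array(hits)
    precision = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    # Each true positive raises recall by 1 / num_gt.
    return float((precision * hits).sum() / len(gt_masks))

def mean_ap(pred_masks, scores, gt_masks, thresholds):
    """Mean of AP@tau over the supplied IoU thresholds."""
    return float(np.mean([average_precision(pred_masks, scores, gt_masks, t)
                          for t in thresholds]))

gt = np.zeros((4, 4), dtype=bool); gt[:, :2] = True      # 8 pixels
pred = np.zeros((4, 4), dtype=bool); pred[:, :3] = True  # 12 pixels, IoU = 8/12
print(mean_ap([pred], [0.9], [gt], thresholds=[0.5, 0.75]))  # 0.5
```

The toy prediction counts as correct at τ = 0.5 but not at τ = 0.75, so the mean over the two thresholds is 0.5. A multi-class evaluation would typically repeat this per class and average the results.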
5. Empirical Performance and Efficiency
SEAL achieves state-of-the-art results compared to prior open-vocabulary or event segmentation methods. On the DDD17-Ins, DSEC11-Ins, DSEC19-Ins, and DSEC-Part benchmarks, SEAL demonstrates gains in AP, AP50, and AP25 over all provided baselines, as summarized:
| Benchmark | Metric | Best Baseline | SEAL | Δ | Time (ms) | Params (M) |
|---|---|---|---|---|---|---|
| DDD17-Ins | AP | 29.8 (OpenSeg) | 32.3 | +2.5 | 22.3 | 99.1 |
| DSEC11-Ins | AP50 | 40.3 (frame2recon) | 55.1 | +14.8 | – | – |
| DSEC19-Ins | AP25 | 26.9 (frame2recon) | 36.3 | +9.4 | – | – |
| DSEC-Part | AP | 12.9 (VLPart) | 13.6 | +0.7 | – | – |
SEAL runs at about 45 FPS (22.3 ms per frame) on a single RTX 6000 Ada with 99.1M parameters, substantially fewer than the 300–530M parameters of dual-backbone alternatives.
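The throughput and size figures follow from simple arithmetic on the numbers reported above; the helper name `latency_ms_to_fps` is hypothetical.

```python
def latency_ms_to_fps(latency_ms):
    """Convert per-frame latency in milliseconds to frames per second."""
    return 1000.0 / latency_ms

fps = latency_ms_to_fps(22.3)
print(round(fps, 1))            # 44.8, reported as roughly 45 FPS

# Parameter ratio of the 300-530M dual-backbone alternatives to SEAL's 99.1M.
seal_params_m = 99.1
print(round(300 / seal_params_m, 1), round(530 / seal_params_m, 1))  # 3.0 5.3
```

On these figures the alternatives carry roughly three to five times as many parameters.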
6. Prompt-Free OV-EIS and SEAL⁺⁺
SEAL⁺⁺ extends SEAL to prompt-free, generic, spatiotemporal OV-EIS. It augments the backbone with a lightweight, class-agnostic event detector (RVT+FPN). At inference, detected boxes provide automatic prompts for mask decoding and classification. SEAL⁺⁺ is trained with IoU loss, objectness BCE, and regression on DSEC-Detection and achieves:
| Method | AP | AP50 | AP75 | Time (ms) |
|---|---|---|---|---|
| DEOE+Best Baseline | 16.3 | 21.8 | 18.5 | 400 |
| DEOE+SEAL | 16.4 | 22.1 | 18.8 | 28 |
| SEAL⁺⁺ | 17.8 | 23.6 | 19.2 | 24 |
SEAL⁺⁺ outperforms naive detection+segmentation pipelines and prior closed-set active object detection systems while retaining real-time performance.
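The prompt-free control flow (detector boxes used as automatic prompts, followed by mask decoding and classification) can be sketched with injected callables. Everything here is hypothetical scaffolding: `detect`, `segment`, and `classify` stand in for the detector, mask decoder, and classification head, and the `min_objectness` filter is an assumed parameter, not one documented for SEAL⁺⁺.

```python
def prompt_free_segment(events, queries, detect, segment, classify,
                        min_objectness=0.5):
    """Run detection, then use each kept box as a prompt for segmentation.

    detect(events) -> iterable of (box, objectness)
    segment(events, box) -> mask
    classify(events, mask, queries) -> (label, score)
    """
    results = []
    for box, objectness in detect(events):
        if objectness < min_objectness:
            continue  # drop low-confidence, class-agnostic proposals
        mask = segment(events, box)
        label, score = classify(events, mask, queries)
        results.append({"box": box, "mask": mask,
                        "label": label, "score": score})
    return results

# Stub components, for illustration only.
def fake_detect(events):
    return [((0, 0, 2, 2), 0.9), ((1, 1, 3, 3), 0.2)]

def fake_segment(events, box):
    return ("mask-for", box)

def fake_classify(events, mask, queries):
    return queries[0], 1.0

out = prompt_free_segment(None, ["car"], fake_detect, fake_segment, fake_classify)
print(len(out), out[0]["label"])  # 1 car
```

Passing the three stages as callables keeps the sketch independent of any particular detector or decoder; a jointly trained system like the one described would share backbone features across them rather than call them as separate models.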
7. Significance and Future Prospects
SEAL establishes a rigorous formulation for OV-EIS, demonstrating that a single, parameter-efficient backbone can integrate event-based instance segmentation and open-vocabulary mask classification at multiple semantic granularities. Its empirical advances in accuracy, inference speed, and model parsimony suggest that OV-EIS is tractable at scale and for real-time robotics applications. The introduction of hierarchical semantic supervision and prompt-free extensions underscores the extensibility and generalizability of the framework. A plausible implication is that future research will build on open-vocabulary approaches for embodied event perception in unconstrained environments (Lee et al., 30 Jan 2026).