OV-EIS: Open-Vocabulary Event Instance Segmentation

Updated 6 February 2026

OV-EIS is the task of segmenting instances in spatiotemporal event-camera data and assigning each one an open-set, language-based label.
The SEAL framework combines an EventSAM backbone with multimodal fusion and hierarchical semantic guidance, and reports higher AP than prior baselines on four benchmarks.
The SEAL⁺⁺ extension adds prompt-free detection and segmentation while remaining real-time (a reported 24 ms per frame), which suits robotic applications.

Open-Vocabulary Event Instance Segmentation (OV-EIS) denotes the task of segmenting instances in event camera streams and assigning open-set, language-grounded labels to each predicted mask. In OV-EIS, the system receives spatiotemporal event data from event-based sensors and a set of arbitrary text queries describing potential categories or parts. The objective is to produce segmentation masks delineating individual event instances and to select, for each instance, a semantically consistent label from the potentially unbounded vocabulary of input queries. This is performed without restricting classification to a fixed, pre-defined taxonomy, enabling both semantic and part-level scene understanding at arbitrary levels of granularity (Lee et al., 30 Jan 2026).

1. Formal Task Definition and Notation

An event camera stream is written as $e = \{(x_i, y_i, t_i, p_i)\}_{i=1}^N$, where $(x_i, y_i)$ are spatial coordinates, $t_i$ is the timestamp, and $p_i \in \{+1, -1\}$ is the polarity. Events are aggregated into a regular representation $I_{\rm evt} \in \mathbb{R}^{C \times H \times W}$ (commonly a spatiotemporal voxel grid, a frame reconstruction, or a spike encoding). The system is also given a set of free-form language queries $\{q_j\}_{j=1}^M$ and visual prompts $P$ (points or boxes). The OV-EIS task is to predict $K$ binary instance masks $\{M_k \in \{0, 1\}^{H \times W}\}_{k=1}^K$ and to assign each mask a label $c_k$ from the open vocabulary of queries.
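As an illustration of the aggregation step, the following is a minimal sketch of building a voxel grid from raw events. It assumes a simplified scheme in which each event is assigned to a single temporal bin and polarities are summed per cell; the function name `events_to_voxel_grid` and the toy events are hypothetical, and this is not necessarily the exact encoding used by any particular method.

```python
import numpy as np

def events_to_voxel_grid(x, y, t, p, num_bins, height, width):
    """Accumulate signed polarities into a (num_bins, H, W) grid.

    Simplified assumption: each event falls into exactly one temporal bin
    (no interpolation between neighbouring bins).
    """
    x = np.asarray(x, dtype=np.int64)
    y = np.asarray(y, dtype=np.int64)
    t = np.asarray(t, dtype=np.float64)
    p = np.asarray(p, dtype=np.float32)

    grid = np.zeros((num_bins, height, width), dtype=np.float32)
    if t.size == 0:
        return grid

    # Normalise timestamps to [0, 1] over the window, then map to bin indices.
    span = t.max() - t.min()
    t_norm = (t - t.min()) / span if span > 0 else np.zeros_like(t)
    b = np.minimum((t_norm * num_bins).astype(np.int64), num_bins - 1)

    # np.add.at accumulates without buffering, so events that hit the same
    # (bin, y, x) cell are summed rather than overwritten.
    np.add.at(grid, (b, y, x), p)
    return grid

# Toy stream: four events inside a 25 ms window on a 2x4 sensor.
x = [0, 1, 1, 3]
y = [0, 0, 0, 1]
t = [0.0, 0.010, 0.011, 0.025]
p = [1, -1, -1, 1]
grid = events_to_voxel_grid(x, y, t, p, num_bins=3, height=2, width=4)
print(grid.shape)  # (3, 2, 4)
```

Here $C$ corresponds to `num_bins`; the two negative events at pixel $(1, 0)$ land in the same bin and accumulate to $-2$.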

The functional pipeline is as follows:

$\{M_k\}_{k=1}^K = \text{MaskGen}(I_{\rm evt}, P)$ (class-agnostic mask generation)
$s_k = \text{ClassScore}(\text{MaskFeat}(M_k, I_{\rm evt}), \text{TextEmbed}(\{q_j\}))$ (similarity scoring)
$c_k = \arg\max_j s_k[j]$ (selecting the top scoring query for each segment)

The core requirement of OV-EIS is that mask generation is class-agnostic, and label assignment leverages open-vocabulary classification through cross-modal feature alignment.
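The scoring and selection steps of the pipeline can be sketched as follows. This assumes that mask features and text embeddings already live in a shared $D$-dimensional space and that the score is cosine similarity; the helper names (`l2_normalize`, `assign_labels`) and the toy embeddings are illustrative, not part of any published API.

```python
import numpy as np

def l2_normalize(a, eps=1e-8):
    """Scale each row to (approximately) unit length."""
    return a / (np.linalg.norm(a, axis=-1, keepdims=True) + eps)

def assign_labels(mask_feats, text_embeds, queries):
    """mask_feats: (K, D), text_embeds: (M, D) -> (K labels, (K, M) scores)."""
    # s_k[j] = cosine similarity between mask k and query j.
    scores = l2_normalize(mask_feats) @ l2_normalize(text_embeds).T
    # c_k = argmax_j s_k[j]
    best = scores.argmax(axis=1)
    return [queries[j] for j in best], scores

queries = ["car", "pedestrian", "traffic sign"]
text_embeds = np.eye(3)                      # toy stand-in for TextEmbed({q_j})
mask_feats = np.array([[0.9, 0.1, 0.0],      # toy stand-in for MaskFeat(M_k, I_evt)
                       [0.0, 0.2, 0.8]])
labels, scores = assign_labels(mask_feats, text_embeds, queries)
print(labels)  # ['car', 'traffic sign']
```

Because the queries enter only through their embeddings, the vocabulary can be changed at inference time without touching the mask generator.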

2. The SEAL Framework for OV-EIS

The SEAL ("Semantic-aware Segment Any Events with Language") architecture is designed to unify event-domain instance segmentation and open-vocabulary classification via a parameter-efficient model. The main components are:

Input Representation: Events are discretized into voxel grids ($I_{\rm evt} \in \mathbb{R}^{3 \times H \times W}$, with three temporal bins), sampled over fixed windows (e.g., 25 ms for DSEC, 15 ms for DDD17). SEAL does not require frame-level reconstruction at inference and operates directly on event voxel grids for computational efficiency.
Backbone: SEAL uses a ViT-B-based EventSAM backbone ($F_{\rm evt}$), pretrained and then frozen, to encode $I_{\rm evt}$ into a token map $T_{\rm evt} \in \mathbb{R}^{D \times H/32 \times W/32}$. The mask decoder, adapted from SAM, generates class-agnostic binary masks $\{M_k\}$ based on visual prompts $P$.
Multimodal Hierarchical Semantic Guidance (MHSG): During training, hierarchical supervision is derived from paired RGB images and captions generated at three semantic levels: semantic, instance, and part. Visual guidance employs SAM to segment images; CLIP is used to extract visual and text features for alignment.
Fusion Network: A multimodal fusion network is layered atop the backbone and consists of:
- a backbone feature enhancer (Transformer layers with cross-attention to text anchors),
- a spatial encoding module combining mask tokens and pooled semantic features,
- and a mask feature enhancer using cross-attention masked to the predicted region.
Classification Head: For each candidate mask, cosine similarity is computed between the mask’s feature embedding and embeddings of user-provided queries, with the maximum determining the mask’s label.
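To make the idea of cross-attention masked to the predicted region concrete, here is a minimal single-head sketch. It assumes one query vector per mask and omits learned projections, multiple heads, and residual connections, so it illustrates the masking mechanism only and is not the SEAL implementation; `masked_cross_attention` and the toy tokens are hypothetical.

```python
import numpy as np

def masked_cross_attention(query, tokens, mask):
    """query: (D,), tokens: (N, D), mask: (N,) bool -> (attended (D,), weights (N,)).

    Tokens outside the mask get a logit of -inf, hence zero attention weight.
    """
    mask = np.asarray(mask, dtype=bool)
    if not mask.any():
        raise ValueError("mask must select at least one token")
    d = query.shape[-1]
    logits = tokens @ query / np.sqrt(d)        # scaled dot-product scores
    logits = np.where(mask, logits, -np.inf)    # restrict to the predicted region
    logits = logits - logits[mask].max()        # subtract max for numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum()
    return weights @ tokens, weights

tokens = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [5.0, 5.0]])   # the third token lies outside the mask
query = np.array([1.0, 0.0])
out, weights = masked_cross_attention(query, tokens, [True, True, False])
print(weights[2])  # 0.0
```

The out-of-mask token has a large dot product with the query, yet contributes nothing to the output, which is the point of restricting attention to the mask.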

3. Training Paradigm and Loss Functions

SEAL employs a two-stage training regime:

Stage 1: EventSAM pretraining via distillation from SAM image features, using mixed event-image pairs (no human annotations). The backbone is pretrained using cross-entropy and Dice losses.
Stage 2: Multimodal fusion training (MHSG) aligns event-based mask features to both CLIP-derived visual features and text features across three levels (semantic, instance, part). The objective is to minimize

$\mathcal{L}_{\rm distill} = \sum_{l\in\{s,i,p\}} \sum_{k=1}^{K_l} \left[ (1-\cos(m_{l,k}^e, v_{l,k}^I)) + \alpha (1-\cos(m_{l,k}^e, v_{l,k}^T)) \right]$

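where $m_{l,k}^e$ is the event-based feature of mask $k$ at level $l$, $v_{l,k}^I$ and $v_{l,k}^T$ are the corresponding CLIP visual and text features, $\cos(a, b)$ is the cosine similarity, $\alpha$ weights the text-alignment term relative to the visual-alignment term, and $K_l$ is the number of masks at level $l$. After stage 1, the EventSAM backbone and decoder are frozen; only the fusion modules and projection layers are updated in stage 2 (Lee et al., 30 Jan 2026).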
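The stage-2 objective can be evaluated numerically as in the sketch below. It follows the formula for $\mathcal{L}_{\rm distill}$ term by term; the dictionary layout, the function names, and the toy features are assumptions made for illustration, and a real training loop would use an autodiff framework rather than NumPy.

```python
import numpy as np

def cosine(a, b, eps=1e-8):
    """Row-wise cosine similarity between two (K, D) arrays."""
    num = (a * b).sum(axis=-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + eps
    return num / den

def distill_loss(levels, alpha=1.0):
    """levels maps a level name to (m_e, v_img, v_txt), each of shape (K_l, D)."""
    total = 0.0
    for m_e, v_img, v_txt in levels.values():
        visual_term = 1.0 - cosine(m_e, v_img)   # alignment to CLIP visual features
        text_term = 1.0 - cosine(m_e, v_txt)     # alignment to CLIP text features
        total += (visual_term + alpha * text_term).sum()
    return float(total)

# One mask at the semantic level: aligned with the visual feature,
# orthogonal to the text feature.
m_e = np.array([[1.0, 0.0]])
v_img = np.array([[2.0, 0.0]])
v_txt = np.array([[0.0, 3.0]])
loss = distill_loss({"s": (m_e, v_img, v_txt)}, alpha=0.5)
print(round(loss, 6))  # 0.5
```

The visual term vanishes because cosine similarity ignores vector length, leaving only the text term weighted by $\alpha$.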

4. Benchmark Datasets and Evaluation Protocols

Four OV-EIS benchmarks were curated from the DDD17-Seg and DSEC-Semantic datasets by generating SAM masks and assigning them semantic or part-level labels. Benchmarks differ in class granularity, sequence count, spatial resolution, and segmentation level:

Benchmark	Classes	Events/Seq	Resolution	Levels
DDD17-Ins	5	3,890	352×200	semantic + instance
DSEC11-Ins	7	2,809	640×440	semantic + instance
DSEC19-Ins	14	2,809	640×440	semantic + instance
DSEC-Part	9 (parts)	2,809	640×440	part-level only

The principal metric is Average Precision (AP). AP@τ is the area under the precision-recall curve at mask-IoU threshold τ, and AP is the mean of AP@τ over $\tau \in \{0.50, 0.55, \ldots, 0.95\}$. AP50 and AP25 (AP@τ at τ = 0.5 and τ = 0.25) are reported as auxiliary metrics.
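A simplified version of this metric for a single image and a single class can be sketched as follows. It assumes greedy matching of score-sorted predictions to ground-truth masks and a non-interpolated precision-recall area; standard evaluation toolkits add interpolation and aggregate over classes and images, so the numbers are illustrative only. All function names are hypothetical.

```python
import numpy as np

IOU_THRESHOLDS = np.linspace(0.5, 0.95, 10)   # 0.50, 0.55, ..., 0.95

def mask_iou(a, b):
    """Intersection over union of two boolean masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union > 0 else 0.0

def ap_at_threshold(preds, scores, gts, thr):
    """Non-interpolated AP at one IoU threshold (greedy one-to-one matching)."""
    if len(gts) == 0:
        return 0.0
    matched, tp = set(), []
    for i in np.argsort(scores)[::-1]:          # highest-scoring prediction first
        best_j, best_iou = -1, thr
        for j, g in enumerate(gts):
            iou = mask_iou(preds[i], g)
            if j not in matched and iou >= best_iou:
                best_j, best_iou = j, iou
        if best_j >= 0:
            matched.add(best_j)
        tp.append(1.0 if best_j >= 0 else 0.0)
    tp = np.array(tp)
    precision = np.cumsum(tp) / (np.arange(len(tp)) + 1)
    # Sum precision at each true-positive rank, normalised by the number of GTs.
    return float((precision * tp).sum() / len(gts))

def mean_ap(preds, scores, gts):
    return float(np.mean([ap_at_threshold(preds, scores, gts, t)
                          for t in IOU_THRESHOLDS]))

gt = np.zeros((4, 4), dtype=bool)
gt[0, :3] = True                     # 3-pixel ground-truth instance
partial = np.zeros((4, 4), dtype=bool)
partial[0, :2] = True                # covers 2 of the 3 pixels: IoU = 2/3
print(mean_ap([gt], [0.9], [gt]))       # 1.0
print(mean_ap([partial], [0.9], [gt]))  # 0.4 (matches at 0.50 to 0.65 only)
```

The second case shows why AP is stricter than AP50: a mask with IoU of about 0.67 counts as a full hit at τ = 0.5 but fails the six thresholds from 0.70 upward.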

5. Empirical Performance and Efficiency

SEAL reports higher AP, AP50, and AP25 than all compared open-vocabulary and event segmentation baselines on the DDD17-Ins, DSEC11-Ins, DSEC19-Ins, and DSEC-Part benchmarks. The table lists one metric per benchmark, with inference time and parameter count given for DDD17-Ins:

Benchmark	Metric	Best Baseline	SEAL	Δ	Time (ms)	Params (M)
DDD17-Ins	AP	29.8 (OpenSeg)	32.3	+2.5	22.3	99.1
DSEC11-Ins	AP50	40.3 (frame2recon)	55.1	+14.8	–	–
DSEC19-Ins	AP25	26.9 (frame2recon)	36.3	+9.4	–	–
DSEC-Part	AP	12.9 (VLPart)	13.6	+0.7	–	–

SEAL runs at about 45 FPS (22.3 ms per frame) on a single RTX 6000 Ada and has 99.1M parameters. This is substantially fewer than the 300–530M parameters of dual-backbone alternatives.

6. Prompt-Free OV-EIS and SEAL⁺⁺

SEAL⁺⁺ extends SEAL to prompt-free, generic, spatiotemporal OV-EIS. It augments the backbone with a lightweight, class-agnostic event detector (RVT+FPN). At inference, the detected boxes serve as automatic prompts for mask decoding and classification. SEAL⁺⁺ is trained on DSEC-Detection with IoU, objectness BCE, and regression losses, and achieves:

Method	AP	AP50	AP75	Time (ms)
DEOE+Best Baseline	16.3	21.8	18.5	400
DEOE+SEAL	16.4	22.1	18.8	28
SEAL⁺⁺	17.8	23.6	19.2	24

SEAL⁺⁺ outperforms naive detection+segmentation pipelining (17.8 vs. 16.4 AP for DEOE+SEAL, at 24 ms vs. 28 ms) and prior closed-set active object detection systems, while retaining real-time performance.
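The prompt-free control flow, in which detector boxes replace user-supplied prompts, can be sketched as below. The three callables (`detect`, `decode_mask`, `classify`), the objectness threshold, and the stub implementations are all hypothetical placeholders for the detector, mask decoder, and classification head; they only demonstrate how the stages connect.

```python
import numpy as np

def prompt_free_segment(voxel, detect, decode_mask, classify, queries,
                        min_objectness=0.5):
    """Use class-agnostic detections as box prompts, then segment and label."""
    results = []
    for box, objectness in detect(voxel):
        if objectness < min_objectness:
            continue                              # drop low-confidence boxes
        mask = decode_mask(voxel, box)            # box acts as the visual prompt
        label = classify(voxel, mask, queries)    # open-vocabulary label
        results.append({"box": box, "mask": mask, "label": label})
    return results

# Stub components so the sketch runs end to end (assumptions, not real models).
def detect(voxel):
    return [((0, 0, 2, 2), 0.9), ((1, 1, 3, 3), 0.2)]   # (x0, y0, x1, y1), score

def decode_mask(voxel, box):
    x0, y0, x1, y1 = box
    mask = np.zeros(voxel.shape[-2:], dtype=bool)
    mask[y0:y1, x0:x1] = True                     # fill the box as a fake mask
    return mask

def classify(voxel, mask, queries):
    return queries[0]                             # always picks the first query

voxel = np.zeros((3, 4, 4), dtype=np.float32)
results = prompt_free_segment(voxel, detect, decode_mask, classify, ["car", "person"])
print(len(results))  # 1
```

Only the first stub detection passes the objectness filter, so a single mask is decoded and labelled.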

7. Significance and Future Prospects

SEAL establishes a rigorous formulation for OV-EIS, demonstrating that a single, parameter-efficient backbone can integrate event-based instance segmentation and open-vocabulary mask classification at multiple semantic granularities. Its reported gains in accuracy, inference speed, and model size suggest that OV-EIS is tractable at scale and for real-time robotics applications. The hierarchical semantic supervision and the prompt-free extension indicate that the framework is extensible. A plausible implication is that future research will build on open-vocabulary approaches for embodied event perception in unconstrained environments (Lee et al., 30 Jan 2026).

References (1)

Segment Any Events with Language (2026)
