
OV-EIS: Open-Vocabulary Event Instance Segmentation

Updated 6 February 2026
  • OV-EIS is a novel task that segments spatiotemporal event data and assigns language-based open-set labels for flexible semantic scene interpretation.
  • The SEAL framework integrates an EventSAM backbone with multimodal fusion and hierarchical semantic guidance to achieve state-of-the-art segmentation and classification metrics.
  • Extensions such as SEAL++ enable prompt-free, real-time detection and segmentation, reducing computational costs while enhancing performance in robotic applications.

Open-Vocabulary Event Instance Segmentation (OV-EIS) denotes the task of segmenting instances in event camera streams and assigning open-set, language-grounded labels to each predicted mask. In OV-EIS, the system receives spatiotemporal event data from event-based sensors and a set of arbitrary text queries describing potential categories or parts. The objective is to produce segmentation masks delineating individual event instances and to select, for each instance, a semantically consistent label from the potentially unbounded vocabulary of input queries. This is performed without restricting classification to a fixed, pre-defined taxonomy, enabling both semantic and part-level scene understanding at arbitrary levels of granularity (Lee et al., 30 Jan 2026).

1. Formal Task Definition and Notation

Given a stream from an event camera as $e = \{(x_i, y_i, t_i, p_i)\}_{i=1}^N$, where $(x_i, y_i)$ are spatial coordinates, $t_i$ is the timestamp, and $p_i \in \{+1, -1\}$ represents polarity, events are aggregated into a regular representation $I_{\rm evt} \in \mathbb{R}^{C \times H \times W}$ (commonly using spatiotemporal voxel grids, frame reconstructions, or spike encoding). The system is provided with a set of free-form language queries $\{q_j\}_{j=1}^M$ and visual prompts $P$ (points or boxes). The OV-EIS task is to predict binary masks $\{M_k \in \{0, 1\}^{H \times W}\}_{k=1}^K$ for each instance and assign a label $c_k$ from the open vocabulary of queries.
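The aggregation step can be illustrated with a minimal NumPy sketch of one common voxel-grid construction. The function name `events_to_voxel_grid`, the signed-polarity accumulation, and the nearest-bin time assignment are assumptions made for illustration; the source does not specify the exact encoding used.

```python
import numpy as np

def events_to_voxel_grid(events, num_bins, height, width):
    """Accumulate (x, y, t, p) events into a (num_bins, height, width) grid.

    Hypothetical helper: polarities are summed per cell, and each event is
    assigned to a single temporal bin (no interpolation between bins).
    """
    grid = np.zeros((num_bins, height, width), dtype=np.float32)
    if len(events) == 0:
        return grid
    events = np.asarray(events, dtype=np.float64)
    x = events[:, 0].astype(np.int64)
    y = events[:, 1].astype(np.int64)
    t = events[:, 2]
    p = events[:, 3].astype(np.float32)
    # Normalise timestamps to [0, 1], then map to a bin index in [0, num_bins - 1].
    span = t.max() - t.min()
    t_norm = (t - t.min()) / span if span > 0 else np.zeros_like(t)
    b = np.minimum((t_norm * num_bins).astype(np.int64), num_bins - 1)
    # np.add.at accumulates correctly even when several events hit the same cell.
    np.add.at(grid, (b, y, x), p)
    return grid

events = [(0, 0, 0.0, +1), (1, 0, 0.5, -1), (1, 0, 0.6, -1), (2, 1, 1.0, +1)]
voxels = events_to_voxel_grid(events, num_bins=3, height=2, width=3)
print(voxels.shape)  # (3, 2, 3)
```

Here $C$ corresponds to `num_bins`; the two negative events at pixel (1, 0) fall into the same middle bin and sum to -2.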

The functional pipeline is as follows:

  • $M = \text{MaskGen}(I_{\rm evt}, P)$ (class-agnostic mask generation)
  • $s_k = \text{ClassScore}(\text{MaskFeat}(M_k, I_{\rm evt}), \text{TextEmbed}(\{q_j\}))$ (similarity scoring)
  • $c_k = \arg\max_j s_k[j]$ (selecting the top-scoring query for each segment)

The core requirement of OV-EIS is that mask generation is class-agnostic, and label assignment leverages open-vocabulary classification through cross-modal feature alignment.
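The scoring and selection steps of the pipeline can be sketched as follows, assuming mask features and text embeddings already live in a shared embedding space. The function names and the toy three-dimensional embeddings are hypothetical; a real system would obtain these vectors from learned encoders.

```python
import numpy as np

def l2_normalize(v, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return v / (np.linalg.norm(v, axis=-1, keepdims=True) + eps)

def assign_labels(mask_feats, text_feats, queries):
    """mask_feats: (K, D), text_feats: (M, D) -> (K labels, (K, M) score matrix)."""
    scores = l2_normalize(mask_feats) @ l2_normalize(text_feats).T  # s_k[j]
    best = scores.argmax(axis=1)                                    # c_k = argmax_j s_k[j]
    return [queries[j] for j in best], scores

# Toy embeddings (illustrative values only).
queries = ["car", "pedestrian", "traffic sign"]
text_feats = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
mask_feats = np.array([[0.9, 0.1, 0.0], [0.0, 0.2, 0.8]])
labels, scores = assign_labels(mask_feats, text_feats, queries)
print(labels)  # ['car', 'traffic sign']
```

Because the label set is just the list of query strings, changing the vocabulary requires no retraining in this sketch, only new text embeddings.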

2. The SEAL Framework for OV-EIS

The SEAL ("Semantic-aware Segment Any Events with Language") architecture is designed to unify event-domain instance segmentation and open-vocabulary classification via a parameter-efficient model. The main components are:

  • Input Representation: Events are discretized into voxel grids ($I_{\rm evt} \in \mathbb{R}^{3 \times H \times W}$ with three temporal bins), sampled over fixed windows (e.g., 25 ms for DSEC, 15 ms for DDD17). SEAL does not require frame-level reconstruction at inference and operates directly on event voxel grids for computational efficiency.
  • Backbone: SEAL uses a ViT-B-based EventSAM backbone ($F_{\rm evt}$), pretrained and then frozen, to encode $I_{\rm evt}$ into a token map $T_{\rm evt} \in \mathbb{R}^{D \times H/32 \times W/32}$. The mask decoder, adapted from SAM, generates class-agnostic binary masks $\{M_k\}$ based on visual prompts $P$.
  • Multimodal Hierarchical Semantic Guidance (MHSG): During training, hierarchical supervision is derived from paired RGB images and captions generated at three semantic levels: semantic, instance, and part. Visual guidance employs SAM to segment images; CLIP is used to extract visual and text features for alignment.
  • Fusion Network: A multimodal fusion network is layered atop the backbone and consists of:
    • a backbone feature enhancer (Transformer layers with cross-attention to text anchors),
    • a spatial encoding module combining mask tokens and pooled semantic features,
    • and a mask feature enhancer using cross-attention masked to the predicted region.
  • Classification Head: For each candidate mask, cosine similarity is computed between the mask’s feature embedding and embeddings of user-provided queries, with the maximum determining the mask’s label.
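The idea of cross-attention restricted to a predicted region can be illustrated with a simplified, single-head sketch. This is not SEAL's mask feature enhancer, which consists of learned Transformer layers; the function `masked_cross_attention`, the absence of learned projections, and the toy token values are all assumptions for illustration.

```python
import numpy as np

def masked_cross_attention(query, tokens, mask):
    """Attend from one query vector to backbone tokens inside a mask.

    query: (D,), tokens: (N, D), mask: (N,) bool. Returns ((D,) feature, (N,) weights).
    Hypothetical single-head attention without learned projections.
    """
    mask = np.asarray(mask, dtype=bool)
    if not mask.any():
        raise ValueError("mask must select at least one token")
    logits = tokens @ query / np.sqrt(query.shape[0])
    # Tokens outside the predicted region get -inf so their softmax weight is 0.
    logits = np.where(mask, logits, -np.inf)
    logits = logits - logits[mask].max()  # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum()
    return weights @ tokens, weights

tokens = np.array([[1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [-3.0, 2.0]])
mask = np.array([True, True, False, False])
feature, weights = masked_cross_attention(np.array([1.0, 1.0]), tokens, mask)
print(weights)  # [0.5 0.5 0.  0. ]
```

The two in-mask tokens have equal dot products with the query, so they share the attention equally, while the out-of-mask tokens contribute nothing regardless of their magnitude.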

3. Training Paradigm and Loss Functions

SEAL employs a two-stage training regime:

  • Stage 1: EventSAM pretraining via distillation from SAM image features, using mixed event-image pairs (no human annotations). The backbone is pretrained using cross-entropy and Dice losses.
  • Stage 2: Multimodal fusion training (MHSG) aligns event-based mask features to both CLIP-derived visual features and text features across three levels (semantic, instance, part). The objective is to minimize

$$\mathcal{L}_{\rm distill} = \sum_{l\in\{s,i,p\}} \sum_{k=1}^{K_l} \left[ \left(1-\cos(m_{l,k}^e, v_{l,k}^I)\right) + \alpha \left(1-\cos(m_{l,k}^e, v_{l,k}^T)\right) \right]$$

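where $\cos(a, b)$ is the cosine similarity, $m_{l,k}^e$ is the event-based feature of mask $k$ at level $l$, $v_{l,k}^I$ and $v_{l,k}^T$ are the corresponding CLIP-derived visual and text features, $\alpha$ weights the text-alignment term relative to the visual-alignment term, and $K_l$ is the number of masks at level $l$. After stage 1, the EventSAM backbone and decoder are frozen; only fusion modules and projection layers are updated in stage 2 (Lee et al., 30 Jan 2026).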
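The stage 2 objective can be written out directly as a short NumPy sketch. The dictionary layout keyed by level (`"s"`, `"i"`, `"p"`) and the value `alpha=0.5` are assumptions for illustration; the source does not state the value of $\alpha$.

```python
import numpy as np

def cosine(a, b, eps=1e-12):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def distill_loss(event_feats, visual_feats, text_feats, alpha=0.5):
    """Sum of (1 - cos) alignment terms over levels s, i, p and their masks.

    Each argument maps a level key to an array of shape (K_l, D).
    alpha is a hypothetical default, not a value from the paper.
    """
    total = 0.0
    for level in ("s", "i", "p"):
        for m, v_img, v_txt in zip(event_feats[level], visual_feats[level], text_feats[level]):
            total += (1.0 - cosine(m, v_img)) + alpha * (1.0 - cosine(m, v_txt))
    return total

# One mask per level; event features match the visual targets exactly
# and are orthogonal to the text targets.
e = {lvl: np.array([[1.0, 0.0]]) for lvl in ("s", "i", "p")}
v = {lvl: np.array([[1.0, 0.0]]) for lvl in ("s", "i", "p")}
t = {lvl: np.array([[0.0, 1.0]]) for lvl in ("s", "i", "p")}
loss = distill_loss(e, v, t, alpha=0.5)
print(round(loss, 6))  # 1.5
```

In this toy case the visual terms vanish and each level contributes $\alpha \cdot 1$, giving $3 \times 0.5 = 1.5$.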

4. Benchmark Datasets and Evaluation Protocols

Four OV-EIS benchmarks were curated from the DDD17-Seg and DSEC-Semantic datasets by generating SAM masks and assigning them semantic or part-level labels. Benchmarks differ in class granularity, sequence count, spatial resolution, and segmentation level:

| Benchmark | Classes | Events/Seq | Resolution | Levels |
|---|---|---|---|---|
| DDD17-Ins | 5 | 3,890 | 352×200 | semantic + instance |
| DSEC11-Ins | 7 | 2,809 | 640×440 | semantic + instance |
| DSEC19-Ins | 14 | 2,809 | 640×440 | semantic + instance |
| DSEC-Part | 9 (parts) | 2,809 | 640×440 | part-level only |

The principal metric is Average Precision (AP), computed as the mean of AP@$\tau$ over the IoU thresholds $\tau \in \{0.50, 0.55, \ldots, 0.95\}$, where each AP@$\tau$ is the area under the precision-recall curve at that threshold. AP50 and AP25 are also reported.
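The threshold sweep behind this metric can be sketched with two small helpers: mask IoU and the list of ten thresholds. This is only the matching criterion, not a full AP implementation (ranking by confidence and integrating the precision-recall curve are omitted); the helper names are hypothetical.

```python
import numpy as np

def mask_iou(a, b):
    """Intersection over union of two binary masks of equal shape."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum() / union) if union > 0 else 0.0

# Ten IoU thresholds: 0.50, 0.55, ..., 0.95.
IOU_THRESHOLDS = np.linspace(0.5, 0.95, 10)

def thresholds_passed(pred, gt):
    """Number of thresholds at which `pred` would count as a match for `gt`."""
    return int((mask_iou(pred, gt) >= IOU_THRESHOLDS).sum())

gt = np.array([[1, 1, 1], [0, 0, 0]])
pred = np.array([[1, 1, 0], [0, 0, 0]])
print(round(mask_iou(pred, gt), 3))   # 0.667
print(thresholds_passed(pred, gt))    # 4
```

A prediction with IoU of about 0.667 counts as a true positive at thresholds 0.50 through 0.65 and as a false positive at the six stricter ones, which is why averaging over thresholds rewards tight masks.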

5. Empirical Performance and Efficiency

SEAL achieves state-of-the-art results compared to prior open-vocabulary or event segmentation methods. On the DDD17-Ins, DSEC11-Ins, DSEC19-Ins, and DSEC-Part benchmarks, SEAL demonstrates gains in AP, AP50, and AP25 over all provided baselines, as summarized:

| Benchmark | Metric | Best Baseline | SEAL | Δ | Time (ms) | Params (M) |
|---|---|---|---|---|---|---|
| DDD17-Ins | AP | 29.8 (OpenSeg) | 32.3 | +2.5 | 22.3 | 99.1 |
| DSEC11-Ins | AP50 | 40.3 (frame2recon) | 55.1 | +14.8 | | |
| DSEC19-Ins | AP25 | 26.9 (frame2recon) | 36.3 | +9.4 | | |
| DSEC-Part | AP | 12.9 (VLPart) | 13.6 | +0.7 | | |

SEAL runs at roughly 45 FPS (about 22 ms per frame) on a single RTX 6000 Ada with a model size of 99.1M parameters. This is substantially smaller than dual-backbone alternatives, which require 300–530M parameters.
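The throughput and size figures follow from simple arithmetic, shown here as a quick check. The variable names are illustrative; the inputs are the latency and parameter counts reported above.

```python
latency_ms = 22.3        # reported per-frame latency for SEAL
seal_params_m = 99.1     # reported SEAL parameter count (millions)

# Frames per second is the reciprocal of the per-frame latency.
fps = 1000.0 / latency_ms
print(round(fps, 1))  # 44.8

# How many times larger the reported 300M-530M alternatives are.
size_ratios = [other / seal_params_m for other in (300.0, 530.0)]
print([round(r, 2) for r in size_ratios])
```

The result of about 44.8 FPS is consistent with the rounded 45 FPS figure, and the alternatives are roughly three to five times larger.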

6. Prompt-Free OV-EIS and SEAL⁺⁺

SEAL⁺⁺ extends SEAL to prompt-free, generic, spatiotemporal OV-EIS. It augments the backbone with a lightweight, class-agnostic event detector (RVT+FPN). At inference, detected boxes provide automatic prompts for mask decoding and classification. SEAL⁺⁺ is trained with IoU loss, objectness BCE, and regression on DSEC-Detection and achieves:

| Method | AP | AP50 | AP75 | Time (ms) |
|---|---|---|---|---|
| DEOE+Best Baseline | 16.3 | 21.8 | 18.5 | 400 |
| DEOE+SEAL | 16.4 | 22.1 | 18.8 | 28 |
| SEAL⁺⁺ | 17.8 | 23.6 | 19.2 | 24 |

SEAL⁺⁺ outperforms naive pipelines that chain a separate detector with a segmenter, as well as prior closed-set active object detection systems, while retaining real-time speed (24 ms per frame).
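The prompt-free data flow (detect boxes, use them as prompts for mask decoding, then classify each mask) can be sketched as a small driver function. Everything here is a stand-in: `prompt_free_segment` and the three `toy_*` callables are hypothetical placeholders for the learned detector, mask decoder, and classification head, and do not reflect the actual SEAL⁺⁺ code.

```python
import numpy as np

def prompt_free_segment(voxels, queries, detect, decode_mask, classify):
    """Run detector -> box-prompted mask decoding -> open-vocabulary labelling."""
    results = []
    for box in detect(voxels):                    # class-agnostic boxes as automatic prompts
        mask = decode_mask(voxels, box)           # one binary mask per box prompt
        label = classify(voxels, mask, queries)   # label chosen from the text queries
        results.append((box, mask, label))
    return results

# Toy stand-ins (assumptions, for illustration only).
def toy_detect(voxels):
    active = np.abs(voxels).sum(axis=0) > 0
    if not active.any():
        return []
    ys, xs = np.nonzero(active)
    # Box format (x0, y0, x1, y1) with exclusive upper bounds.
    return [(int(xs.min()), int(ys.min()), int(xs.max()) + 1, int(ys.max()) + 1)]

def toy_decode_mask(voxels, box):
    x0, y0, x1, y1 = box
    mask = np.zeros(voxels.shape[1:], dtype=bool)
    mask[y0:y1, x0:x1] = True
    return mask

def toy_classify(voxels, mask, queries):
    return queries[0]

voxels = np.zeros((3, 4, 6), dtype=np.float32)
voxels[1, 1:3, 2:5] = 1.0
results = prompt_free_segment(voxels, ["car", "person"], toy_detect, toy_decode_mask, toy_classify)
box, mask, label = results[0]
print(box, int(mask.sum()), label)  # (2, 1, 5, 3) 6 car
```

Passing the three stages as callables keeps the sketch agnostic about how each one is implemented; the only point illustrated is that no user-supplied prompt enters the loop.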

7. Significance and Future Prospects

SEAL establishes a rigorous formulation for OV-EIS, demonstrating that a single, parameter-efficient backbone can integrate event-based instance segmentation and open-vocabulary mask classification at multiple semantic granularities. Its gains in accuracy, inference speed, and model compactness suggest that OV-EIS is tractable at scale and for real-time robotics applications. The hierarchical semantic supervision and the prompt-free extension indicate that the framework can be extended to new settings. A plausible implication is that future research will build on open-vocabulary approaches for embodied event perception in unconstrained environments (Lee et al., 30 Jan 2026).

References (1)
