SEAL logo

Segment Any Events with Language

ICLR 2026

Department of Computer Science, National University of Singapore

SEAL is the first Semantic-aware Segment Any Events model!


Given box prompts, our SEAL outputs both instance-level masks and open-vocabulary semantics.


Given a point prompt, our SEAL recognizes and classifies both instance-level and part-level masks.

Class-agnostic object detection

I wanna drive.

Walking on the sidewalk.


Based on our SEAL, we further propose SEAL++, which supports 1) prompt-free and 2) spatiotemporal Open-Vocabulary Event Instance Segmentation (OV-EIS). The left video shows class-agnostic detection from SEAL++. The two videos on the right present spatiotemporal OV-EIS.

Scene 1.

Scene 2.


Our SEAL++ also supports spatiotemporal OV-EIS with multiple semantics.
In the demo above, green indicates the Vehicle class and white indicates the Human class.

Abstract

Scene understanding with free-form language has been widely explored across diverse modalities such as images, point clouds, and LiDAR. However, related studies on event sensors are scarce or narrowly centered on semantic-level understanding. We introduce SEAL, the first Semantic-aware Segment Any Events framework, which addresses Open-Vocabulary Event Instance Segmentation (OV-EIS). Given a visual prompt, our model provides a unified framework that supports both event segmentation and open-vocabulary mask classification at multiple levels of granularity, including the instance level and the part level. To enable thorough evaluation of OV-EIS, we curate four benchmarks that cover label granularity from coarse to fine class configurations and semantic granularity from instance-level to part-level understanding. Extensive experiments show that SEAL substantially outperforms the proposed baselines in both performance and inference speed with a parameter-efficient architecture. In the Appendix, we further present a simple variant of SEAL that achieves generic spatiotemporal OV-EIS without requiring any visual prompts from users at inference. The code will be publicly available.

Contributions


  1. For the first time, we address the Open-Vocabulary Event Instance Segmentation (OV-EIS) problem, an important yet largely overlooked research challenge.
  2. We propose four types of simple baselines that enable OV-EIS and discuss their limitations.
  3. Given the limitations of the proposed baselines, we introduce SEAL, an advanced OV-EIS framework that achieves the best performance and the fastest real-time inference speed with a lightweight architecture.
  4. We propose four new benchmarks to thoroughly evaluate OV-EIS across diverse scenarios.

Proposed Baselines


Four types of baselines under three categories. The three categories (AR-CDG, Hybrid, AF-DA) are designed according to the strategy used to transfer image-domain knowledge to the event domain. Refer to Sec. 3 and Supplementary Material Sec. A.3 for further details. These baselines have two limitations: 1) the AR-CDG and Hybrid categories still suffer from a large domain gap between images and events despite the E2VID reconstruction, and 2) all baselines require two distinct backbones for mask generation and classification, which degrades parameter and inference efficiency.
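The second limitation above can be made concrete with a sketch of the baseline pipeline shape: events are first reconstructed into a frame (E2VID-style), then one backbone produces masks and a second, separate backbone classifies them. Every function here is a trivial stub standing in for a real model; the names and logic are illustrative assumptions, and only the two-backbone structure mirrors the description above.

```python
import numpy as np

def e2vid_reconstruct(events: np.ndarray) -> np.ndarray:
    """Stub reconstruction: collapse an event voxel grid (T, H, W) into a frame."""
    return events.sum(axis=0)

def mask_backbone(frame: np.ndarray) -> np.ndarray:
    """Stub mask generator (backbone #1): threshold the frame into a binary mask."""
    return (frame > frame.mean()).astype(np.uint8)

def classify_backbone(frame: np.ndarray, mask: np.ndarray) -> str:
    """Stub open-vocabulary classifier (backbone #2) over the masked region."""
    return "vehicle" if mask.sum() > mask.size // 2 else "human"

def baseline_ov_eis(events: np.ndarray):
    frame = e2vid_reconstruct(events)        # stage 1: bridge the domain gap
    mask = mask_backbone(frame)              # stage 2: backbone #1 for masks
    label = classify_backbone(frame, mask)   # stage 3: backbone #2 for semantics
    return mask, label

mask, label = baseline_ov_eis(np.ones((2, 4, 4)))
```

Running two full backbones per input is precisely the parameter and inference inefficiency that motivates SEAL's unified single-framework design.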

Overall Framework of SEAL


Overall framework of SEAL. The MHSG module (Red and Green) provides rich multimodal semantic guidance across multiple levels of granularity including part-level and instance-level. The multimodal fusion network of SEAL (Purple) enhances class prediction from a given binary mask by encoding rich semantic and spatial priors to produce a CLIP-aligned mask feature.
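One way to picture the "CLIP-aligned mask feature" mentioned in the caption is mask-average pooling followed by a projection into the text-embedding space. The sketch below uses a random linear map where SEAL has a learned multimodal fusion network; all shapes, names, and the pooling rule are illustrative assumptions rather than the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, C, D = 8, 8, 16, 4        # toy feature-map size, feature dim, text dim

feat_map = rng.normal(size=(H, W, C))   # per-pixel event features (hypothetical)
mask = np.zeros((H, W), dtype=bool)
mask[2:6, 2:6] = True                   # binary mask obtained from the prompt

# Mask-average pooling: one C-dim descriptor for the masked region.
pooled = feat_map[mask].mean(axis=0)

# Stand-in for the learned fusion/projection into the CLIP text space.
W_proj = rng.normal(size=(C, D))
mask_feat = pooled @ W_proj
mask_feat /= np.linalg.norm(mask_feat)  # unit norm for cosine scoring
```

The resulting unit vector can then be scored against class-name text embeddings by cosine similarity, which is what makes the classification open-vocabulary.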

Qualitative Results


Given multiple bounding boxes that cover all of the objects in the current frame, our SEAL is asked to classify the correct instance masks according to the given free-form language query. The visualization results demonstrate that SEAL can recognize diverse instances in response to both noun-level and sentence-level queries.

Given a single point prompt, our SEAL is asked to classify the generated mask at multiple levels of granularity. Since the mask generator of SEAL follows the architecture of SAM, it generates three masks of different granularity for a single point prompt. We visualize two of them: the coarsest and the finest. The figures below show that SEAL can recognize objects at both instance-level and part-level granularities.
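Selecting the coarsest and finest of the three SAM-style masks can be done simply by pixel area, as sketched below. This selection rule is our illustrative assumption for the demo; the paper may pick masks differently (e.g. by predicted quality scores).

```python
import numpy as np

def coarsest_and_finest(masks: np.ndarray):
    """masks: (3, H, W) boolean array -> (coarsest, finest) selected by pixel area."""
    areas = masks.reshape(3, -1).sum(axis=1)
    return masks[int(np.argmax(areas))], masks[int(np.argmin(areas))]

# Toy masks: whole object, a sub-region, and a small part.
m = np.zeros((3, 6, 6), dtype=bool)
m[0, :4, :4] = True   # 16 px (instance-level)
m[1, :3, :3] = True   #  9 px
m[2, :1, :2] = True   #  2 px (part-level)
coarse, fine = coarsest_and_finest(m)
# coarse.sum() == 16, fine.sum() == 2
```

The coarsest mask tends to correspond to the instance-level prediction and the finest to the part-level one, matching the two granularities visualized in the figures.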

BibTeX

@article{lee2026segment,
  title={Segment Any Events with Language},
  author={Lee, Seungjun and Lee, Gim Hee},
  journal={arXiv preprint arXiv:2601.23159},
  year={2026}
}