Four Types of Baselines under Three Categories. The three categories (AR-CDG, Hybrid, AF-DA) are designed as baselines based on the strategy of transferring image-domain knowledge to the event domain. Refer to Sec. 3 and Supplementary Material Sec. A.3 for further details. These proposed baselines have two limitations: 1) the AR-CDG and Hybrid categories still suffer from a large domain gap between images and events despite the E2VID reconstruction; 2) all baselines require two distinct backbones for mask generation and classification, which degrades parameter and inference efficiency.
Overall framework of SEAL. The MHSG module (Red and Green) provides rich multimodal semantic guidance across multiple levels of granularity, including part-level and instance-level. The multimodal fusion network of SEAL (Purple) enhances class prediction for a given binary mask by encoding rich semantic and spatial priors to produce a CLIP-aligned mask feature.
Given multiple bounding boxes that cover all of the objects in the current frame, our SEAL is asked to classify the correct instance masks according to the given free-form language query. Our visualization results demonstrate that SEAL can recognize diverse instances in response to both noun-level and sentence-level queries.
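The query-based classification described above can be sketched as a nearest-neighbor search in CLIP space: each candidate mask yields a CLIP-aligned feature, and the mask whose feature best matches the text embedding of the language query is selected. This is a minimal sketch under that assumption; the function and variable names are illustrative, not the paper's actual API.

```python
import numpy as np

def classify_masks(mask_features, query_feature):
    """Pick the candidate mask whose CLIP-aligned feature best matches the query.

    mask_features: (N, D) array, one CLIP-aligned feature per candidate mask
                   (e.g. produced by SEAL's multimodal fusion network).
    query_feature: (D,) CLIP text embedding of the free-form language query.
    Returns the index of the best-matching mask.
    """
    # Cosine similarity reduces to a dot product after L2 normalization.
    m = mask_features / np.linalg.norm(mask_features, axis=1, keepdims=True)
    q = query_feature / np.linalg.norm(query_feature)
    return int(np.argmax(m @ q))
```

Because both mask features and query features live in the shared CLIP embedding space, the same routine handles noun-level and sentence-level queries without retraining.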















Given a single point prompt, our SEAL is asked to classify the generated mask at multiple levels of granularity. Since the mask generator of SEAL follows the architecture of SAM, it generates three masks of different granularity for a single point prompt. We use two of them for visualization: the coarsest and the finest. The figures below show that SEAL can recognize objects at both instance-level and part-level granularity.
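Selecting the coarsest and finest of SAM's three per-prompt masks can be sketched as follows. As a simplifying assumption, pixel area serves as the proxy for granularity (largest area = coarsest, instance-level; smallest = finest, part-level); the paper may rank the masks differently, e.g. by SAM's predicted IoU scores.

```python
import numpy as np

def pick_coarsest_finest(masks):
    """From SAM-style multi-granularity outputs, keep the coarsest and finest.

    masks: (3, H, W) boolean array, the three masks SAM-style generators
           return for a single point prompt.
    Uses pixel area as a proxy for granularity (an assumption for this sketch).
    """
    # Area of each mask = number of True pixels.
    areas = masks.reshape(len(masks), -1).sum(axis=1)
    coarsest = masks[int(np.argmax(areas))]   # instance-level candidate
    finest = masks[int(np.argmin(areas))]     # part-level candidate
    return coarsest, finest
```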


















@article{lee2026segment,
title={Segment Any Events with Language},
author={Lee, Seungjun and Lee, Gim Hee},
journal={arXiv preprint arXiv:2601.23159},
year={2026}
}