Bridging the gap: A unified video comprehension framework for moment retrieval and highlight detection
Video Moment Retrieval (MR) and Highlight Detection (HD) have attracted
significant attention due to the growing demand for video analysis. Recent approaches treat …
Open-vocabulary segmentation with semantic-assisted calibration
This paper studies open-vocabulary segmentation (OVS) through calibrating in-vocabulary
and domain-biased embedding space with generalized contextual prior of CLIP. As the core …
Universal segmentation at arbitrary granularity with language instruction
This paper aims to achieve universal segmentation of arbitrary semantic level. Despite
significant progress in recent years, specialist segmentation approaches are limited to …
Decoupling static and hierarchical motion perception for referring video segmentation
Referring video segmentation relies on natural language expressions to identify and
segment objects, often emphasizing motion clues. Previous works treat a sentence as a …
Towards noise-tolerant speech-referring video object segmentation: Bridging speech and text
Linguistic communication is prevalent in Human-Computer Interaction (HCI). Speech
(spoken language) serves as a convenient yet potentially ambiguous form due to noise and …
LoSh: Long-short text joint prediction network for referring video object segmentation
Referring video object segmentation (RVOS) aims to segment the target instance referred by
a given text expression in a video clip. The text expression normally contains sophisticated …
Temporally consistent referring video object segmentation with hybrid memory
Referring Video Object Segmentation (R-VOS) methods face challenges in maintaining
consistent object segmentation due to temporal context variability and the presence of other …
Explicit Granularity and Implicit Scale Correspondence Learning for Point-Supervised Video Moment Localization
Video moment localization (VML) aims to identify the temporal boundary semantically
matching the given query. Point-supervised VML balances localization accuracy and …
Efficient prompt tuning of large vision-language model for fine-grained ship classification
Remote-sensing fine-grained ship classification (RS-FGSC) poses a significant challenge
due to the high similarity between classes and the limited availability of labeled data, limiting …
Cross-modal cognitive consensus guided audio-visual segmentation
Audio-Visual Segmentation (AVS) aims to extract the sounding object from a video frame,
which is represented by a pixel-wise segmentation mask for application scenarios such as …