Bridging the gap: A unified video comprehension framework for moment retrieval and highlight detection

Y Xiao, Z Luo, Y Liu, Y Ma, H Bian… - Proceedings of the …, 2024 - openaccess.thecvf.com
Abstract Video Moment Retrieval (MR) and Highlight Detection (HD) have attracted
significant attention due to the growing demand for video analysis. Recent approaches treat …

Soc: Semantic-assisted object cluster for referring video object segmentation

Z Luo, Y Xiao, Y Liu, S Li, Y Wang… - Advances in …, 2023 - proceedings.neurips.cc
This paper studies referring video object segmentation (RVOS) by boosting video-level
visual-linguistic alignment. Recent approaches model the RVOS task as a sequence …

Etdnet: Efficient transformer-based detection network for surface defect detection

H Zhou, R Yang, R Hu, C Shu… - IEEE transactions on …, 2023 - ieeexplore.ieee.org
Deep learning (DL)-based surface defect detectors play a crucial role in ensuring product
quality during inspection processes. However, accurately and efficiently detecting defects …

MambaTree: Tree Topology is All You Need in State Space Model

Y Xiao, L Song, J Wang, S Song… - Advances in Neural …, 2025 - proceedings.neurips.cc
The state space models, employing recursively propagated features, demonstrate strong
representation capabilities comparable to Transformer models and superior efficiency …

Efficient prompt tuning of large vision-language model for fine-grained ship classification

L Lan, F Wang, X Zheng, Z Wang… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
Remote-sensing fine-grained ship classification (RS-FGSC) poses a significant challenge
due to the high similarity between classes and the limited availability of labeled data, limiting …

Audio-free prompt tuning for language-audio models

Y Li, X Wang, H Liu - ICASSP 2024-2024 IEEE International …, 2024 - ieeexplore.ieee.org
Contrastive Language-Audio Pretraining (CLAP) is pre-trained to associate audio features
with human language, making it a natural zero-shot classifier to recognize unseen sound …

Video object segmentation with dynamic query modulation

H Zhou, R Hu, X Li - 2024 IEEE International Conference on …, 2024 - ieeexplore.ieee.org
Storing intermediate frame segmentations as memory for long-range context modeling,
spatial-temporal memory-based methods have recently showcased impressive results in …

Multimodal Isotropic Neural Architecture with Patch Embedding

H Truchan, E Naumov, R Abedin, G Palmer… - … Conference on Neural …, 2023 - Springer
Patch embedding has been a significant advancement in Transformer-based models,
particularly the Vision Transformer (ViT), as it enables handling larger image sizes and …