MovieChat: From dense token to sparse memory for long video understanding

E Song, W Chai, G Wang, Y Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com
Recently, integrating video foundation models and large language models to build a video
understanding system can overcome the limitations of specific pre-defined vision tasks. Yet …

Transformer-based visual segmentation: A survey

X Li, H Ding, H Yuan, W Zhang, J Pang… - … on Pattern Analysis …, 2024 - ieeexplore.ieee.org
Visual segmentation seeks to partition images, video frames, or point clouds into multiple
segments or groups. This technique has numerous real-world applications, such as …

DVIS: Decoupled video instance segmentation framework

T Zhang, X Tian, Y Wu, S Ji, X Wang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Video instance segmentation (VIS) is a critical task with diverse applications, including
autonomous driving and video editing. Existing methods often underperform on complex …

Improving video segmentation via dynamic anchor queries

Y Zhou, T Zhang, S Ji, S Yan, X Li - European Conference on Computer …, 2024 - Springer
Modern video segmentation methods adopt feature transitions between anchor and target
queries to perform cross-frame object association. The smooth feature transitions between …

General and Task-Oriented Video Segmentation

M Chen, L Li, W Wang, R Quan, Y Yang - European Conference on …, 2024 - Springer
We present GvSeg, a general video segmentation framework for addressing four different
video segmentation tasks (i.e., instance, semantic, panoptic, and exemplar-guided) while …

Unified embedding alignment for open-vocabulary video instance segmentation

H Fang, P Wu, Y Li, X Zhang, X Lu - European Conference on Computer …, 2024 - Springer
Open-Vocabulary Video Instance Segmentation (VIS) is attracting increasing
attention due to its ability to segment and track arbitrary objects. However, the recent Open …

Language-driven visual consensus for zero-shot semantic segmentation

Z Zhang, W Ke, Y Zhu, X Liang, J Liu… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
The pre-trained vision-language model, exemplified by CLIP [1], advances zero-shot
semantic segmentation by aligning visual features with class embeddings through a …

DVIS++: Improved decoupled framework for universal video segmentation

T Zhang, X Tian, Y Zhou, S Ji, X Wang, X Tao… - arXiv preprint arXiv …, 2023 - arxiv.org
We present the Decoupled VIdeo Segmentation (DVIS) framework, a
novel approach for the challenging task of universal video segmentation, including video …

ChatVTG: Video Temporal Grounding via Chat with Video Dialogue Large Language Models

M Qu, X Chen, W Liu, A Li… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
Video Temporal Grounding (VTG) aims to ground specific segments within an
untrimmed video corresponding to the given natural language query. Existing VTG methods …

Scale-aware token-matching for transformer-based object detector

A Jung, S Hong, Y Hyun - Pattern Recognition Letters, 2024 - Elsevier
Owing to the advancements in deep learning, object detection has made significant
progress in estimating the positions and classes of multiple objects within an image …