MovieChat: From dense token to sparse memory for long video understanding
Recently, integrating video foundation models and large language models to build a video
understanding system can overcome the limitations of specific pre-defined vision tasks. Yet …
Transformer-based visual segmentation: A survey
Visual segmentation seeks to partition images, video frames, or point clouds into multiple
segments or groups. This technique has numerous real-world applications, such as …
DVIS: Decoupled video instance segmentation framework
Video instance segmentation (VIS) is a critical task with diverse applications, including
autonomous driving and video editing. Existing methods often underperform on complex …
Improving video segmentation via dynamic anchor queries
Modern video segmentation methods adopt feature transitions between anchor and target
queries to perform cross-frame object association. The smooth feature transitions between …
General and Task-Oriented Video Segmentation
We present GvSeg, a general video segmentation framework for addressing four different
video segmentation tasks (i.e., instance, semantic, panoptic, and exemplar-guided) while …
Unified embedding alignment for open-vocabulary video instance segmentation
Open-Vocabulary Video Instance Segmentation (VIS) is attracting increasing
attention due to its ability to segment and track arbitrary objects. However, the recent Open …
Language-driven visual consensus for zero-shot semantic segmentation
The pre-trained vision-language model, exemplified by CLIP [1], advances zero-shot
semantic segmentation by aligning visual features with class embeddings through a …
DVIS++: Improved decoupled framework for universal video segmentation
We present the Decoupled VIdeo Segmentation (DVIS) framework, a
novel approach for the challenging task of universal video segmentation, including video …
ChatVTG: Video Temporal Grounding via Chat with Video Dialogue Large Language Models
Video Temporal Grounding (VTG) aims to ground specific segments within an
untrimmed video corresponding to the given natural language query. Existing VTG methods …
Scale-aware token-matching for transformer-based object detector
Owing to the advancements in deep learning, object detection has made significant
progress in estimating the positions and classes of multiple objects within an image …