SAM 2: Segment anything in images and videos

N Ravi, V Gabeur, YT Hu, R Hu, C Ryali, T Ma… - arXiv preprint arXiv …, 2024 - arxiv.org
We present Segment Anything Model 2 (SAM 2), a foundation model towards solving
promptable visual segmentation in images and videos. We build a data engine, which …
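
As a concrete illustration of "promptable" segmentation, the sketch below follows the usage pattern published in the facebookresearch/sam2 repository: a point click is supplied as a prompt and the predictor returns candidate masks. The checkpoint path, config name, input image, and exact call signatures are assumptions that may differ between releases, so treat this as an illustrative sketch rather than the paper's reference code.

```python
# Minimal sketch of prompting SAM 2 on a single image, following the pattern
# in the facebookresearch/sam2 README. Checkpoint/config paths and the input
# file are assumed placeholders; exact names may differ between releases.
import numpy as np
import torch
from PIL import Image
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

checkpoint = "./checkpoints/sam2_hiera_large.pt"   # assumed local checkpoint
model_cfg = "sam2_hiera_l.yaml"                    # assumed config name

predictor = SAM2ImagePredictor(build_sam2(model_cfg, checkpoint))
image = np.array(Image.open("frame.jpg").convert("RGB"))  # placeholder image

with torch.inference_mode():
    predictor.set_image(image)
    # A single positive point click as the prompt (label 1 = foreground).
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[480, 320]]),
        point_labels=np.array([1]),
        multimask_output=True,
    )
    best_mask = masks[np.argmax(scores)]  # keep the highest-scoring candidate
```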

Transformer-based visual segmentation: A survey

X Li, H Ding, H Yuan, W Zhang, J Pang… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
Visual segmentation seeks to partition images, video frames, or point clouds into multiple
segments or groups. This technique has numerous real-world applications, such as …

Putting the object back into video object segmentation

HK Cheng, SW Oh, B Price, JY Lee… - Proceedings of the …, 2024 - openaccess.thecvf.com
We present Cutie, a video object segmentation (VOS) network with object-level memory
reading, which puts the object representation from memory back into the video object …
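
The object-level memory reading described here can be pictured as a round trip: a small set of object queries summarizes the target from pixel-level memory, and that summary is injected back into the per-pixel readout. The minimal PyTorch sketch below illustrates that pattern with made-up dimensions and module names; it is not Cutie's actual architecture.

```python
# Minimal sketch of object-level memory reading: learned object queries
# cross-attend into pixel-level memory features, and the resulting object
# summary is broadcast back to enrich the per-pixel readout.
# Dimensions and module layout are illustrative, not Cutie's actual design.
import torch
import torch.nn as nn

class ObjectMemoryReader(nn.Module):
    def __init__(self, dim=256, num_queries=16, num_heads=8):
        super().__init__()
        self.object_queries = nn.Parameter(torch.randn(num_queries, dim))
        self.read_from_pixels = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.write_to_pixels = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, pixel_memory):          # pixel_memory: (B, H*W, dim)
        B = pixel_memory.shape[0]
        queries = self.object_queries.unsqueeze(0).expand(B, -1, -1)
        # Object queries summarize the object from pixel-level memory.
        obj_tokens, _ = self.read_from_pixels(queries, pixel_memory, pixel_memory)
        # The object summary is put back: pixels attend to the object tokens.
        enriched, _ = self.write_to_pixels(pixel_memory, obj_tokens, obj_tokens)
        return pixel_memory + enriched        # residual per-pixel readout

features = torch.randn(2, 24 * 24, 256)       # dummy memory for a 24x24 grid
reader = ObjectMemoryReader()
out = reader(features)                        # (2, 576, 256)
```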

OmniTokenizer: A joint image-video tokenizer for visual generation

J Wang, Y Jiang, Z Yuan, B Peng… - Advances in Neural …, 2025 - proceedings.neurips.cc
A tokenizer, serving as a translator that maps intricate visual data into a compact latent
space, lies at the core of visual generative models. Based on the finding that existing …
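
To make the "compact latent space" concrete: a typical visual tokenizer encodes patches into continuous latents and snaps each one to its nearest entry in a learned codebook. The sketch below shows only that quantization step with toy shapes; it is not OmniTokenizer's encoder or codebook design.

```python
# Minimal sketch of the quantization step in a VQ-style visual tokenizer:
# continuous patch latents are mapped to indices of their nearest codebook
# entries, giving a compact discrete representation. Shapes are toy values.
import torch

def quantize(latents, codebook):
    # latents:  (num_tokens, dim) continuous encoder outputs
    # codebook: (vocab_size, dim) learned embedding table
    dists = torch.cdist(latents, codebook)        # pairwise L2 distances
    indices = dists.argmin(dim=1)                 # nearest code per token
    quantized = codebook[indices]                 # discrete reconstruction
    return indices, quantized

torch.manual_seed(0)
codebook = torch.randn(8192, 64)                  # 8192-entry codebook
patch_latents = torch.randn(16 * 16, 64)          # latents for a 16x16 patch grid
token_ids, quantized = quantize(patch_latents, codebook)
print(token_ids.shape, quantized.shape)           # (256,) and (256, 64)
```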

OmniVid: A generative framework for universal video understanding

J Wang, D Chen, C Luo, B He, L Yuan… - Proceedings of the …, 2024 - openaccess.thecvf.com
The core of video understanding tasks, such as recognition, captioning, and tracking, is to
automatically detect objects or actions in a video and analyze their temporal evolution …

Vitron: A unified pixel-level vision llm for understanding, generating, segmenting, editing

H Fei, S Wu, H Zhang, TS Chua, S Yan - arXiv preprint arXiv:2412.19806, 2024 - arxiv.org
Recent developments in vision large language models (LLMs) have seen remarkable
progress, yet these models still encounter challenges in becoming multimodal generalists, such as coarse …

ChatVideo: A tracklet-centric multimodal and versatile video understanding system

J Wang, D Chen, C Luo, X Dai, L Yuan, Z Wu… - arXiv preprint arXiv …, 2023 - arxiv.org
Existing deep video models are limited by specific tasks, fixed input-output spaces, and poor
generalization capabilities, making it difficult to deploy them in real-world scenarios. In this …

Time does tell: Self-supervised time-tuning of dense image representations

M Salehi, E Gavves, CGM Snoek… - Proceedings of the …, 2023 - openaccess.thecvf.com
Spatially dense self-supervised learning is a rapidly growing problem domain with
promising applications for unsupervised segmentation and pretraining for dense …

Exploring pre-trained text-to-video diffusion models for referring video object segmentation

Z Zhu, X Feng, D Chen, J Yuan, C Qiao… - European Conference on …, 2024 - Springer
In this paper, we explore the visual representations produced by a pre-trained text-to-
video (T2V) diffusion model for video understanding tasks. We hypothesize that the latent …
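
Probing a pre-trained diffusion model's representations usually amounts to registering forward hooks on intermediate blocks and collecting the feature maps they emit. The sketch below demonstrates that hook mechanism on a tiny stand-in network; in the actual setting the hooks would be attached to blocks of a T2V UNet, so all module names and shapes here are assumptions.

```python
# Minimal sketch of harvesting intermediate features with forward hooks, the
# usual way diffusion representations are probed for segmentation. The tiny
# conv stack below stands in for a T2V UNet block; names and shapes are toy.
import torch
import torch.nn as nn

backbone = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
)

captured = {}

def save_output(name):
    def hook(module, inputs, output):
        captured[name] = output.detach()          # keep the intermediate feature map
    return hook

# Register hooks on the layers whose representations we want to probe.
backbone[0].register_forward_hook(save_output("block0"))
backbone[2].register_forward_hook(save_output("block1"))

frames = torch.randn(4, 3, 64, 64)                # a short clip of 4 frames
with torch.no_grad():
    backbone(frames)

# These captured maps would feed a lightweight segmentation head downstream.
print({k: v.shape for k, v in captured.items()})
```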

Joint modeling of feature, correspondence, and a compressed memory for video object segmentation

J Zhang, Y Cui, G Wu, L Wang - arXiv preprint arXiv:2308.13505, 2023 - arxiv.org
Current prevailing Video Object Segmentation (VOS) methods usually perform dense
matching between the current and reference frames after extracting their features. On one …
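
Dense matching in this setting means computing an affinity between every current-frame feature and every reference-frame feature, then using it to propagate the reference mask. The sketch below shows that readout with toy tensors; the compressed-memory component proposed in the paper is not modeled.

```python
# Minimal sketch of the dense-matching readout used by memory-based VOS:
# an affinity between current-frame and reference-frame features propagates
# the reference mask to the current frame. Feature shapes are toy values.
import torch
import torch.nn.functional as F

def propagate_mask(cur_feat, ref_feat, ref_mask, temperature=0.07):
    # cur_feat: (HW, C) current-frame features; ref_feat: (HW, C) reference features
    # ref_mask: (HW, K) soft per-pixel labels of the reference frame (K classes)
    cur = F.normalize(cur_feat, dim=1)
    ref = F.normalize(ref_feat, dim=1)
    affinity = cur @ ref.t() / temperature        # (HW_cur, HW_ref) similarities
    weights = affinity.softmax(dim=1)             # each current pixel attends to reference pixels
    return weights @ ref_mask                     # propagated soft mask, (HW_cur, K)

H = W = 30
cur_feat = torch.randn(H * W, 256)
ref_feat = torch.randn(H * W, 256)
ref_mask = F.one_hot(torch.randint(0, 2, (H * W,)), num_classes=2).float()
pred = propagate_mask(cur_feat, ref_feat, ref_mask)   # (900, 2)
```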