DINO-Tracker: Taming DINO for Self-Supervised Point Tracking in a Single Video

N Tumanyan, A Singer, S Bagon, T Dekel - European Conference on …, 2024 - Springer
We present DINO-Tracker, a new framework for long-term dense tracking in video. The pillar
of our approach is combining test-time training on a single video with the powerful localized …

RAM: Retrieval-Based Affordance Transfer for Generalizable Zero-Shot Robotic Manipulation

Y Kuang, J Ye, H Geng, J Mao, C Deng… - arXiv preprint arXiv …, 2024 - arxiv.org
This work proposes a retrieve-and-transfer framework for zero-shot robotic manipulation,
dubbed RAM, featuring generalizability across various objects, environments, and …

Diffusion models and representation learning: A survey

M Fuest, P Ma, M Gui, JS Fischer, VT Hu… - arXiv preprint arXiv …, 2024 - arxiv.org
Diffusion Models are popular generative modeling methods in various vision tasks, attracting
significant attention. They can be considered a unique instance of self-supervised learning …

Improving semantic correspondence with viewpoint-guided spherical maps

O Mariotti, O Mac Aodha… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
Recent self-supervised models produce visual features that are effective at encoding not only
image-level but also pixel-level semantics. They have been reported to obtain …

Can Visual Foundation Models Achieve Long-term Point Tracking?

G Aydemir, W **e, F Güney - arxiv preprint arxiv:2408.13575, 2024 - arxiv.org
Large-scale vision foundation models have demonstrated remarkable success across
various tasks, underscoring their robust generalization capabilities. While their proficiency in …

Law of Vision Representation in MLLMs

S Yang, B Zhai, Q You, J Yuan, H Yang… - arXiv preprint arXiv …, 2024 - arxiv.org
We present the "Law of Vision Representation" in multimodal large language models
(MLLMs). It reveals a strong correlation between the combination of cross-modal alignment …

Toward a Holistic Evaluation of Robustness in CLIP Models

W Tu, W Deng, T Gedeon - arXiv preprint arXiv:2410.01534, 2024 - arxiv.org
Contrastive Language-Image Pre-training (CLIP) models have shown significant potential,
particularly in zero-shot classification across diverse distribution shifts. Building on existing …

Click to Grasp: Zero-Shot Precise Manipulation via Visual Diffusion Descriptors

N Tsagkas, J Rome, S Ramamoorthy… - 2024 IEEE/RSJ …, 2024 - ieeexplore.ieee.org
Precise manipulation that is generalizable across scenes and objects remains a persistent
challenge in robotics. Current approaches for this task heavily depend on having a …

CleanDIFT: Diffusion Features without Noise

N Stracke, SA Baumann, K Bauer, F Fundel… - arXiv preprint arXiv …, 2024 - arxiv.org
Internal features from large-scale pre-trained diffusion models have recently been
established as powerful semantic descriptors for a wide range of downstream tasks. Works …

ProTracker: Probabilistic Integration for Robust and Accurate Point Tracking

T Zhang, C Wang, Z Dou, Q Gao, J Lei, B Chen… - arXiv preprint arXiv …, 2025 - arxiv.org
In this paper, we propose ProTracker, a novel framework for robust and accurate long-term
dense tracking of arbitrary points in videos. The key idea of our method is to incorporate …