MoTE: Reconciling Generalization with Specialization for Visual-Language to Video Knowledge Transfer

M Zhu, Z Wang, M Hu, R Dang, X Lin, X Zhou… - arXiv preprint arXiv …, 2024 - arxiv.org
Transferring visual-language knowledge from large-scale foundation models for video
recognition has proved to be effective. To bridge the domain gap, additional parametric …

Affinity3D: Propagating Instance-Level Semantic Affinity for Zero-Shot Point Cloud Semantic Segmentation

H Liu, J Zhuo, C Liang, J Chen, H Ma - Proceedings of the 32nd ACM …, 2024 - dl.acm.org
Zero-shot point cloud semantic segmentation aims to recognize novel classes at the point
level. Previous methods mainly transfer excellent zero-shot generalization capabilities from …

Building a Multi-modal Spatiotemporal Expert for Zero-shot Action Recognition with CLIP

Y Yu, C Cao, Y Zhang, Q Lv, L Min, Y Zhang - arXiv preprint arXiv …, 2024 - arxiv.org
Zero-shot action recognition (ZSAR) requires collaborative multi-modal spatiotemporal
understanding. However, fine-tuning CLIP directly for ZSAR yields suboptimal performance …

Text-Enhanced Zero-Shot Action Recognition: A Training-Free Approach

M Bosetti, S Zhang, B Liberatori, G Zara, E Ricci… - … Conference on Pattern …, 2025 - Springer
Vision-language models (VLMs) have demonstrated remarkable performance across
various visual tasks, leveraging joint learning of visual and textual representations. While …