MoTE: Reconciling Generalization with Specialization for Visual-Language to Video Knowledge Transfer

M Zhu, Z Wang, M Hu, R Dang, X Lin, X Zhou… - arXiv preprint arXiv …, 2024 - arxiv.org
Transferring visual-language knowledge from large-scale foundation models for video
recognition has proved to be effective. To bridge the domain gap, additional parametric …

Affinity3D: Propagating Instance-Level Semantic Affinity for Zero-Shot Point Cloud Semantic Segmentation

H Liu, J Zhuo, C Liang, J Chen, H Ma - Proceedings of the 32nd ACM …, 2024 - dl.acm.org
Zero-shot point cloud semantic segmentation aims to recognize novel classes at the point
level. Previous methods mainly transfer excellent zero-shot generalization capabilities from …

Building a Multi-modal Spatiotemporal Expert for Zero-shot Action Recognition with CLIP

Y Yu, C Cao, Y Zhang, Q Lv, L Min, Y Zhang - arXiv preprint arXiv …, 2024 - arxiv.org
Zero-shot action recognition (ZSAR) requires collaborative multi-modal spatiotemporal
understanding. However, fine-tuning CLIP directly for ZSAR yields suboptimal performance …

Text-Enhanced Zero-Shot Action Recognition: A Training-Free Approach

M Bosetti, S Zhang, B Liberatori, G Zara, E Ricci… - … Conference on Pattern …, 2025 - Springer
Vision-language models (VLMs) have demonstrated remarkable performance across
various visual tasks, leveraging joint learning of visual and textual representations. While …