MoTE: Reconciling Generalization with Specialization for Visual-Language to Video Knowledge Transfer
Transferring visual-language knowledge from large-scale foundation models for video recognition has proved to be effective. To bridge the domain gap, additional parametric …
Affinity3D: Propagating Instance-Level Semantic Affinity for Zero-Shot Point Cloud Semantic Segmentation
Zero-shot point cloud semantic segmentation aims to recognize novel classes at the point level. Previous methods mainly transfer excellent zero-shot generalization capabilities from …
Building a Multi-modal Spatiotemporal Expert for Zero-shot Action Recognition with CLIP
Zero-shot action recognition (ZSAR) requires collaborative multi-modal spatiotemporal understanding. However, fine-tuning CLIP directly for ZSAR yields suboptimal performance …
Text-Enhanced Zero-Shot Action Recognition: A Training-Free Approach
Vision-language models (VLMs) have demonstrated remarkable performance across various visual tasks, leveraging joint learning of visual and textual representations. While …