R²-Tuning: Efficient Image-to-Video Transfer Learning for Video Temporal Grounding

Y Liu, J He, W Li, J Kim, D Wei, H Pfister… - European Conference on …, 2024 - Springer
Video temporal grounding (VTG) is a fine-grained video understanding problem that aims to
ground relevant clips in untrimmed videos given natural language queries. Most existing …

Rethinking CLIP-based video learners in cross-domain open-vocabulary action recognition

KY Lin, H Ding, J Zhou, YM Tang, YX Peng… - arXiv preprint arXiv …, 2024 - arxiv.org
Building upon the impressive success of CLIP (Contrastive Language-Image Pretraining),
recent pioneering works have proposed to adapt the powerful CLIP to video data, leading to …

OST: Refining Text Knowledge with Optimal Spatio-Temporal Descriptor for General Video Recognition

T Chen, H Yu, Z Yang, Z Li, W Sun… - Proceedings of the …, 2024 - openaccess.thecvf.com
Due to the resource-intensive nature of training vision-language models on expansive video
data, a majority of studies have centered on adapting pre-trained image-language models to …

Side4Video: Spatial-temporal side network for memory-efficient image-to-video transfer learning

H Yao, W Wu, Z Li - arXiv preprint arXiv:2311.15769, 2023 - arxiv.org
Large pre-trained vision models achieve impressive success in computer vision. However,
fully fine-tuning large models for downstream tasks, particularly in video understanding, can …

Enhanced Motion-Text Alignment for Image-to-Video Transfer Learning

W Zhang, C Wan, T Liu, X Tian… - Proceedings of the …, 2024 - openaccess.thecvf.com
Extending large image-text pre-trained models (e.g., CLIP) for video understanding has made
significant advancements. To enable CLIP to perceive dynamic information …

Rethinking image-to-video adaptation: An object-centric perspective

R Qian, S Ding, D Lin - European Conference on Computer Vision, 2024 - Springer
Image-to-video adaptation seeks to efficiently adapt image models for use in the video
domain. Instead of fine-tuning the entire image backbone, many image-to-video adaptation …

Generating action-conditioned prompts for open-vocabulary video action recognition

C Jia, M Luo, X Chang, Z Dang, M Han… - Proceedings of the …, 2024 - dl.acm.org
Exploring open-vocabulary video action recognition is a promising venture, which aims to
recognize previously unseen actions within an arbitrary set of categories. Existing methods …

MoTE: Reconciling generalization with specialization for visual-language to video knowledge transfer

M Zhu, Z Wang, M Hu, R Dang, X Lin, X Zhou… - arXiv preprint arXiv …, 2024 - arxiv.org
Transferring visual-language knowledge from large-scale foundation models for video
recognition has proved to be effective. To bridge the domain gap, additional parametric …

VLAP: Efficient video-language alignment via frame prompting and distilling for video question answering

X Wang, J Liang, CK Wang, K Deng, Y Lou, MC Lin… - CoRR, 2023 - openreview.net
In this work, we propose an efficient Video-Language Alignment (ViLA) network. Our ViLA
model addresses both efficient frame sampling and effective cross-modal alignment in a …

Dynamic and compressive adaptation of transformers from images to videos

G Zhang, J Liu, S Cao, X Zhao, K Zhao, K Ma… - arXiv preprint arXiv …, 2024 - arxiv.org
Recently, the remarkable success of pre-trained Vision Transformers (ViTs) from image-text
matching has sparked interest in image-to-video adaptation. However, most current …