R²-Tuning: Efficient Image-to-Video Transfer Learning for Video Temporal Grounding
Video temporal grounding (VTG) is a fine-grained video understanding problem that aims to
ground relevant clips in untrimmed videos given natural language queries. Most existing …
Rethinking CLIP-based video learners in cross-domain open-vocabulary action recognition
Building upon the impressive success of CLIP (Contrastive Language-Image Pretraining),
recent pioneering works have proposed adapting the powerful CLIP to video data, leading to …
OST: Refining Text Knowledge with Optimal Spatio-Temporal Descriptor for General Video Recognition
Due to the resource-intensive nature of training vision-language models on expansive video
data, a majority of studies have centered on adapting pre-trained image-language models to …
Side4Video: Spatial-temporal side network for memory-efficient image-to-video transfer learning
Large pre-trained vision models achieve impressive success in computer vision. However,
fully fine-tuning large models for downstream tasks, particularly in video understanding, can …
Enhanced Motion-Text Alignment for Image-to-Video Transfer Learning
Extending large image-text pre-trained models (e.g., CLIP) for video understanding has made
significant advancements. To enable the capability of CLIP to perceive dynamic information …
Rethinking image-to-video adaptation: An object-centric perspective
Image-to-video adaptation seeks to efficiently adapt image models for use in the video
domain. Instead of fine-tuning the entire image backbone, many image-to-video adaptation …
Generating action-conditioned prompts for open-vocabulary video action recognition
Open-vocabulary video action recognition is a promising venture that aims to
recognize previously unseen actions within an arbitrary set of categories. Existing methods …
MoTE: Reconciling generalization with specialization for visual-language to video knowledge transfer
Transferring visual-language knowledge from large-scale foundation models for video
recognition has proved to be effective. To bridge the domain gap, additional parametric …
ViLA: Efficient video-language alignment via frame prompting and distilling for video question answering
In this work, we propose an efficient Video-Language Alignment (ViLA) network. Our ViLA
model addresses both efficient frame sampling and effective cross-modal alignment in a …
Dynamic and compressive adaptation of transformers from images to videos
Recently, the remarkable success of pre-trained Vision Transformers (ViTs) from image-text
matching has sparked an interest in image-to-video adaptation. However, most current …