R²-Tuning: Efficient Image-to-Video Transfer Learning for Video Temporal Grounding

Y Liu, J He, W Li, J Kim, D Wei, H Pfister… - European Conference on …, 2024 - Springer
Video temporal grounding (VTG) is a fine-grained video understanding problem that aims to
ground relevant clips in untrimmed videos given natural language queries. Most existing …

Rethinking CLIP-based video learners in cross-domain open-vocabulary action recognition

KY Lin, H Ding, J Zhou, YM Tang, YX Peng… - arXiv preprint arXiv …, 2024 - arxiv.org
Building upon the impressive success of CLIP (Contrastive Language-Image Pretraining),
recent pioneering works have proposed to adapt the powerful CLIP to video data, leading to …

OST: Refining Text Knowledge with Optimal Spatio-Temporal Descriptor for General Video Recognition

T Chen, H Yu, Z Yang, Z Li, W Sun… - Proceedings of the …, 2024 - openaccess.thecvf.com
Due to the resource-intensive nature of training vision-language models on expansive video
data, a majority of studies have centered on adapting pre-trained image-language models to …

Side4Video: Spatial-temporal side network for memory-efficient image-to-video transfer learning

H Yao, W Wu, Z Li - arXiv preprint arXiv:2311.15769, 2023 - arxiv.org
Large pre-trained vision models achieve impressive success in computer vision. However,
fully fine-tuning large models for downstream tasks, particularly in video understanding, can …

Enhanced Motion-Text Alignment for Image-to-Video Transfer Learning

W Zhang, C Wan, T Liu, X Tian… - Proceedings of the …, 2024 - openaccess.thecvf.com
Extending large image-text pre-trained models (e.g., CLIP) for video understanding has made
significant advancements. To enable CLIP to perceive dynamic information …

Rethinking image-to-video adaptation: An object-centric perspective

R Qian, S Ding, D Lin - European Conference on Computer Vision, 2024 - Springer
Image-to-video adaptation seeks to efficiently adapt image models for use in the video
domain. Instead of fine-tuning the entire image backbone, many image-to-video adaptation …

Generating action-conditioned prompts for open-vocabulary video action recognition

C Jia, M Luo, X Chang, Z Dang, M Han… - Proceedings of the …, 2024 - dl.acm.org
Exploring open-vocabulary video action recognition is a promising venture, which aims to
recognize previously unseen actions within an arbitrary set of categories. Existing methods …

MoTE: Reconciling generalization with specialization for visual-language to video knowledge transfer

M Zhu, Z Wang, M Hu, R Dang, X Lin, X Zhou… - arXiv preprint arXiv …, 2024 - arxiv.org
Transferring visual-language knowledge from large-scale foundation models for video
recognition has proved to be effective. To bridge the domain gap, additional parametric …

VLAP: Efficient video-language alignment via frame prompting and distilling for video question answering

X Wang, J Liang, CK Wang, K Deng, Y Lou, MC Lin… - CoRR, 2023 - openreview.net
In this work, we propose an efficient Video-Language Alignment (ViLA) network. Our ViLA
model addresses both efficient frame sampling and effective cross-modal alignment in a …

Dynamic and compressive adaptation of transformers from images to videos

G Zhang, J Liu, S Cao, X Zhao, K Zhao, K Ma… - arXiv preprint arXiv …, 2024 - arxiv.org
Recently, the remarkable success of pre-trained Vision Transformers (ViTs) from image-text
matching has sparked interest in image-to-video adaptation. However, most current …