InternVid: A large-scale video-text dataset for multimodal understanding and generation

Y Wang, Y He, Y Li, K Li, J Yu, X Ma, X Li… - arXiv preprint arXiv …, 2023 - arxiv.org
This paper introduces InternVid, a large-scale video-centric multimodal dataset that enables
learning powerful and transferable video-text representations for multimodal understanding …

Language-based action concept spaces improve video self-supervised learning

K Ranasinghe, MS Ryoo - Advances in Neural Information …, 2023 - proceedings.neurips.cc
Recent contrastive language image pre-training has led to learning highly transferable and
robust image representations. However, adapting these models to video domain with …

Multi-granularity correspondence learning from long-term noisy videos

Y Lin, J Zhang, Z Huang, J Liu, Z Wen… - arXiv preprint arXiv …, 2024 - arxiv.org
Existing video-language studies mainly focus on learning short video clips, leaving long-
term temporal dependencies rarely explored due to over-high computational cost of …

Mug-STAN: adapting image-language pretrained models for general video understanding

R Liu, J Huang, W Gao, TH Li, G Li - arXiv preprint arXiv:2311.15075, 2023 - arxiv.org
Large-scale image-language pretrained models, e.g., CLIP, have demonstrated remarkable
proficiency in acquiring general multi-modal knowledge through web-scale image-text data …

TVTSv2: Learning out-of-the-box spatiotemporal visual representations at scale

Z Zeng, Y Ge, Z Tong, X Liu, ST Xia, Y Shan - arXiv preprint arXiv …, 2023 - arxiv.org
The ultimate goal for foundation models is realizing task-agnostic, i.e., supporting
out-of-the-box usage without task-specific fine-tuning. Although breakthroughs have been made in …

Themis: A passive-active hybrid framework with in-network intelligence for lightweight failure localization

J Xiao, Q Li, D Zhao, X Zuo, W Tang, Y Jiang - Computer Networks, 2024 - Elsevier
The fast and efficient failure detection and localization is essential for stable network
transmission. Unfortunately, existing schemes suffer from a few drawbacks such as …

Video-Language Alignment via Spatio-Temporal Graph Transformer

SX Zhang, H Wang, X Zhu, W Gu, T Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Video-language alignment is a crucial multi-modal task that benefits various downstream
applications, e.g., video-text retrieval and video question answering. Existing methods either …

Concap: contrastive context-aware prompt for resource-hungry action recognition

H Zhang, Z Zeng, Q Zhao, Z Zhai - 2023 IEEE International …, 2023 - ieeexplore.ieee.org
Existing large-scale image-language pre-trained models, e.g., CLIP [1], have revealed strong
spatial recognition capability on various vision tasks. However, they achieve inferior …