- Academic Search

Internvid: A large-scale video-text dataset for multimodal understanding and generation

Y Wang, Y He, Y Li, K Li, J Yu, X Ma, X Li… - arXiv preprint arXiv …, 2023 - arxiv.org

This paper introduces InternVid, a large-scale video-centric multimodal dataset that enables
learning powerful and transferable video-text representations for multimodal understanding …‏

Cited by 235, all 5 versions

[PDF] neurips.cc

Language-based action concept spaces improve video self-supervised learning‏

K Ranasinghe, MS Ryoo - Advances in Neural Information …, 2023‏ - proceedings.neurips.cc‏

Recent contrastive language image pre-training has led to learning highly transferable and
robust image representations. However, adapting these models to video domain with …‏

Cited by 13, all 6 versions

[PDF] arxiv.org

Multi-granularity correspondence learning from long-term noisy videos‏

Y Lin, J Zhang, Z Huang, J Liu, Z Wen… - arXiv preprint arXiv …, 2024 - arxiv.org

Existing video-language studies mainly focus on learning short video clips, leaving
long-term temporal dependencies rarely explored due to over-high computational cost of …

Cited by 15, all 4 versions

[PDF] arxiv.org

Mug-STAN: adapting image-language pretrained models for general video understanding‏

R Liu, J Huang, W Gao, TH Li, G Li - arXiv preprint arXiv:2311.15075, 2023 - arxiv.org

Large-scale image-language pretrained models, e.g., CLIP, have demonstrated remarkable
proficiency in acquiring general multi-modal knowledge through web-scale image-text data …

Cited by 12, all 3 versions

[PDF] arxiv.org

Tvtsv2: Learning out-of-the-box spatiotemporal visual representations at scale‏

Z Zeng, Y Ge, Z Tong, X Liu, ST Xia, Y Shan - arXiv preprint arXiv …, 2023 - arxiv.org

The ultimate goal for foundation models is realizing task-agnostic, i.e., supporting
out-of-the-box usage without task-specific fine-tuning. Although breakthroughs have been made in …

Cited by 8, all 3 versions

[PDF] ssrn.com

Themis: A passive-active hybrid framework with in-network intelligence for lightweight failure localization‏

J Xiao, Q Li, D Zhao, X Zuo, W Tang, Y Jiang - Computer Networks, 2024 - Elsevier

The fast and efficient failure detection and localization is essential for stable network
transmission. Unfortunately, existing schemes suffer from a few drawbacks such as …‏

Cited by 2, all 2 versions

[PDF] arxiv.org

Video-Language Alignment via Spatio-Temporal Graph Transformer‏

SX Zhang, H Wang, X Zhu, W Gu, T Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org

Video-language alignment is a crucial multi-modal task that benefits various downstream
applications, e.g., video-text retrieval and video question answering. Existing methods either …

All 2 versions

Concap: contrastive context-aware prompt for resource-hungry action recognition‏

H Zhang, Z Zeng, Q Zhao, Z Zhai - 2023 IEEE International …, 2023‏ - ieeexplore.ieee.org‏

Existing large-scale image-language pre-trained models, e.g., CLIP [1], have revealed strong
spatial recognition capability on various vision tasks. However, they achieve inferior …

Cited by 1, all 3 versions

Learning transferable spatiotemporal representations from natural script knowledge
