Vision-language pre-training: Basics, recent advances, and future trends

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

Vid2Seq: Large-scale pretraining of a visual language model for dense video captioning

A Yang, A Nagrani, PH Seo, A Miech… - Proceedings of the …, 2023 - openaccess.thecvf.com
In this work, we introduce Vid2Seq, a multi-modal single-stage dense event captioning
model pretrained on narrated videos, which are readily available at scale. The Vid2Seq …
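
Vid2Seq's key idea is to quantize timestamps into special time tokens added to the text vocabulary, so event boundaries and captions are predicted as one token sequence. A minimal sketch of that quantization follows; the bin count, token names, and sequence layout are illustrative assumptions, not the paper's exact configuration.

```python
N_TIME_BINS = 100  # assumed number of time tokens added to the vocabulary

def time_token(t_sec: float, video_len_sec: float) -> str:
    """Map an absolute timestamp to a discrete time token."""
    bin_idx = min(int(t_sec / video_len_sec * N_TIME_BINS), N_TIME_BINS - 1)
    return f"<time_{bin_idx}>"

def build_target_sequence(events, video_len_sec):
    """Interleave start/end time tokens with caption text, one event after another."""
    parts = []
    for start, end, caption in sorted(events):
        parts += [time_token(start, video_len_sec),
                  time_token(end, video_len_sec),
                  caption]
    return " ".join(parts)

# Example: two annotated events in a 60-second video.
events = [(2.0, 10.5, "a person opens the fridge"),
          (12.0, 30.0, "they pour milk into a glass")]
print(build_target_sequence(events, 60.0))
# -> "<time_3> <time_17> a person opens the fridge <time_20> <time_50> they pour milk into a glass"
```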

Multimodal learning with transformers: A survey

P Xu, X Zhu, DA Clifton - IEEE Transactions on Pattern Analysis …, 2023 - ieeexplore.ieee.org
Transformer is a promising neural network learner and has achieved great success in
various machine learning tasks. Thanks to the recent prevalence of multimodal applications …
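
The core mechanism such surveys organize their taxonomies around is cross-modal attention, where one modality's tokens attend over another's. A minimal sketch, assuming arbitrary dimensions and a single PyTorch attention layer rather than any specific surveyed model:

```python
import torch
import torch.nn as nn

d_model, n_heads = 256, 4
cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

text_tokens = torch.randn(2, 16, d_model)   # (batch, text_len, dim)
video_tokens = torch.randn(2, 64, d_model)  # (batch, num_patches, dim)

# Text attends to video: queries come from one modality, keys/values from the other.
fused, attn_weights = cross_attn(query=text_tokens,
                                 key=video_tokens,
                                 value=video_tokens)
print(fused.shape)         # torch.Size([2, 16, 256])
print(attn_weights.shape)  # torch.Size([2, 16, 64])
```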

X-CLIP: End-to-end multi-grained contrastive learning for video-text retrieval

Y Ma, G Xu, X Sun, M Yan, J Zhang, R Ji - Proceedings of the 30th ACM …, 2022 - dl.acm.org
Video-text retrieval has been a crucial and fundamental task in multi-modal research. The
development of video-text retrieval has been considerably promoted by large-scale multi …
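
X-CLIP's multi-grained idea is to score not only video against sentence but also cross-grained pairs (video-word, frame-sentence) and fine-grained frame-word pairs, then fuse the similarities. A hedged sketch, assuming simple mean-based fusion in place of the paper's learned attention over similarity matrices:

```python
import torch
import torch.nn.functional as F

B, T, W, D = 4, 8, 12, 256                         # batch, frames, words, dim
frames = F.normalize(torch.randn(B, T, D), dim=-1)
words  = F.normalize(torch.randn(B, W, D), dim=-1)
video  = F.normalize(frames.mean(dim=1), dim=-1)   # coarse video embedding
sent   = F.normalize(words.mean(dim=1), dim=-1)    # coarse sentence embedding

s_vs = video @ sent.t()                                             # video-sentence
s_vw = torch.einsum('id,jwd->ijw', video, words).mean(-1)           # video-word
s_fs = torch.einsum('itd,jd->ijt', frames, sent).mean(-1)           # frame-sentence
s_fw = torch.einsum('itd,jwd->ijtw', frames, words).mean((-1, -2))  # frame-word

score = (s_vs + s_vw + s_fs + s_fw) / 4        # fused multi-grained similarity
labels = torch.arange(B)
loss = F.cross_entropy(score / 0.07, labels)   # InfoNCE over the fused scores
print(score.shape, loss.item())
```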

All in one: Exploring unified video-language pre-training

J Wang, Y Ge, R Yan, Y Ge, KQ Lin… - Proceedings of the …, 2023 - openaccess.thecvf.com
Mainstream Video-Language Pre-training models consist of three parts: a video
encoder, a text encoder, and a video-text fusion Transformer. They pursue better …
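
To make the contrast concrete: instead of the three-part design named above, a unified ("all in one") model feeds raw video patch tokens and text tokens through a single shared backbone. A minimal sketch, with layer sizes and the plain concatenation scheme as illustrative assumptions:

```python
import torch
import torch.nn as nn

d_model = 256
shared_backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
    num_layers=2,
)

video_tokens = torch.randn(2, 64, d_model)  # patch embeddings from raw frames
text_tokens = torch.randn(2, 16, d_model)   # word embeddings

# Unified route: one transformer sees both modalities at once.
joint = torch.cat([video_tokens, text_tokens], dim=1)  # (2, 80, d_model)
fused = shared_backbone(joint)
print(fused.shape)  # torch.Size([2, 80, 256])
```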

EgoVLPv2: Egocentric video-language pre-training with fusion in the backbone

S Pramanick, Y Song, S Nag, KQ Lin… - Proceedings of the …, 2023 - openaccess.thecvf.com
Video-language pre-training (VLP) has become increasingly important due to its ability to
generalize to various vision and language tasks. However, existing egocentric VLP …
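
The "fusion in the backbone" idea is to insert cross-attention sub-layers inside the encoder blocks themselves and make them switchable, so one backbone serves both dual-encoder retrieval and fused inference. A sketch under that assumption; the exact block layout here is illustrative, not EgoVLPv2's:

```python
import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, other=None):
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h)[0]
        if other is not None:  # fusion switched on: attend to the other modality
            x = x + self.cross_attn(self.norm2(x), other, other)[0]
        return x

block = FusionBlock()
video = torch.randn(2, 64, 256)
text = torch.randn(2, 16, 256)
print(block(video).shape)              # dual-encoder mode, no fusion
print(block(video, other=text).shape)  # fused mode, same weights
```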

Verbs in action: Improving verb understanding in video-language models

L Momeni, M Caron, A Nagrani… - Proceedings of the …, 2023 - openaccess.thecvf.com
Understanding verbs is crucial to modelling how people and objects interact with each other
and the environment through space and time. Recently, state-of-the-art video-language …
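
A common recipe for improving verb understanding, and the spirit of this paper's verb-focused contrastive objective, is to build hard negatives that differ from the true caption only in the verb. A toy sketch; the hand-written verb list and naive string replacement are stand-ins for the paper's language-model-generated negatives:

```python
import random

VERB_SWAPS = {"opens": ["closes", "breaks"],
              "pours": ["drinks", "spills"],
              "enters": ["leaves", "passes"]}

def verb_hard_negative(caption: str) -> str:
    """Replace the first known verb with a plausible alternative action."""
    words = caption.split()
    for i, w in enumerate(words):
        if w in VERB_SWAPS:
            words[i] = random.choice(VERB_SWAPS[w])
            return " ".join(words)
    return caption  # no known verb found; caller should skip this sample

random.seed(0)
print(verb_hard_negative("a person opens the fridge and pours milk"))
# e.g. "a person closes the fridge and pours milk"
```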

Video-text as game players: Hierarchical Banzhaf interaction for cross-modal representation learning

P Jin, J Huang, P Xiong, S Tian, C Liu… - Proceedings of the …, 2023 - openaccess.thecvf.com
Contrastive learning-based video-language representation learning approaches, e.g., CLIP,
have achieved outstanding performance; these methods pursue semantic interaction upon pre …
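
The Banzhaf value from cooperative game theory, which the paper's hierarchical interaction builds on, scores each "player" (e.g., a frame or a word) by its average marginal contribution over all coalitions of the other players. A worked toy example, with an assumed value function:

```python
from itertools import combinations

def banzhaf_value(players, v, i):
    """Average marginal contribution of player i over all coalitions without it."""
    others = [p for p in players if p != i]
    total = 0.0
    for r in range(len(others) + 1):
        for coalition in combinations(others, r):
            total += v(set(coalition) | {i}) - v(set(coalition))
    return total / 2 ** len(others)

# Toy value function: a coalition's "worth" is how many important tokens it holds.
important = {"frame_2", "word_dog"}
v = lambda S: len(S & important)

players = ["frame_1", "frame_2", "word_dog", "word_the"]
for p in players:
    print(p, banzhaf_value(players, v, p))
# frame_2 and word_dog get value 1.0; the uninformative tokens get 0.0.
```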

TS2-Net: Token shift and selection transformer for text-video retrieval

Y Liu, P Xiong, L Xu, S Cao, Q Jin - European conference on computer …, 2022 - Springer
Text-video retrieval is a task of great practical value and has received increasing attention;
learning spatial-temporal video representation is one of its research hotspots …
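
The token shift idea moves whole token features back and forth across adjacent frames, injecting temporal cues at zero parameter cost. A minimal sketch; which tokens move is fixed here, whereas TS2-Net also learns a selection module on top:

```python
import torch

def temporal_token_shift(x: torch.Tensor, n_shift: int = 4) -> torch.Tensor:
    """x: (batch, frames, tokens, dim). Move a few whole tokens across adjacent frames."""
    out = x.clone()
    out[:, 1:, :n_shift] = x[:, :-1, :n_shift]                         # shifted forward in time
    out[:, :-1, n_shift:2 * n_shift] = x[:, 1:, n_shift:2 * n_shift]  # shifted backward
    return out

x = torch.randn(2, 8, 50, 256)  # (batch, frames, patch tokens, dim)
print(temporal_token_shift(x).shape)  # torch.Size([2, 8, 50, 256])
```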

VIOLET: End-to-end video-language transformers with masked visual-token modeling

TJ Fu, L Li, Z Gan, K Lin, WY Wang, L Wang… - arXiv preprint arXiv …, 2021 - arxiv.org
A great challenge in video-language (VidL) modeling lies in the disconnection between
fixed video representations extracted from image/video understanding models and …
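
Masked visual-token modeling trains the model to predict discrete ids for masked video patches, with the ids supplied by a frozen quantizer (a discrete VAE in VIOLET). A minimal sketch where a random id table and a tiny head stand in for the real components:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

B, N, D, V = 2, 64, 256, 8192              # batch, patches, dim, visual-token vocab
patch_emb = torch.randn(B, N, D)
target_ids = torch.randint(0, V, (B, N))   # ids a frozen dVAE would assign

mask = torch.rand(B, N) < 0.15             # mask ~15% of patches
mask_token = nn.Parameter(torch.zeros(D))
inp = torch.where(mask.unsqueeze(-1), mask_token.expand(B, N, D), patch_emb)

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(D, nhead=4, batch_first=True), num_layers=2)
head = nn.Linear(D, V)                     # predicts discrete visual-token ids

logits = head(encoder(inp))                                # (B, N, V)
loss = F.cross_entropy(logits[mask], target_ids[mask])     # masked positions only
print(loss.item())
```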