- Academic Search

Vision-language pre-training: Basics, recent advances, and future trends

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com

This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

Cited by 197 · 7 versions

[PDF] arxiv.org

Self-supervised learning for videos: A survey

MC Schiappa, YS Rawat, M Shah - ACM Computing Surveys, 2023 - dl.acm.org

The remarkable success of deep learning in various domains relies on the availability of
large-scale annotated datasets. However, obtaining annotations is expensive and requires …

Cited by 154 · 4 versions

[PDF] neurips.cc

Motiongpt: Human motion as a foreign language

B Jiang, X Chen, W Liu, J Yu, G Yu… - Advances in Neural …, 2023 - proceedings.neurips.cc

Though the advancement of pre-trained large language models unfolds, the exploration of
building a unified model for language and other multimodal data, such as motion, remains …

Cited by 258 · 5 versions

[PDF] arxiv.org

Videochat: Chat-centric video understanding

KC Li, Y He, Y Wang, Y Li, W Wang, P Luo… - arXiv …

… an end-to-end chat-centric video
understanding system, coined as VideoChat. It integrates video foundation models and …

Cited by 585 · 4 versions

[PDF] arxiv.org

Videomamba: State space model for efficient video understanding

K Li, X Li, Y Wang, Y He, Y Wang, L Wang… - European Conference on …, 2024 - Springer

Addressing the dual challenges of local redundancy and global dependencies in video
understanding, this work innovatively adapts the Mamba to the video domain. The proposed …

Cited by 147 · 2 versions

Internvideo2: Scaling foundation models for multimodal video understanding

Y Wang, K Li, X Li, J Yu, Y He, G Chen, B Pei… - … on Computer Vision, 2024 - Springer

We introduce InternVideo2, a new family of video foundation models (ViFM) that achieve the
state-of-the-art results in video recognition, video-text tasks, and video-centric dialogue. Our …

Cited by 118 · 3 versions

[PDF] thecvf.com

Vid2seq: Large-scale pretraining of a visual language model for dense video captioning

A Yang, A Nagrani, PH Seo, A Miech… - Proceedings of the …, 2023 - openaccess.thecvf.com

In this work, we introduce Vid2Seq, a multi-modal single-stage dense event captioning
model pretrained on narrated videos which are readily-available at scale. The Vid2Seq …

Cited by 237 · 26 versions

[PDF] arxiv.org

Long-clip: Unlocking the long-text capability of clip

B Zhang, P Zhang, X Dong, Y Zang, J Wang - European Conference on …, 2024 - Springer

Contrastive Language-Image Pre-training (CLIP) has been the cornerstone for zero-shot
classification, text-image retrieval, and text-image generation by aligning image and …

Cited by 81 · 2 versions

[PDF] thecvf.com

Moviechat: From dense token to sparse memory for long video understanding

E Song, W Chai, G Wang, Y Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com

Recently integrating video foundation models and large language models to build a video
understanding system can overcome the limitations of specific pre-defined vision tasks. Yet …

Cited by 179 · 3 versions

[PDF] ieee.org

Multimodal learning with transformers: A survey

P Xu, X Zhu, DA Clifton - IEEE Transactions on Pattern Analysis …, 2023 - ieeexplore.ieee.org

Transformer is a promising neural network learner, and has achieved great success in
various machine learning tasks. Thanks to the recent prevalence of multimodal applications …

Cited by 629 · 9 versions


Videoclip: Contrastive pre-training for zero-shot video-text understanding

Vision-language pre-training: Basics, recent advances, and future trends
