- Academic Search

Vision-language pre-training: Basics, recent advances, and future trends

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com

This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

Cited by 199 · All 7 versions

[PDF] springer.com

Large-scale multi-modal pre-trained models: A comprehensive survey

X Wang, G Chen, G Qian, P Gao, XY Wei… - Machine Intelligence …, 2023 - Springer

With the urgent demand for generalized deep models, many pre-trained big models are
proposed, such as bidirectional encoder representations (BERT), vision transformer (ViT) …

Cited by 195 · All 8 versions

[PDF] thecvf.com

Vid2Seq: Large-scale pretraining of a visual language model for dense video captioning

A Yang, A Nagrani, PH Seo, A Miech… - Proceedings of the …, 2023 - openaccess.thecvf.com

In this work, we introduce Vid2Seq, a multi-modal single-stage dense event captioning
model pretrained on narrated videos which are readily-available at scale. The Vid2Seq …

Cited by 238 · All 26 versions

[PDF] ieee.org

Multimodal learning with transformers: A survey

P Xu, X Zhu, DA Clifton - IEEE Transactions on Pattern Analysis …, 2023 - ieeexplore.ieee.org

Transformer is a promising neural network learner, and has achieved great success in
various machine learning tasks. Thanks to the recent prevalence of multimodal applications …

Cited by 637 · All 9 versions

[PDF] nowpublishers.com

Multimodal foundation models: From specialists to general-purpose assistants

C Li, Z Gan, Z Yang, J Yang, L Li… - … and Trends® in …, 2024 - nowpublishers.com

This paper presents a comprehensive survey of the taxonomy and evolution of multimodal
foundation models that demonstrate vision and vision-language capabilities, focusing on …

Cited by 219 · All 6 versions

[PDF] arxiv.org

LayoutLMv3: Pre-training for Document AI with unified text and image masking

Y Huang, T Lv, L Cui, Y Lu, F Wei - Proceedings of the 30th ACM …, 2022 - dl.acm.org

Self-supervised pre-training techniques have achieved remarkable progress in Document
AI. Most multimodal pre-trained models use a masked language modeling objective to learn …

Cited by 473 · All 3 versions

[PDF] thecvf.com

FLAVA: A foundational language and vision alignment model

A Singh, R Hu, V Goswami… - Proceedings of the …, 2022 - openaccess.thecvf.com

State-of-the-art vision and vision-and-language models rely on large-scale visio-linguistic
pretraining for obtaining good performance on a variety of downstream tasks. Generally …

Cited by 739 · All 6 versions

[PDF] thecvf.com

ULIP-2: Towards scalable multimodal pre-training for 3D understanding

L Xue, N Yu, S Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com

Recent advancements in multimodal pre-training have shown promising efficacy in 3D
representation learning by aligning multimodal features across 3D shapes, their 2D …

Cited by 99 · All 3 versions

[PDF] arxiv.org

On the opportunities and risks of foundation models

R Bommasani, DA Hudson, E Adeli, R Altman… - arXiv preprint arXiv …, 2021 - arxiv.org

AI is undergoing a paradigm shift with the rise of models (e.g., BERT, DALL-E, GPT-3) that are
trained on broad data at scale and are adaptable to a wide range of downstream tasks. We …

Cited by 4757 · All 2 versions

[PDF] thecvf.com

Vision-language pre-training with triple contrastive learning

J Yang, J Duan, S Tran, Y Xu… - Proceedings of the …, 2022 - openaccess.thecvf.com

Vision-language representation learning largely benefits from image-text alignment through
contrastive losses (e.g., InfoNCE loss). The success of this alignment strategy is attributed to …

Cited by 319 · All 8 versions
