Google Академик

Gpt (generative pre-trained transformer)–a comprehensive review on enabling technologies, potential applications, emerging challenges, and future directions

G Yenduri, M Ramalingam, GC Selvi, Y Supriya… - IEEE …, 2024 - ieeexplore.ieee.org

The Generative Pre-trained Transformer (GPT) represents a notable breakthrough in the
domain of natural language processing, which is propelling us toward the development of …

Сачувај Цитирај 208 пута наведен Сродни чланци Све верзије (4)

[Free GPT-4]
[DeepSeek]

[PDF] nowpublishers.com

Vision-language pre-training: Basics, recent advances, and future trends

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com

This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

Сачувај Цитирај 198 пута наведен Сродни чланци Све верзије (7) Претрага библиотека HTML верзија

[Free GPT-4]
[DeepSeek]

[PDF] thecvf.com

Panda-70m: Captioning 70m videos with multiple cross-modality teachers

TS Chen, A Siarohin, W Menapace… - Proceedings of the …, 2024 - openaccess.thecvf.com

The quality of the data and annotation upper-bounds the quality of a downstream model.
While there exist large text corpora and image-text pairs high-quality video-text data is much …

Сачувај Цитирај 143 пута наведен Сродни чланци Све верзије (8) HTML верзија

[Free GPT-4]
[DeepSeek]

[PDF] thecvf.com

Vid2seq: Large-scale pretraining of a visual language model for dense video captioning

A Yang, A Nagrani, PH Seo, A Miech… - Proceedings of the …, 2023 - openaccess.thecvf.com

In this work, we introduce Vid2Seq, a multi-modal single-stage dense event captioning
model pretrained on narrated videos which are readily-available at scale. The Vid2Seq …

Сачувај Цитирај 243 пута наведен Сродни чланци Све верзије (19) HTML верзија

[Free GPT-4]
[DeepSeek]

[PDF] neurips.cc

Any-to-any generation via composable diffusion

Z Tang, Z Yang, C Zhu, M Zeng… - Advances in Neural …, 2023 - proceedings.neurips.cc

Abstract We present Composable Diffusion (CoDi), a novel generative model capable of
generating any combination of output modalities, such as language, image, video, or audio …

Сачувај Цитирај 147 пута наведен Сродни чланци Све верзије (8) HTML верзија

[Free GPT-4]
[DeepSeek]

[PDF] arxiv.org

Git: A generative image-to-text transformer for vision and language

J Wang, Z Yang, X Hu, L Li, K Lin, Z Gan, Z Liu… - arxiv preprint arxiv …, 2022 - arxiv.org

In this paper, we design and train a Generative Image-to-text Transformer, GIT, to unify
vision-language tasks such as image/video captioning and question answering. While …

Сачувај Цитирај 572 пута наведен Сродни чланци Све верзије (4) HTML верзија

[Free GPT-4]
[DeepSeek]

[PDF] thecvf.com

Learning video representations from large language models

Y Zhao, I Misra, P Krähenbühl… - Proceedings of the …, 2023 - openaccess.thecvf.com

We introduce LAVILA, a new approach to learning video-language representations by
leveraging Large Language Models (LLMs). We repurpose pre-trained LLMs to be …

Сачувај Цитирај 179 пута наведен Сродни чланци Све верзије (8) HTML верзија

[Free GPT-4]
[DeepSeek]

[PDF] springer.com

Large-scale multi-modal pre-trained models: A comprehensive survey

X Wang, G Chen, G Qian, P Gao, XY Wei… - Machine Intelligence …, 2023 - Springer

With the urgent demand for generalized deep models, many pre-trained big models are
proposed, such as bidirectional encoder representations (BERT), vision transformer (ViT) …

Сачувај Цитирај 199 пута наведен Сродни чланци Све верзије (8)

[Free GPT-4]
[DeepSeek]

[PDF] mlr.press

mplug-2: A modularized multi-modal foundation model across text, image and video

H Xu, Q Ye, M Yan, Y Shi, J Ye, Y Xu… - International …, 2023 - proceedings.mlr.press

Recent years have witnessed a big convergence of language, vision, and multi-modal
pretraining. In this work, we present mPLUG-2, a new unified paradigm with modularized …

Сачувај Цитирај 134 пута наведен Сродни чланци Све верзије (6) HTML верзија

[Free GPT-4]
[DeepSeek]

[PDF] arxiv.org

Pali-x: On scaling up a multilingual vision and language model

X Chen, J Djolonga, P Padlewski, B Mustafa… - arxiv preprint arxiv …, 2023 - arxiv.org

We present the training recipe and results of scaling up PaLI-X, a multilingual vision and
language model, both in terms of size of the components and the breadth of its training task …

Сачувај Цитирај 175 пута наведен Сродни чланци Све верзије (3) HTML верзија

Направи обавештење

Цитирај

Напредна претрага

Сачувано у мојој библиотеци

End-to-end generative pretraining for multimodal video captioning

Gpt (generative pre-trained transformer)–a comprehensive review on enabling technologies, potential applications, emerging challenges, and future directions

Vision-language pre-training: Basics, recent advances, and future trends

Panda-70m: Captioning 70m videos with multiple cross-modality teachers

Vid2seq: Large-scale pretraining of a visual language model for dense video captioning

Any-to-any generation via composable diffusion

Git: A generative image-to-text transformer for vision and language

Learning video representations from large language models

Large-scale multi-modal pre-trained models: A comprehensive survey

mplug-2: A modularized multi-modal foundation model across text, image and video

Pali-x: On scaling up a multilingual vision and language model