Vision-language pre-training: Basics, recent advances, and future trends

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

From show to tell: A survey on deep learning-based image captioning

M Stefanini, M Cornia, L Baraldi… - IEEE transactions on …, 2022 - ieeexplore.ieee.org
Connecting Vision and Language plays an essential role in Generative Intelligence. For this
reason, large research efforts have been devoted to image captioning, i.e., describing images …

TM2T: Stochastic and tokenized modeling for the reciprocal generation of 3d human motions and texts

C Guo, X Zuo, S Wang, L Cheng - European Conference on Computer …, 2022 - Springer
Inspired by the strong ties between vision and language, the two intimate human sensing
and communication modalities, our paper aims to explore the generation of 3D human full …

Multi-modal knowledge graph construction and application: A survey

X Zhu, Z Li, X Wang, X Jiang, P Sun… - … on Knowledge and …, 2022 - ieeexplore.ieee.org
Recent years have witnessed the resurgence of knowledge engineering which is featured
by the fast growth of knowledge graphs. However, most of existing knowledge graphs are …

RSTNet: Captioning with adaptive attention on visual and non-visual words

X Zhang, X Sun, Y Luo, J Ji, Y Zhou… - Proceedings of the …, 2021 - openaccess.thecvf.com
Recent progress on visual question answering has explored the merits of grid features for
vision language tasks. Meanwhile, transformer-based models have shown remarkable …

Attention on attention for image captioning

L Huang, W Wang, J Chen… - Proceedings of the IEEE …, 2019 - openaccess.thecvf.com
Attention mechanisms are widely used in current encoder/decoder frameworks of image
captioning, where a weighted average on encoded vectors is generated at each time step to …

VisualGPT: Data-efficient adaptation of pretrained language models for image captioning

J Chen, H Guo, K Yi, B Li… - Proceedings of the IEEE …, 2022 - openaccess.thecvf.com
The limited availability of annotated data often hinders real-world applications of machine
learning. To efficiently learn from small quantities of multimodal data, we leverage the …

VideoBERT: A joint model for video and language representation learning

C Sun, A Myers, C Vondrick… - Proceedings of the …, 2019 - openaccess.thecvf.com
Self-supervised learning has become increasingly important to leverage the abundance of
unlabeled data available on platforms like YouTube. Whereas most existing approaches …

Learning conditional attributes for compositional zero-shot learning

Q Wang, L Liu, C Jing, H Chen… - Proceedings of the …, 2023 - openaccess.thecvf.com
Compositional Zero-Shot Learning (CZSL) aims to train models to recognize novel
compositional concepts based on learned concepts such as attribute-object combinations …

Remote sensing image change captioning with dual-branch transformers: A new method and a large scale dataset

C Liu, R Zhao, H Chen, Z Zou… - IEEE Transactions on …, 2022 - ieeexplore.ieee.org
Analyzing land cover changes with multitemporal remote sensing (RS) images is crucial for
environmental protection and land planning. In this article, we explore RS image change …