Google Scholar

Vision-language pre-training: Basics, recent advances, and future trends

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com

This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

Cited by 197 - All 7 versions

[PDF] arxiv.org

Multimodal intelligence: Representation learning, information fusion, and applications

C Zhang, Z Yang, X He, L Deng - IEEE Journal of Selected …, 2020 - ieeexplore.ieee.org

Deep learning methods have revolutionized speech recognition, image recognition, and
natural language processing since 2010. Each of these tasks involves a single modality in …

Cited by 434 - All 3 versions

[PDF] springer.com

Multiscale feature extraction and fusion of image and text in VQA

S Lu, Y Ding, M Liu, Z Yin, L Yin, W Zheng - International Journal of …, 2023 - Springer

The Visual Question Answering (VQA) system is the process of finding useful
information from images related to the question to answer the question correctly. It can be …

Cited by 203 - All 5 versions

[PDF] thecvf.com

Deep modular co-attention networks for visual question answering

Z Yu, J Yu, Y Cui, D Tao, Q Tian - Proceedings of the IEEE …, 2019 - openaccess.thecvf.com

Visual Question Answering (VQA) requires a fine-grained and simultaneous
understanding of both the visual content of images and the textual content of questions …

Cited by 1052 - All 11 versions

[PDF] researchgate.net

Attention, please! A survey of neural attention models in deep learning

A de Santana Correia, EL Colombini - Artificial Intelligence Review, 2022 - Springer

In humans, Attention is a core property of all perceptual and cognitive operations. Given our
limited ability to process competing sources, attention mechanisms select, modulate, and …

Cited by 227 - All 8 versions

[PDF] thecvf.com

Image retrieval on real-life images with pre-trained vision-and-language models

Z Liu, C Rodriguez-Opazo… - Proceedings of the …, 2021 - openaccess.thecvf.com

We extend the task of composed image retrieval, where an input query consists of an image
and short textual description of how to modify the image. Existing methods have only been …

Cited by 190 - All 7 versions

[PDF] neurips.cc

Bilinear attention networks

JH Kim, J Jun, BT Zhang - Advances in neural information …, 2018 - proceedings.neurips.cc

Attention networks in multimodal learning provide an efficient way to utilize given visual
information selectively. However, the computational cost to learn attention distributions for …

Cited by 1106 - All 8 versions

[PDF] thecvf.com

Residual attention network for image classification

F Wang, M Jiang, C Qian, S Yang… - Proceedings of the …, 2017 - openaccess.thecvf.com

In this work, we propose "Residual Attention Network", a convolutional neural network using
attention mechanism which can incorporate with state-of-art feed forward network …

Cited by 4573 - All 10 versions

Deep multimodal learning: A survey on recent advances and trends

D Ramachandram, GW Taylor - IEEE signal processing …, 2017 - ieeexplore.ieee.org

The success of deep learning has been a catalyst to solving increasingly complex
machine-learning problems, which often involve multiple data modalities. We review recent advances …

Cited by 1041 - All 3 versions

[PDF] thecvf.com

FashionVLP: Vision language transformer for fashion retrieval with feedback

S Goenka, Z Zheng, A Jaiswal… - Proceedings of the …, 2022 - openaccess.thecvf.com

Fashion image retrieval based on a query pair of reference image and natural language
feedback is a challenging task that requires models to assess fashion related information …

Cited by 97 - All 5 versions
