Multimodal machine learning: A survey and taxonomy

T Baltrušaitis, C Ahuja… - IEEE transactions on …, 2018 - ieeexplore.ieee.org
Our experience of the world is multimodal: we see objects, hear sounds, feel texture, smell
odors, and taste flavors. Modality refers to the way in which something happens or is …

Large-scale multi-modal pre-trained models: A comprehensive survey

X Wang, G Chen, G Qian, P Gao, XY Wei… - Machine Intelligence …, 2023 - Springer
With the urgent demand for generalized deep models, many pre-trained big models have been
proposed, such as bidirectional encoder representations (BERT), vision transformer (ViT) …

Evaluating object hallucination in large vision-language models

Y Li, Y Du, K Zhou, J Wang, WX Zhao… - arXiv preprint arXiv …, 2023 - arxiv.org
Inspired by the superior language abilities of large language models (LLM), large vision-
language models (LVLM) have been recently explored by integrating powerful LLMs for …

Learning transferable visual models from natural language supervision

A Radford, JW Kim, C Hallacy… - International …, 2021 - proceedings.mlr.press
State-of-the-art computer vision systems are trained to predict a fixed set of predetermined
object categories. This restricted form of supervision limits their generality and usability since …

Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning

P Sharma, N Ding, S Goodman… - Proceedings of the 56th …, 2018 - aclanthology.org
We present a new dataset of image caption annotations, Conceptual Captions, which
contains an order of magnitude more images than the MS-COCO dataset (Lin et al., 2014) …

Visual genome: Connecting language and vision using crowdsourced dense image annotations

R Krishna, Y Zhu, O Groth, J Johnson, K Hata… - International journal of …, 2017 - Springer
Despite progress in perceptual tasks such as image classification, computers still perform
poorly on cognitive tasks such as image description and question answering. Cognition is …

Supervised learning of universal sentence representations from natural language inference data

A Conneau, D Kiela, H Schwenk, L Barrault… - arXiv preprint arXiv …, 2017 - arxiv.org
Many modern NLP systems rely on word embeddings, previously trained in an unsupervised
manner on large corpora, as base features. Efforts to obtain embeddings for larger chunks of …

VQA: Visual question answering

S Antol, A Agrawal, J Lu, M Mitchell… - Proceedings of the …, 2015 - openaccess.thecvf.com
We propose the task of free-form and open-ended Visual Question Answering (VQA). Given
an image and a natural language question about the image, the task is to provide an …

Show, attend and tell: Neural image caption generation with visual attention

K Xu, J Ba, R Kiros, K Cho, A Courville… - International …, 2015 - proceedings.mlr.press
Inspired by recent work in machine translation and object detection, we introduce an
attention based model that automatically learns to describe the content of images. We …

Long-term recurrent convolutional networks for visual recognition and description

J Donahue, L Anne Hendricks… - Proceedings of the …, 2015 - openaccess.thecvf.com
Models comprised of deep convolutional network layers have dominated recent
image interpretation tasks; we investigate whether models which are also compositional, or …