Multimodal machine learning: A survey and taxonomy

T Baltrušaitis, C Ahuja… - IEEE transactions on …, 2018 - ieeexplore.ieee.org
Our experience of the world is multimodal: we see objects, hear sounds, feel texture, smell
odors, and taste flavors. Modality refers to the way in which something happens or is …

Large-scale multi-modal pre-trained models: A comprehensive survey

X Wang, G Chen, G Qian, P Gao, XY Wei… - Machine Intelligence …, 2023 - Springer
With the urgent demand for generalized deep models, many pre-trained big models have been
proposed, such as bidirectional encoder representations (BERT), vision transformer (ViT) …

Evaluating object hallucination in large vision-language models

Y Li, Y Du, K Zhou, J Wang, WX Zhao… - arXiv preprint arXiv …, 2023 - arxiv.org
Inspired by the superior language abilities of large language models (LLM), large vision-
language models (LVLM) have been recently explored by integrating powerful LLMs for …

Learning transferable visual models from natural language supervision

A Radford, JW Kim, C Hallacy… - International …, 2021 - proceedings.mlr.press
State-of-the-art computer vision systems are trained to predict a fixed set of predetermined
object categories. This restricted form of supervision limits their generality and usability since …

Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning

P Sharma, N Ding, S Goodman… - Proceedings of the 56th …, 2018 - aclanthology.org
We present a new dataset of image caption annotations, Conceptual Captions, which
contains an order of magnitude more images than the MS-COCO dataset (Lin et al., 2014) …

Visual genome: Connecting language and vision using crowdsourced dense image annotations

R Krishna, Y Zhu, O Groth, J Johnson, K Hata… - International journal of …, 2017 - Springer
Despite progress in perceptual tasks such as image classification, computers still perform
poorly on cognitive tasks such as image description and question answering. Cognition is …

Supervised learning of universal sentence representations from natural language inference data

A Conneau, D Kiela, H Schwenk, L Barrault… - arXiv preprint arXiv …, 2017 - arxiv.org
Many modern NLP systems rely on word embeddings, previously trained in an unsupervised
manner on large corpora, as base features. Efforts to obtain embeddings for larger chunks of …

VQA: Visual question answering

S Antol, A Agrawal, J Lu, M Mitchell… - Proceedings of the …, 2015 - openaccess.thecvf.com
We propose the task of free-form and open-ended Visual Question Answering (VQA). Given
an image and a natural language question about the image, the task is to provide an …

Show, attend and tell: Neural image caption generation with visual attention

K Xu, J Ba, R Kiros, K Cho, A Courville… - International …, 2015 - proceedings.mlr.press
Inspired by recent work in machine translation and object detection, we introduce an
attention based model that automatically learns to describe the content of images. We …

Long-term recurrent convolutional networks for visual recognition and description

J Donahue, L Anne Hendricks… - Proceedings of the …, 2015 - openaccess.thecvf.com
Models comprised of deep convolutional network layers have dominated recent
image interpretation tasks; we investigate whether models which are also compositional, or …