Visual question answering: A survey of methods and datasets

Q Wu, D Teney, P Wang, C Shen, A Dick… - Computer Vision and …, 2017 - Elsevier
Visual Question Answering (VQA) is a challenging task that has received increasing
attention from both the computer vision and the natural language processing communities …

From image to language: A critical analysis of visual question answering (VQA) approaches, challenges, and opportunities

MF Ishmam, MSH Shovon, MF Mridha, N Dey - Information Fusion, 2024 - Elsevier
The multimodal task of Visual Question Answering (VQA), encompassing elements of
Computer Vision (CV) and Natural Language Processing (NLP), aims to generate answers …

Evaluating object hallucination in large vision-language models

Y Li, Y Du, K Zhou, J Wang, WX Zhao… - arxiv preprint arxiv …, 2023 - arxiv.org
Inspired by the superior language abilities of large language models (LLMs), large vision-language
models (LVLMs) have recently been explored by integrating powerful LLMs for …

3D-LLM: Injecting the 3D world into large language models

Y Hong, H Zhen, P Chen, S Zheng… - Advances in …, 2023 - proceedings.neurips.cc
Large language models (LLMs) and Vision-Language Models (VLMs) have proven to
excel at multiple tasks, such as commonsense reasoning. Powerful as these models can be …

Learn to explain: Multimodal reasoning via thought chains for science question answering

P Lu, S Mishra, T Xia, L Qiu… - Advances in …, 2022 - proceedings.neurips.cc
When answering a question, humans utilize the information available across different
modalities to synthesize a consistent and complete chain of thought (CoT). This process is …

Learning to answer questions in dynamic audio-visual scenarios

G Li, Y Wei, Y Tian, C Xu, JR Wen… - Proceedings of the …, 2022 - openaccess.thecvf.com
In this paper, we focus on the Audio-Visual Question Answering (AVQA) task, which aims to
answer questions regarding different visual objects, sounds, and their associations in …

Unified vision-language pre-training for image captioning and VQA

L Zhou, H Palangi, L Zhang, H Hu, J Corso… - Proceedings of the AAAI …, 2020 - ojs.aaai.org
This paper presents a unified Vision-Language Pre-training (VLP) model. The model is
unified in that (1) it can be fine-tuned for either vision-language generation (e.g., image …

Counterfactual VQA: A cause-effect look at language bias

Y Niu, K Tang, H Zhang, Z Lu… - Proceedings of the …, 2021 - openaccess.thecvf.com
Recent VQA models tend to rely on language bias as a shortcut and thus fail to
sufficiently learn the multi-modal knowledge from both vision and language. In this paper …

GQA: A new dataset for real-world visual reasoning and compositional question answering

DA Hudson, CD Manning - … of the IEEE/CVF conference on …, 2019 - openaccess.thecvf.com
We introduce GQA, a new dataset for real-world visual reasoning and compositional
question answering, seeking to address key shortcomings of previous VQA datasets. We …

What matters in training a GPT4-style language model with multimodal inputs?

Y Zeng, H Zhang, J Zheng, J Xia, G Wei… - Proceedings of the …, 2024 - aclanthology.org
Recent advancements in GPT-4V have displayed remarkable multi-modal capabilities in
processing image inputs and following open-ended instructions. Despite these …