Visual question answering: A survey of methods and datasets

Q Wu, D Teney, P Wang, C Shen, A Dick… - Computer Vision and …, 2017 - Elsevier
Visual Question Answering (VQA) is a challenging task that has received increasing
attention from both the computer vision and the natural language processing communities …

From image to language: A critical analysis of visual question answering (VQA) approaches, challenges, and opportunities

MF Ishmam, MSH Shovon, MF Mridha, N Dey - Information Fusion, 2024 - Elsevier
The multimodal task of Visual Question Answering (VQA), encompassing elements of
Computer Vision (CV) and Natural Language Processing (NLP), aims to generate answers …

Evaluating object hallucination in large vision-language models

Y Li, Y Du, K Zhou, J Wang, WX Zhao… - arxiv preprint arxiv …, 2023 - arxiv.org
Inspired by the superior language abilities of large language models (LLMs), large vision-language
models (LVLMs) have recently been explored by integrating powerful LLMs for …

3D-LLM: Injecting the 3D world into large language models

Y Hong, H Zhen, P Chen, S Zheng… - Advances in …, 2023 - proceedings.neurips.cc
Large language models (LLMs) and Vision-Language Models (VLMs) have proven to
excel at multiple tasks, such as commonsense reasoning. Powerful as these models can be …

Learn to explain: Multimodal reasoning via thought chains for science question answering

P Lu, S Mishra, T Xia, L Qiu… - Advances in …, 2022 - proceedings.neurips.cc
When answering a question, humans utilize the information available across different
modalities to synthesize a consistent and complete chain of thought (CoT). This process is …

Learning to answer questions in dynamic audio-visual scenarios

G Li, Y Wei, Y Tian, C Xu, JR Wen… - Proceedings of the …, 2022 - openaccess.thecvf.com
In this paper, we focus on the Audio-Visual Question Answering (AVQA) task, which aims to
answer questions regarding different visual objects, sounds, and their associations in …

Unified vision-language pre-training for image captioning and VQA

L Zhou, H Palangi, L Zhang, H Hu, J Corso… - Proceedings of the AAAI …, 2020 - ojs.aaai.org
This paper presents a unified Vision-Language Pre-training (VLP) model. The model is
unified in that (1) it can be fine-tuned for either vision-language generation (e.g., image …

Counterfactual VQA: A cause-effect look at language bias

Y Niu, K Tang, H Zhang, Z Lu… - Proceedings of the …, 2021 - openaccess.thecvf.com
Recent VQA models tend to rely on language bias as a shortcut and thus fail to
sufficiently learn the multi-modal knowledge from both vision and language. In this paper …

GQA: A new dataset for real-world visual reasoning and compositional question answering

DA Hudson, CD Manning - … of the IEEE/CVF conference on …, 2019 - openaccess.thecvf.com
We introduce GQA, a new dataset for real-world visual reasoning and compositional
question answering, seeking to address key shortcomings of previous VQA datasets. We …

What matters in training a GPT4-style language model with multimodal inputs?

Y Zeng, H Zhang, J Zheng, J Xia, G Wei… - Proceedings of the …, 2024 - aclanthology.org
Recent advancements in GPT-4V have displayed remarkable multi-modal capabilities in
processing image inputs and following open-ended instructions. Despite these …