Knowledge graphs meet multi-modal learning: A comprehensive survey

Z Chen, Y Zhang, Y Fang, Y Geng, L Guo… - arXiv preprint arXiv…, 2024 - arxiv.org
Knowledge Graphs (KGs) play a pivotal role in advancing various AI applications, with the
semantic web community's exploration into multi-modal dimensions unlocking new avenues …

From image to language: A critical analysis of visual question answering (VQA) approaches, challenges, and opportunities

MF Ishmam, MSH Shovon, MF Mridha, N Dey - Information Fusion, 2024 - Elsevier
The multimodal task of Visual Question Answering (VQA), encompassing elements of
Computer Vision (CV) and Natural Language Processing (NLP), aims to generate answers …

A-OKVQA: A benchmark for visual question answering using world knowledge

D Schwenk, A Khandelwal, C Clark, K Marino… - European Conference on …, 2022 - Springer
The Visual Question Answering (VQA) task aspires to provide a meaningful testbed
for the development of AI models that can jointly reason over visual and natural language …

Wiki-LLaVA: Hierarchical retrieval-augmented generation for multimodal LLMs

D Caffagni, F Cocchi, N Moratelli… - Proceedings of the …, 2024 - openaccess.thecvf.com
Multimodal LLMs are the natural evolution of LLMs and enlarge their capabilities so as to
work beyond the pure textual modality. As research is being carried out to design novel …

Transform-Retrieve-Generate: Natural language-centric outside-knowledge visual question answering

F Gao, Q Ping, G Thattai, A Reganti… - Proceedings of the …, 2022 - openaccess.thecvf.com
Outside-knowledge visual question answering (OK-VQA) requires the agent to comprehend
the image, make use of relevant knowledge from the entire web, and digest all the …

Can pre-trained vision and language models answer visual information-seeking questions?

Y Chen, H Hu, Y Luan, H Sun, S Changpinyo… - arXiv preprint arXiv…, 2023 - arxiv.org
Pre-trained vision and language models have demonstrated state-of-the-art capabilities over
existing tasks involving images and texts, including visual question answering. However, it …

Encyclopedic VQA: Visual questions about detailed properties of fine-grained categories

T Mensink, J Uijlings, L Castrejon… - Proceedings of the …, 2023 - openaccess.thecvf.com
We propose Encyclopedic-VQA, a large-scale visual question answering (VQA) dataset
featuring visual questions about detailed properties of fine-grained categories and …

A comprehensive evaluation of GPT-4V on knowledge-intensive visual question answering

Y Li, L Wang, B Hu, X Chen, W Zhong, C Lyu… - arXiv preprint arXiv…, 2023 - arxiv.org
The emergence of multimodal large models (MLMs) has significantly advanced the field of
visual understanding, offering remarkable capabilities in the realm of visual question …

Weakly-supervised visual-retriever-reader for knowledge-based question answering

M Luo, Y Zeng, P Banerjee, C Baral - arXiv preprint arXiv:2109.04014, 2021 - arxiv.org
Knowledge-based visual question answering (VQA) requires answering questions with
external knowledge in addition to the content of images. One dataset that is mostly used in …

LaKo: Knowledge-driven visual question answering via late knowledge-to-text injection

Z Chen, Y Huang, J Chen, Y Geng, Y Fang… - Proceedings of the 11th …, 2022 - dl.acm.org
Visual question answering (VQA) often requires an understanding of visual concepts and
language semantics, which relies on external knowledge. Most existing methods exploit pre …