VisQA: X-raying vision and language reasoning in transformers

T Jaunet, C Kervadec, R Vuillemot… - … on Visualization and …, 2021 - ieeexplore.ieee.org
Visual Question Answering systems target answering open-ended textual questions given
input images. They are a testbed for learning high-level reasoning with a primary use in HCI …

A critical analysis of benchmarks, techniques, and models in medical visual question answering

S Al-Hadhrami, MEB Menai, S Al-Ahmadi… - IEEE …, 2023 - ieeexplore.ieee.org
This paper comprehensively reviews medical VQA models, structures, and datasets,
focusing on combining vision and language. Over 75 models and their statistical and SWOT …

Unsupervised and pseudo-supervised vision-language alignment in visual dialog

F Chen, D Zhang, X Chen, J Shi, S Xu… - Proceedings of the 30th …, 2022 - dl.acm.org
Visual dialog requires models to give reasonable answers according to a series of coherent
questions and related visual concepts in images. However, most current work either focuses …

Weakly supervised relative spatial reasoning for visual question answering

P Banerjee, T Gokhale, Y Yang… - Proceedings of the …, 2021 - openaccess.thecvf.com
Vision-and-language (V&L) reasoning necessitates perception of visual concepts
such as objects and actions, understanding semantics and language grounding, and …

How transferable are reasoning patterns in VQA?

C Kervadec, T Jaunet, G Antipov… - Proceedings of the …, 2021 - openaccess.thecvf.com
Since its inception, Visual Question Answering (VQA) has been notorious as a
task where models are prone to exploiting biases in datasets to find shortcuts instead of …

Knowledge-embedded mutual guidance for visual reasoning

W Zheng, L Yan, L Chen, Q Li… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Visual reasoning between visual images and natural language is a long-standing challenge
in computer vision. Most of the methods aim to look for answers to questions only on the …

Self-attention guided representation learning for image-text matching

X Qi, Y Zhang, J Qi, H Lu - Neurocomputing, 2021 - Elsevier
Image-text matching plays an important role in bridging vision and language. Most existing
research works embed both images and sentences into a joint latent space to measure their …

RG-SAN: Rule-Guided Spatial Awareness Network for End-to-End 3D Referring Expression Segmentation

C Wu, Q Chen, J Ji, H Wang, Y Ma, Y Huang… - arXiv preprint arXiv …, 2024 - arxiv.org
3D Referring Expression Segmentation (3D-RES) aims to segment 3D objects by correlating
referring expressions with point clouds. However, traditional approaches frequently …

Webly supervised knowledge-embedded model for visual reasoning

W Zheng, L Yan, W Zhang… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Visual reasoning between visual images and natural language remains a long-standing
challenge in computer vision. Conventional deep supervision methods aim at finding …

Supervising the transfer of reasoning patterns in VQA

C Kervadec, C Wolf, G Antipov… - Advances in Neural …, 2021 - proceedings.neurips.cc
Abstract Methods for Visual Question Anwering (VQA) are notorious for leveraging dataset
biases rather than performing reasoning, hindering generalization. It has been recently …