Multimodal research in vision and language: A review of current and emerging trends

S Uppal, S Bhagat, D Hazarika, N Majumder, S Poria… - Information …, 2022 - Elsevier
Deep Learning and its applications have cascaded impactful research and development
with a diverse range of modalities present in the real-world data. More recently, this has …

Affordances from human videos as a versatile representation for robotics

S Bahl, R Mendonca, L Chen… - Proceedings of the …, 2023 - openaccess.thecvf.com
Building a robot that can understand and learn to interact by watching humans has inspired
several vision problems. However, despite some successful results on static datasets, it …

Convolutional image captioning

J Aneja, A Deshpande… - Proceedings of the IEEE …, 2018 - openaccess.thecvf.com
Image captioning is an important task, applicable to virtual assistants, editing tools, image
indexing, and support of the disabled. In recent years significant progress has been made in …

Out of the box: Reasoning with graph convolution nets for factual visual question answering

M Narasimhan, S Lazebnik… - Advances in neural …, 2018 - proceedings.neurips.cc
Accurately answering a question about a given image requires combining observations with
general knowledge. While this is effortless for humans, reasoning with general knowledge …

Two causal principles for improving visual dialog

J Qi, Y Niu, J Huang, H Zhang - Proceedings of the IEEE …, 2020 - openaccess.thecvf.com
This paper unravels the design tricks adopted by us, the champion team MReaL-BDAI, for
Visual Dialog Challenge 2019: two causal principles for improving Visual Dialog (VisDial) …

Audio visual scene-aware dialog

H Alamri, V Cartillier, A Das, J Wang… - Proceedings of the …, 2019 - openaccess.thecvf.com
We introduce the task of scene-aware dialog. Our goal is to generate a complete and natural
response to a question about a scene, given video and audio of the scene and the history of …

Trends in integration of vision and language research: A survey of tasks, datasets, and methods

A Mogadala, M Kalimuthu, D Klakow - Journal of Artificial Intelligence …, 2021 - jair.org
Interest in Artificial Intelligence (AI) and its applications has seen unprecedented
growth in the last few years. This success can be partly attributed to the advancements made …

Large-scale pretraining for visual dialog: A simple state-of-the-art baseline

V Murahari, D Batra, D Parikh, A Das - European Conference on Computer …, 2020 - Springer
Prior work in visual dialog has focused on training deep neural models on VisDial in
isolation. Instead, we present an approach to leverage pretraining on related vision …

Removing bias in multi-modal classifiers: Regularization by maximizing functional entropies

I Gat, I Schwartz, A Schwing… - Advances in Neural …, 2020 - proceedings.neurips.cc
Many recent datasets contain a variety of different data modalities, for instance, image,
question, and answer data in visual question answering (VQA). When training deep net …

Reasoning visual dialogs with structural and partial observations

Z Zheng, W Wang, S Qi, SC Zhu - Proceedings of the IEEE …, 2019 - openaccess.thecvf.com
We propose a novel model to address the task of Visual Dialog which exhibits complex
dialog structures. To obtain a reasonable answer based on the current question and the …