Multimodal research in vision and language: A review of current and emerging trends
S Uppal, S Bhagat, D Hazarika, N Majumder, S Poria… - Information …, 2022 - Elsevier
Deep Learning and its applications have cascaded impactful research and development
with a diverse range of modalities present in the real-world data. More recently, this has …
How much can clip benefit vision-and-language tasks?
S Shen, LH Li, H Tan, M Bansal, A Rohrbach… - arXiv preprint arXiv …, 2021 - arxiv.org
Most existing Vision-and-Language (V&L) models rely on pre-trained visual encoders, using
a relatively small set of manually-annotated data (as compared to web-crawled data), to …
Embodied navigation with multi-modal information: A survey from tasks to methodology
Embodied AI aims to create agents that complete complex tasks by interacting with the
environment. A key problem in this field is embodied navigation which understands multi …
Episodic transformer for vision-and-language navigation
A Pashevich, C Schmid, C Sun - Proceedings of the IEEE …, 2021 - openaccess.thecvf.com
Interaction and navigation defined by natural language instructions in dynamic
environments pose significant challenges for neural agents. This paper focuses on …
Room-across-room: Multilingual vision-and-language navigation with dense spatiotemporal grounding
A Ku, P Anderson, R Patel, E Ie, J Baldridge - arXiv preprint arXiv …, 2020 - arxiv.org
We introduce Room-Across-Room (RxR), a new Vision-and-Language Navigation (VLN)
dataset. RxR is multilingual (English, Hindi, and Telugu) and larger (more paths and …
Airbert: In-domain pretraining for vision-and-language navigation
Vision-and-language navigation (VLN) aims to enable embodied agents to navigate in
realistic environments using natural language instructions. Given the scarcity of domain …
Vision-and-language navigation: A survey of tasks, methods, and future directions
J Gu, E Stefani, Q Wu, J Thomason… - arXiv preprint arXiv …, 2022 - arxiv.org
A long-term goal of AI research is to build intelligent agents that can communicate with
humans in natural language, perceive the environment, and perform real-world tasks. Vision …
Bird's-Eye-View Scene Graph for Vision-Language Navigation
Vision-language navigation (VLN), which requires an agent to navigate 3D
environments following human instructions, has shown great advances. However, current …
Vision-language navigation with self-supervised auxiliary reasoning tasks
Vision-Language Navigation (VLN) is a task where an agent learns to navigate
following a natural language instruction. The key to this task is to perceive both the visual …
Envedit: Environment editing for vision-and-language navigation
In Vision-and-Language Navigation (VLN), an agent needs to navigate through the
environment based on natural language instructions. Due to limited available data for agent …