A survey of embodied AI: From simulators to research tasks

J Duan, S Yu, HL Tan, H Zhu… - IEEE Transactions on …, 2022 - ieeexplore.ieee.org
There has been an emerging paradigm shift from the era of “internet AI” to “embodied AI,”
where AI algorithms and agents no longer learn from datasets of images, videos, or text …

Ego4D: Around the world in 3,000 hours of egocentric video

K Grauman, A Westbury, E Byrne… - Proceedings of the …, 2022 - openaccess.thecvf.com
We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It
offers 3,670 hours of daily-life activity video spanning hundreds of scenarios (household …

ThreeDWorld: A platform for interactive multi-modal physical simulation

C Gan, J Schwartz, S Alter, D Mrowca… - arXiv preprint arXiv …, 2020 - arxiv.org
We introduce ThreeDWorld (TDW), a platform for interactive multi-modal physical simulation.
TDW enables simulation of high-fidelity sensory data and physical interactions between …

Look, listen, and act: Towards audio-visual embodied navigation

C Gan, Y Zhang, J Wu, B Gong… - … on Robotics and …, 2020 - ieeexplore.ieee.org
A crucial ability of mobile intelligent agents is to integrate the evidence from multiple sensory
inputs in an environment and to make a sequence of actions to reach their goals. In this …

Sep-Stereo: Visually guided stereophonic audio generation by associating source separation

H Zhou, X Xu, D Lin, X Wang, Z Liu - … , Glasgow, UK, August 23–28, 2020 …, 2020 - Springer
Stereophonic audio is an indispensable ingredient to enhance human auditory experience.
Recent research has explored the usage of visual information as guidance to generate …

VisualEchoes: Spatial image representation learning through echolocation

R Gao, C Chen, Z Al-Halah, C Schissler… - Computer Vision–ECCV …, 2020 - Springer
Several animal species (e.g., bats, dolphins, and whales) and even visually impaired humans
have the remarkable ability to perform echolocation: a biological sonar used to perceive …

Vision-language navigation: a survey and taxonomy

W Wu, T Chang, X Li, Q Yin, Y Hu - Neural Computing and Applications, 2024 - Springer
Vision-language navigation (VLN) tasks require an agent to follow language instructions
from a human guide to navigate in previously unseen environments using visual …

See, hear, explore: Curiosity via audio-visual association

V Dean, S Tulsiani, A Gupta - Advances in neural …, 2020 - proceedings.neurips.cc
Exploration is one of the core challenges in reinforcement learning. A common formulation
of curiosity-driven exploration uses the difference between the real future and the future …

Audio-visual floorplan reconstruction

S Purushwalkam, SVA Gari, VK Ithapu… - Proceedings of the …, 2021 - openaccess.thecvf.com
Given only a few glimpses of an environment, how much can we infer about its entire
floorplan? Existing methods can map only what is visible or immediately apparent from …

Language-guided audio-visual source separation via trimodal consistency

R Tan, A Ray, A Burns, BA Plummer… - Proceedings of the …, 2023 - openaccess.thecvf.com
We propose a self-supervised approach for learning to perform audio source separation in
videos based on natural language queries, using only unlabeled video and audio pairs as …