A survey of embodied AI: From simulators to research tasks
There has been an emerging paradigm shift from the era of “internet AI” to “embodied AI,”
where AI algorithms and agents no longer learn from datasets of images, videos or text …
Ego4D: Around the world in 3,000 hours of egocentric video
We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It
offers 3,670 hours of daily-life activity video spanning hundreds of scenarios (household …
ThreeDWorld: A platform for interactive multi-modal physical simulation
We introduce ThreeDWorld (TDW), a platform for interactive multi-modal physical simulation.
TDW enables simulation of high-fidelity sensory data and physical interactions between …
Look, listen, and act: Towards audio-visual embodied navigation
A crucial ability of mobile intelligent agents is to integrate the evidence from multiple sensory
inputs in an environment and to make a sequence of actions to reach their goals. In this …
Sep-stereo: Visually guided stereophonic audio generation by associating source separation
Stereophonic audio is an indispensable ingredient to enhance human auditory experience.
Recent research has explored the usage of visual information as guidance to generate …
VisualEchoes: Spatial Image Representation Learning Through Echolocation
Several animal species (e.g., bats, dolphins, and whales) and even visually impaired humans
have the remarkable ability to perform echolocation: a biological sonar used to perceive …
Vision-language navigation: a survey and taxonomy
Vision-language navigation (VLN) tasks require an agent to follow language instructions
from a human guide to navigate in previously unseen environments using visual …
See, hear, explore: Curiosity via audio-visual association
Exploration is one of the core challenges in reinforcement learning. A common formulation
of curiosity-driven exploration uses the difference between the real future and the future …
Audio-visual floorplan reconstruction
Given only a few glimpses of an environment, how much can we infer about its entire
floorplan? Existing methods can map only what is visible or immediately apparent from …
Language-guided audio-visual source separation via trimodal consistency
We propose a self-supervised approach for learning to perform audio source separation in
videos based on natural language queries, using only unlabeled video and audio pairs as …