SoundSpaces 2.0: A simulation platform for visual-acoustic learning
Abstract We introduce SoundSpaces 2.0, a platform for on-the-fly geometry-based audio
rendering for 3D environments. Given a 3D mesh of a real-world environment …
Semantic audio-visual navigation
Recent work on audio-visual navigation assumes a constantly-sounding target and restricts
the role of audio to signaling the target's position. We introduce semantic audio-visual …
Toward practical monocular indoor depth estimation
The majority of prior monocular depth estimation methods without groundtruth depth
guidance focus on driving scenarios. We show that such methods generalize poorly to …
Few-shot audio-visual learning of environment acoustics
Room impulse response (RIR) functions capture how the surrounding physical environment
transforms the sounds heard by a listener, with implications for various applications in AR …
Pathdreamer: A world model for indoor navigation
People navigating in unfamiliar buildings take advantage of myriad visual, spatial and
semantic cues to efficiently achieve their navigation goals. Towards equipping …
Move2Hear: Active audio-visual source separation
We introduce the active audio-visual source separation problem, where an agent must move
intelligently in order to better isolate the sounds coming from an object of interest in its …
A2Nav: Action-Aware Zero-Shot Robot Navigation by Exploiting Vision-and-Language Ability of Foundation Models
We study the task of zero-shot vision-and-language navigation (ZS-VLN), a practical yet
challenging problem in which an agent learns to navigate following a path described by …
Listening human behavior: 3D human pose estimation with acoustic signals
Given only acoustic signals without any high-level information, such as voices or sounds of
scenes/actions, how much can we infer about the behavior of humans? Unlike existing …
Context understanding in computer vision: A survey
Contextual information plays an important role in many computer vision tasks, such as object
detection, video action detection, image classification, etc. Recognizing a single object or …
Disentangled counterfactual learning for physical audiovisual commonsense reasoning
In this paper, we propose a Disentangled Counterfactual Learning (DCL) approach for
physical audiovisual commonsense reasoning. The task aims to infer objects' physics …