Audio self-supervised learning: A survey
Similar to humans' cognitive ability to generalize knowledge and skills, self-supervised
learning (SSL) targets discovering general representations from large-scale data. This …
Attention bottlenecks for multimodal fusion
Humans perceive the world by concurrently processing and fusing high-dimensional inputs
from multiple modalities such as vision and audio. Machine perception models, in stark …
Wav2CLIP: Learning robust audio representations from CLIP
We propose Wav2CLIP, a robust audio representation learning method by distilling from
Contrastive Language-Image Pre-training (CLIP). We systematically evaluate Wav2CLIP on …
VisualVoice: Audio-visual speech separation with cross-modal consistency
We introduce a new approach for audio-visual speech separation. Given a video, the goal is
to extract the speech associated with a face in spite of simultaneous background sounds …
A closer look at weakly-supervised audio-visual source localization
Audio-visual source localization is a challenging task that aims to predict the location of
visual sound sources in a video. Since collecting ground-truth annotations of sounding …
Audio-visual grouping network for sound localization from mixtures
Sound source localization is a typical and challenging task that predicts the location of
sound sources in a video. Previous single-source methods mainly used the audio-visual …
Exploring cross-video and cross-modality signals for weakly-supervised audio-visual video parsing
The audio-visual video parsing task aims to temporally parse a video into audio or visual
event categories. However, it is labor intensive to temporally annotate audio and visual …
Sound source localization is all about cross-modal alignment
Humans can easily perceive the direction of sound sources in a visual scene, termed sound
source localization. Recent studies on learning-based sound source localization have …