Self-supervised speech representation learning: A review

A Mohamed, H Lee, L Borgholt… - IEEE Journal of …, 2022 - ieeexplore.ieee.org
Although supervised deep learning has revolutionized speech and audio processing, it has
necessitated the building of specialist models for individual tasks and application scenarios …

Audio self-supervised learning: A survey

S Liu, A Mallol-Ragolta, E Parada-Cabaleiro, K Qian… - Patterns, 2022 - cell.com
Similar to humans' cognitive ability to generalize knowledge and skills, self-supervised
learning (SSL) targets discovering general representations from large-scale data. This …

Attention bottlenecks for multimodal fusion

A Nagrani, S Yang, A Arnab, A Jansen… - Advances in neural …, 2021 - proceedings.neurips.cc
Humans perceive the world by concurrently processing and fusing high-dimensional inputs
from multiple modalities such as vision and audio. Machine perception models, in stark …

Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation

A Ephrat, I Mosseri, O Lang, T Dekel, K Wilson… - arXiv preprint arXiv …, 2018 - arxiv.org
We present a joint audio-visual model for isolating a single speech signal from a mixture of
sounds such as other speakers and background noise. Solving this task using only audio as …

Look, listen and learn

R Arandjelovic, A Zisserman - Proceedings of the IEEE …, 2017 - openaccess.thecvf.com
We consider the question: what can be learnt by looking at and listening to a large number
of unlabelled videos? There is a valuable, but so far untapped, source of information …

Objects that sound

R Arandjelovic, A Zisserman - Proceedings of the European …, 2018 - openaccess.thecvf.com
In this paper our objectives are, first, networks that can embed audio and visual inputs into a
common space that is suitable for cross-modal retrieval; and second, a network that can …

Audio-visual event localization in unconstrained videos

Y Tian, J Shi, B Li, Z Duan, C Xu - Proceedings of the …, 2018 - openaccess.thecvf.com
In this paper, we introduce a novel problem of audio-visual event localization in
unconstrained videos. We define an audio-visual event as an event that is both visible and …

Dual attention matching for audio-visual event localization

Y Wu, L Zhu, Y Yan, Y Yang - Proceedings of the IEEE/CVF …, 2019 - openaccess.thecvf.com
In this paper, we investigate the audio-visual event localization problem. This task is to
localize a visible and audible event in a video. Previous methods first divide a video into …

Learning representations from audio-visual spatial alignment

P Morgado, Y Li… - Advances in Neural …, 2020 - proceedings.neurips.cc
We introduce a novel self-supervised pretext task for learning representations from audio-
visual content. Prior work on audio-visual representation learning leverages …

TVLT: Textless vision-language transformer

Z Tang, J Cho, Y Nie, M Bansal - Advances in neural …, 2022 - proceedings.neurips.cc
In this work, we present the Textless Vision-Language Transformer (TVLT), where
homogeneous transformer blocks take raw visual and audio inputs for vision-and-language …