Self-supervised speech representation learning: A review
Although supervised deep learning has revolutionized speech and audio processing, it has
necessitated the building of specialist models for individual tasks and application scenarios …
Audio self-supervised learning: A survey
Similar to humans' cognitive ability to generalize knowledge and skills, self-supervised
learning (SSL) targets discovering general representations from large-scale data. This …
Attention bottlenecks for multimodal fusion
Humans perceive the world by concurrently processing and fusing high-dimensional inputs
from multiple modalities such as vision and audio. Machine perception models, in stark …
Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation
We present a joint audio-visual model for isolating a single speech signal from a mixture of
sounds such as other speakers and background noise. Solving this task using only audio as …
Look, listen and learn
We consider the question: what can be learnt by looking at and listening to a large number
of unlabelled videos? There is a valuable, but so far untapped, source of information …
Objects that sound
In this paper our objectives are, first, networks that can embed audio and visual inputs into a
common space that is suitable for cross-modal retrieval; and second, a network that can …
Audio-visual event localization in unconstrained videos
In this paper, we introduce a novel problem of audio-visual event localization in
unconstrained videos. We define an audio-visual event as an event that is both visible and …
Dual attention matching for audio-visual event localization
In this paper, we investigate the audio-visual event localization problem. This task is to
localize a visible and audible event in a video. Previous methods first divide a video into …
Learning representations from audio-visual spatial alignment
We introduce a novel self-supervised pretext task for learning representations from audio-
visual content. Prior work on audio-visual representation learning leverages …
TVLT: Textless vision-language transformer
In this work, we present the Textless Vision-Language Transformer (TVLT), where
homogeneous transformer blocks take raw visual and audio inputs for vision-and-language …