A review of deep learning techniques for speech processing

A Mehrish, N Majumder, R Bharadwaj, R Mihalcea… - Information …, 2023 - Elsevier
The field of speech processing has undergone a transformative shift with the advent of deep
learning. The use of multiple processing layers has enabled the creation of models capable …

Self-supervised speech representation learning: A review

A Mohamed, H Lee, L Borgholt… - IEEE Journal of …, 2022 - ieeexplore.ieee.org
Although supervised deep learning has revolutionized speech and audio processing, it has
necessitated the building of specialist models for individual tasks and application scenarios …

Ego4D: Around the world in 3,000 hours of egocentric video

K Grauman, A Westbury, E Byrne… - Proceedings of the …, 2022 - openaccess.thecvf.com
We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It
offers 3,670 hours of daily-life activity video spanning hundreds of scenarios (household …

Diffused Heads: Diffusion models beat GANs on talking-face generation

M Stypułkowski, K Vougioukas, S He… - Proceedings of the …, 2024 - openaccess.thecvf.com
Talking face generation has historically struggled to produce head movements and natural
facial expressions without guidance from additional reference videos. Recent developments …

Attention bottlenecks for multimodal fusion

A Nagrani, S Yang, A Arnab, A Jansen… - Advances in neural …, 2021 - proceedings.neurips.cc
Humans perceive the world by concurrently processing and fusing high-dimensional inputs
from multiple modalities such as vision and audio. Machine perception models, in stark …

What makes multi-modal learning better than single (provably)

Y Huang, C Du, Z Xue, X Chen… - Advances in Neural …, 2021 - proceedings.neurips.cc
The world provides us with data of multiple modalities. Intuitively, models fusing data from
different modalities outperform their uni-modal counterparts, since more information is …

CelebV-HQ: A large-scale video facial attributes dataset

H Zhu, W Wu, W Zhu, L Jiang, S Tang, L Zhang… - European conference on …, 2022 - Springer
Large-scale datasets have played indispensable roles in the recent success of face
generation/editing and significantly facilitated the advances of emerging research fields …

Neural target speech extraction: An overview

K Zmolikova, M Delcroix, T Ochiai… - IEEE Signal …, 2023 - ieeexplore.ieee.org
Humans can listen to a target speaker even in challenging acoustic conditions that have
noise, reverberation, and interfering speakers. This phenomenon is known as the cocktail …

Visual speech recognition for multiple languages in the wild

P Ma, S Petridis, M Pantic - Nature Machine Intelligence, 2022 - nature.com
Visual speech recognition (VSR) aims to recognize the content of speech based on lip
movements, without relying on the audio stream. Advances in deep learning and the …

Auto-AVSR: Audio-visual speech recognition with automatic labels

P Ma, A Haliassos, A Fernandez-Lopez… - ICASSP 2023-2023 …, 2023 - ieeexplore.ieee.org
Audio-visual speech recognition has received a lot of attention due to its robustness against
acoustic noise. Recently, the performance of automatic, visual, and audio-visual speech …