A review of recent advances on deep learning methods for audio-visual speech recognition
This article provides a detailed review of recent advances in audio-visual speech
recognition (AVSR) methods that have been developed over the last decade (2013–2023) …
SelfTalk: A self-supervised commutative training diagram to comprehend 3D talking faces
Speech-driven 3D face animation technique, extending its applications to various
multimedia fields. Previous research has generated promising realistic lip movements and …
Audio–visual speech recognition based on regulated transformer and spatio–temporal fusion strategy for driver assistive systems
This article presents a research methodology for audio–visual speech recognition (AVSR) in
driver assistive systems. These systems necessitate ongoing interaction with drivers while …
SynthVSR: Scaling up visual speech recognition with synthetic supervision
Recently reported state-of-the-art results in visual speech recognition (VSR) often rely on
increasingly large amounts of video data, while the publicly available transcribed video …
Multilingual audio-visual speech recognition with hybrid CTC/RNN-T fast conformer
Humans are adept at leveraging visual cues from lip movements for recognizing speech in
adverse listening conditions. Audio-Visual Speech Recognition (AVSR) models follow …
Lip reading for low-resource languages by learning and combining general speech knowledge and language-specific knowledge
This paper proposes a novel lip reading framework, especially for low-resource languages,
which has not been well addressed in the previous literature. Since low-resource languages …
A Study of Dropout-Induced Modality Bias on Robustness to Missing Video Frames for Audio-Visual Speech Recognition
Advanced Audio-Visual Speech Recognition (AVSR) systems have been observed
to be sensitive to missing video frames, performing even worse than single-modality models …
Lost in Translation: Lip-Sync Deepfake Detection from Audio-Video Mismatch
Highly realistic voice cloning combined with AI-powered video manipulation allows for the
creation of compelling lip-sync deepfakes where anyone can be made to say things they …
TorchAudio 2.1: Advancing speech recognition, self-supervised learning, and audio processing components for PyTorch
TorchAudio is an open-source audio and speech processing library built for PyTorch. It aims
to accelerate the research and development of audio and speech technologies by providing …
MLCA-AVSR: Multi-layer cross attention fusion based audio-visual speech recognition
While automatic speech recognition (ASR) systems degrade significantly in noisy
environments, audio-visual speech recognition (AVSR) systems aim to complement the …