A review of recent advances on deep learning methods for audio-visual speech recognition

D Ivanko, D Ryumin, A Karpov - Mathematics, 2023 - mdpi.com
This article provides a detailed review of recent advances in audio-visual speech
recognition (AVSR) methods that have been developed over the last decade (2013–2023) …

Selftalk: A self-supervised commutative training diagram to comprehend 3d talking faces

Z Peng, Y Luo, Y Shi, H Xu, X Zhu, H Liu, J He… - Proceedings of the 31st …, 2023 - dl.acm.org
Speech-driven 3D face animation technique, extending its applications to various
multimedia fields. Previous research has generated promising realistic lip movements and …

Audio–visual speech recognition based on regulated transformer and spatio–temporal fusion strategy for driver assistive systems

D Ryumin, A Axyonov, E Ryumina, D Ivanko… - Expert Systems with …, 2024 - Elsevier
This article presents a research methodology for audio–visual speech recognition (AVSR) in
driver assistive systems. These systems necessitate ongoing interaction with drivers while …

Synthvsr: Scaling up visual speech recognition with synthetic supervision

X Liu, E Lakomkin, K Vougioukas… - Proceedings of the …, 2023 - openaccess.thecvf.com
Recently reported state-of-the-art results in visual speech recognition (VSR) often rely on
increasingly large amounts of video data, while the publicly available transcribed video …

Multilingual audio-visual speech recognition with hybrid CTC/RNN-T fast conformer

M Burchi, KC Puvvada, J Balam… - ICASSP 2024-2024 …, 2024 - ieeexplore.ieee.org
Humans are adept at leveraging visual cues from lip movements for recognizing speech in
adverse listening conditions. Audio-Visual Speech Recognition (AVSR) models follow …

Lip reading for low-resource languages by learning and combining general speech knowledge and language-specific knowledge

M Kim, JH Yeo, J Choi, YM Ro - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
This paper proposes a novel lip reading framework, especially for low-resource languages,
which has not been well addressed in the previous literature. Since low-resource languages …

A Study of Dropout-Induced Modality Bias on Robustness to Missing Video Frames for Audio-Visual Speech Recognition

Y Dai, H Chen, J Du, R Wang, S Chen… - Proceedings of the …, 2024 - openaccess.thecvf.com
Abstract Advanced Audio-Visual Speech Recognition (AVSR) systems have been observed
to be sensitive to missing video frames performing even worse than single-modality models …

Lost in Translation: Lip-Sync Deepfake Detection from Audio-Video Mismatch

M Bohacek, H Farid - … of the IEEE/CVF Conference on …, 2024 - openaccess.thecvf.com
Highly realistic voice cloning combined with AI-powered video manipulation allows for the
creation of compelling lip-sync deepfakes where anyone can be made to say things they …

TorchAudio 2.1: Advancing speech recognition, self-supervised learning, and audio processing components for PyTorch

J Hwang, M Hira, C Chen, X Zhang, Z Ni… - 2023 IEEE Automatic …, 2023 - ieeexplore.ieee.org
TorchAudio is an open-source audio and speech processing library built for PyTorch. It aims
to accelerate the research and development of audio and speech technologies by providing …

Mlca-avsr: Multi-layer cross attention fusion based audio-visual speech recognition

H Wang, P Guo, P Zhou, L **e - ICASSP 2024-2024 IEEE …, 2024 - ieeexplore.ieee.org
While automatic speech recognition (ASR) systems degrade significantly in noisy
environments, audio-visual speech recognition (AVSR) systems aim to complement the …