Deep learning for visual speech analysis: A survey

C Sheng, G Kuang, L Bai, C Hou, Y Guo… - … on Pattern Analysis …, 2024 - ieeexplore.ieee.org
Visual speech, referring to the visual domain of speech, has attracted increasing attention
due to its wide applications, such as public security, medical treatment, military defense, and …

SeamlessM4T-Massively Multilingual & Multimodal Machine Translation

L Barrault, YA Chung, MC Meglioli, D Dale… - arXiv preprint arXiv …, 2023 - arxiv.org
What does it take to create the Babel Fish, a tool that can help individuals translate speech
between any two languages? While recent breakthroughs in text-based models have …

A review of recent advances on deep learning methods for audio-visual speech recognition

D Ivanko, D Ryumin, A Karpov - Mathematics, 2023 - mdpi.com
This article provides a detailed review of recent advances in audio-visual speech
recognition (AVSR) methods that have been developed over the last decade (2013–2023) …

AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation

J Choi, SJ Park, M Kim, YM Ro - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
This paper proposes a novel direct Audio-Visual Speech to Audio-Visual Speech
Translation (AV2AV) framework where the input and output of the system are multimodal (i.e., …

Audio–visual speech recognition based on regulated transformer and spatio–temporal fusion strategy for driver assistive systems

D Ryumin, A Axyonov, E Ryumina, D Ivanko… - Expert Systems with …, 2024 - Elsevier
This article presents a research methodology for audio–visual speech recognition (AVSR) in
driver assistive systems. These systems necessitate ongoing interaction with drivers while …

Textless Unit-to-Unit Training for Many-to-Many Multilingual Speech-to-Speech Translation

M Kim, J Choi, D Kim, YM Ro - IEEE/ACM Transactions on …, 2024 - ieeexplore.ieee.org
This paper proposes a textless training method for many-to-many multilingual speech-to-
speech translation that can also benefit the transfer of pre-trained knowledge to text-based …

DASS: Distilled Audio State Space Models are Stronger and More Duration-Scalable Learners

S Bhati, Y Gong, L Karlinsky, H Kuehne… - 2024 IEEE Spoken …, 2024 - ieeexplore.ieee.org
State-space models (SSMs) have emerged as an alternative to Transformers for audio
modeling due to their high computational efficiency with long inputs. While recent efforts on …

Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer

M Burchi, KC Puvvada, J Balam… - ICASSP 2024-2024 …, 2024 - ieeexplore.ieee.org
Humans are adept at leveraging visual cues from lip movements for recognizing speech in
adverse listening conditions. Audio-Visual Speech Recognition (AVSR) models follow …

Tailored Design of Audio-Visual Speech Recognition Models using Branchformers

D Gimeno-Gómez, CD Martínez-Hinarejos - arXiv preprint arXiv …, 2024 - arxiv.org
Recent advances in Audio-Visual Speech Recognition (AVSR) have led to unprecedented
achievements in the field, improving the robustness of this type of system in adverse, noisy …

XLAVS-R: Cross-Lingual Audio-Visual Speech Representation Learning for Noise-Robust Speech Perception

HJ Han, M Anwar, J Pino, WN Hsu, M Carpuat… - arXiv preprint arXiv …, 2024 - arxiv.org
Speech recognition and translation systems perform poorly on noisy inputs, which are
frequent in realistic environments. Augmenting these systems with visual signals has the …