Deep learning for visual speech analysis: A survey
Visual speech, referring to the visual domain of speech, has attracted increasing attention
due to its wide applications, such as public security, medical treatment, military defense, and …
due to its wide applications, such as public security, medical treatment, military defense, and …
SeamlessM4T-Massively Multilingual & Multimodal Machine Translation
What does it take to create the Babel Fish, a tool that can help individuals translate speech
between any two languages? While recent breakthroughs in text-based models have …
between any two languages? While recent breakthroughs in text-based models have …
A review of recent advances on deep learning methods for audio-visual speech recognition
This article provides a detailed review of recent advances in audio-visual speech
recognition (AVSR) methods that have been developed over the last decade (2013–2023) …
recognition (AVSR) methods that have been developed over the last decade (2013–2023) …
AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation
This paper proposes a novel direct Audio-Visual Speech to Audio-Visual Speech
Translation (AV2AV) framework where the input and output of the system are multimodal (ie …
Translation (AV2AV) framework where the input and output of the system are multimodal (ie …
Audio–visual speech recognition based on regulated transformer and spatio–temporal fusion strategy for driver assistive systems
This article presents a research methodology for audio–visual speech recognition (AVSR) in
driver assistive systems. These systems necessitate ongoing interaction with drivers while …
driver assistive systems. These systems necessitate ongoing interaction with drivers while …
Textless Unit-to-Unit Training for Many-to-Many Multilingual Speech-to-Speech Translation
This paper proposes a textless training method for many-to-many multilingual speech-to-
speech translation that can also benefit the transfer of pre-trained knowledge to text-based …
speech translation that can also benefit the transfer of pre-trained knowledge to text-based …
DASS: Distilled Audio State Space Models are Stronger and More Duration-Scalable Learners
State-space models (SSMs) have emerged as an alternative to Transformers for audio
modeling due to their high computational efficiency with long inputs. While recent efforts on …
modeling due to their high computational efficiency with long inputs. While recent efforts on …
Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer
Humans are adept at leveraging visual cues from lip movements for recognizing speech in
adverse listening conditions. Audio-Visual Speech Recognition (AVSR) models follow …
adverse listening conditions. Audio-Visual Speech Recognition (AVSR) models follow …
Tailored Design of Audio-Visual Speech Recognition Models using Branchformers
Recent advances in Audio-Visual Speech Recognition (AVSR) have led to unprecedented
achievements in the field, improving the robustness of this type of system in adverse, noisy …
achievements in the field, improving the robustness of this type of system in adverse, noisy …
XLAVS-R: Cross-Lingual Audio-Visual Speech Representation Learning for Noise-Robust Speech Perception
Speech recognition and translation systems perform poorly on noisy inputs, which are
frequent in realistic environments. Augmenting these systems with visual signals has the …
frequent in realistic environments. Augmenting these systems with visual signals has the …