Lip reading for low-resource languages by learning and combining general speech knowledge and language-specific knowledge
This paper proposes a novel lip reading framework, especially for low-resource languages,
which has not been well addressed in the previous literature. Since low-resource languages …
which has not been well addressed in the previous literature. Since low-resource languages …
Exploring speech recognition, translation, and understanding with discrete speech units: A comparative study
Speech signals, typically sampled at rates in the tens of thousands per second, contain
redundancies, evoking inefficiencies in sequence modeling. High-dimensional speech …
redundancies, evoking inefficiencies in sequence modeling. High-dimensional speech …
AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation
This paper proposes a novel direct Audio-Visual Speech to Audio-Visual Speech
Translation (AV2AV) framework where the input and output of the system are multimodal (ie …
Translation (AV2AV) framework where the input and output of the system are multimodal (ie …
Efficient Training for Multilingual Visual Speech Recognition: Pre-training with Discretized Visual Speech Representation
This paper explores sentence-level multilingual Visual Speech Recognition (VSR) that can
recognize different languages with a single trained model. As the massive multilingual …
recognize different languages with a single trained model. As the massive multilingual …
Speech sense disambiguation: Tackling homophone ambiguity in end-to-end speech translation
End-to-end speech translation (ST) presents notable disambiguation challenges as it
necessitates simultaneous cross-modal and cross-lingual transformations. While word …
necessitates simultaneous cross-modal and cross-lingual transformations. While word …
Towards practical and efficient image-to-speech captioning with vision-language pre-training and multi-modal tokens
In this paper, we propose methods to build a powerful and efficient Image-to-Speech
captioning (Im2Sp) model. To this end, we start with importing the rich knowledge related to …
captioning (Im2Sp) model. To this end, we start with importing the rich knowledge related to …
The Interspeech 2024 Challenge on Speech Processing Using Discrete Units
Representing speech and audio signals in discrete units has become a compelling
alternative to traditional high-dimensional feature vectors. Numerous studies have …
alternative to traditional high-dimensional feature vectors. Numerous studies have …
Multilingual visual speech recognition with a single model by learning with discrete visual speech units
This paper explores sentence-level Multilingual Visual Speech Recognition with a single
model for the first time. As the massive multilingual modeling of visual data requires huge …
model for the first time. As the massive multilingual modeling of visual data requires huge …
Translatotron 3: Speech to speech translation with monolingual data
This paper presents Translatotron 3, a novel approach to unsupervised direct speech-to-
speech translation from monolingual speech-text datasets by combining masked …
speech translation from monolingual speech-text datasets by combining masked …
Tmt: Tri-modal translation between speech, image, and text by processing different modalities as different languages
The capability to jointly process multi-modal information is becoming an essential task.
However, the limited number of paired multi-modal data and the large computational …
However, the limited number of paired multi-modal data and the large computational …