Lip reading for low-resource languages by learning and combining general speech knowledge and language-specific knowledge

M Kim, JH Yeo, J Choi, YM Ro - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
This paper proposes a novel lip reading framework, especially for low-resource languages,
which has not been well addressed in the previous literature. Since low-resource languages …

Exploring speech recognition, translation, and understanding with discrete speech units: A comparative study

X Chang, B Yan, K Choi, JW Jung, Y Lu… - ICASSP 2024 …, 2024 - ieeexplore.ieee.org
Speech signals, typically sampled at rates in the tens of thousands per second, contain
redundancies, evoking inefficiencies in sequence modeling. High-dimensional speech …

AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation

J Choi, SJ Park, M Kim, YM Ro - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
This paper proposes a novel direct Audio-Visual Speech to Audio-Visual Speech
Translation (AV2AV) framework where the input and output of the system are multimodal (ie …

Efficient Training for Multilingual Visual Speech Recognition: Pre-training with Discretized Visual Speech Representation

M Kim, J Yeo, SJ Park, H Rha, YM Ro - Proceedings of the 32nd ACM …, 2024 - dl.acm.org
This paper explores sentence-level multilingual Visual Speech Recognition (VSR) that can
recognize different languages with a single trained model. As the massive multilingual …

Speech sense disambiguation: Tackling homophone ambiguity in end-to-end speech translation

T Yu, X Liu, L Ding, K Chen, D Tao… - Proceedings of the 62nd …, 2024 - aclanthology.org
End-to-end speech translation (ST) presents notable disambiguation challenges as it
necessitates simultaneous cross-modal and cross-lingual transformations. While word …

Towards practical and efficient image-to-speech captioning with vision-language pre-training and multi-modal tokens

M Kim, J Choi, S Maiti, JH Yeo… - ICASSP 2024 …, 2024 - ieeexplore.ieee.org
In this paper, we propose methods to build a powerful and efficient Image-to-Speech
captioning (Im2Sp) model. To this end, we start with importing the rich knowledge related to …

The Interspeech 2024 Challenge on Speech Processing Using Discrete Units

X Chang, J Shi, J Tian, Y Wu, Y Tang, Y Wu… - arXiv preprint arXiv …, 2024 - arxiv.org
Representing speech and audio signals in discrete units has become a compelling
alternative to traditional high-dimensional feature vectors. Numerous studies have …

Multilingual visual speech recognition with a single model by learning with discrete visual speech units

M Kim, JH Yeo, J Choi, SJ Park, YM Ro - arXiv preprint arXiv:2401.09802, 2024 - arxiv.org
This paper explores sentence-level Multilingual Visual Speech Recognition with a single
model for the first time. As the massive multilingual modeling of visual data requires huge …

Translatotron 3: Speech to speech translation with monolingual data

E Nachmani, A Levkovitch, Y Ding… - ICASSP 2024 …, 2024 - ieeexplore.ieee.org
This paper presents Translatotron 3, a novel approach to unsupervised direct speech-to-
speech translation from monolingual speech-text datasets by combining masked …

TMT: Tri-modal translation between speech, image, and text by processing different modalities as different languages

M Kim, J Jung, H Rha, S Maiti, S Arora, X Chang… - arXiv preprint arXiv …, 2024 - arxiv.org
The capability to jointly process multi-modal information is becoming an essential task.
However, the limited number of paired multi-modal data and the large computational …