Survey: Transformer-based Models in Data Modality Conversion

E Rashno, A Eskandari, A Anand… - arXiv preprint arXiv …, 2024 - arxiv.org
Transformers have made significant strides across various artificial intelligence domains,
including natural language processing, computer vision, and audio processing. This …

AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation

J Choi, SJ Park, M Kim, YM Ro - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
This paper proposes a novel direct Audio-Visual Speech to Audio-Visual Speech
Translation (AV2AV) framework where the input and output of the system are multimodal (i.e. …

Efficient Training for Multilingual Visual Speech Recognition: Pre-training with Discretized Visual Speech Representation

M Kim, J Yeo, SJ Park, H Rha, YM Ro - Proceedings of the 32nd ACM …, 2024 - dl.acm.org
This paper explores sentence-level multilingual Visual Speech Recognition (VSR) that can
recognize different languages with a single trained model. As the massive multilingual …

Multilingual visual speech recognition with a single model by learning with discrete visual speech units

M Kim, JH Yeo, J Choi, SJ Park, YM Ro - arXiv preprint arXiv:2401.09802, 2024 - arxiv.org
This paper explores sentence-level Multilingual Visual Speech Recognition with a single
model for the first time. As the massive multilingual modeling of visual data requires huge …

TMT: Tri-modal translation between speech, image, and text by processing different modalities as different languages

M Kim, J Jung, H Rha, S Maiti, S Arora, X Chang… - arXiv preprint arXiv …, 2024 - arxiv.org
The capability to jointly process multi-modal information is becoming an essential task.
However, the limited number of paired multi-modal data and the large computational …

Fusion of Discrete Representations and Self-Augmented Representations for Multilingual Automatic Speech Recognition

SH Wang, J Shi, C Huang… - 2024 IEEE Spoken …, 2024 - ieeexplore.ieee.org
Self-supervised learning (SSL) models have shown exceptional capabilities across various
speech-processing tasks. Continuous SSL representations are effective but suffer from high …

Bridging Facial Imagery and Vocal Reality: Stable Diffusion-Enhanced Voice Generation

Y Lin, D Liu, Y Xu, H Suo, M Li - 2024 IEEE 14th International …, 2024 - ieeexplore.ieee.org
Generating novel voices in speech synthesis is a challenging task with potential for creating
versatile voices that are needed in entertainment and research. One of the primary obstacles …

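Discretized Speech Self-Supervised Model Features for Multilingual Speech Recognition

王式珩 - National Taiwan University, Graduate Institute of Communication Engineering, degree thesis, 2024 - tdr.lib.ntu.edu.tw
Abstract: Speech self-supervised learning models have demonstrated exceptional capabilities across various speech-processing tasks.
Training models on the continuous features of speech self-supervised models yields strong performance but is limited by high computational and storage costs. On the other hand, although using speech self-supervised …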