Survey: Transformer-based Models in Data Modality Conversion
Transformers have made significant strides across various artificial intelligence domains,
including natural language processing, computer vision, and audio processing. This …
AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation
This paper proposes a novel direct Audio-Visual Speech to Audio-Visual Speech
Translation (AV2AV) framework where the input and output of the system are multimodal (i.e. …
Efficient Training for Multilingual Visual Speech Recognition: Pre-training with Discretized Visual Speech Representation
This paper explores sentence-level multilingual Visual Speech Recognition (VSR) that can
recognize different languages with a single trained model. As the massive multilingual …
Multilingual visual speech recognition with a single model by learning with discrete visual speech units
This paper explores sentence-level Multilingual Visual Speech Recognition with a single
model for the first time. As the massive multilingual modeling of visual data requires huge …
TMT: Tri-modal translation between speech, image, and text by processing different modalities as different languages
The capability to jointly process multi-modal information is becoming an essential task.
However, the limited number of paired multi-modal data and the large computational …
Fusion Of Discrete Representations and Self-Augmented Representations for Multilingual Automatic Speech Recognition
Self-supervised learning (SSL) models have shown exceptional capabilities across various
speech-processing tasks. Continuous SSL representations are effective but suffer from high …
Bridging Facial Imagery and Vocal Reality: Stable Diffusion-Enhanced Voice Generation
Generating novel voices in speech synthesis is a challenging task with potential for creating
versatile voices that are needed in entertainment and research. One of the primary obstacles …
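Discretized Speech Self-Supervised Model Features for Multilingual Speech Recognition
王式珩 - Thesis, Graduate Institute of Communication Engineering, National Taiwan University, 2024 - tdr.lib.ntu.edu.tw
Abstract: Speech self-supervised learning models have demonstrated exceptional capabilities across various
speech-processing tasks. Training on the continuous features of these models performs strongly but is limited by high computational and storage costs. On the other hand, although using speech self-supervised …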