Noise-robust zero-shot text-to-speech synthesis conditioned on self-supervised speech-representation model with adapters

K Fujita, H Sato, T Ashihara… - ICASSP 2024-2024 …, 2024 - ieeexplore.ieee.org
The zero-shot text-to-speech (TTS) method, based on speaker embeddings extracted from
reference speech using self-supervised learning (SSL) speech representations, can …

Zero-shot text-to-speech synthesis conditioned using self-supervised speech representation model

K Fujita, T Ashihara, H Kanagawa… - … on Acoustics, Speech …, 2023 - ieeexplore.ieee.org
This paper proposes a zero-shot text-to-speech (TTS) conditioned by a self-supervised
speech-representation model acquired through self-supervised learning (SSL) …

Controllable speech synthesis by learning discrete phoneme-level prosodic representations

N Ellinas, M Christidou, A Vioni, JS Sung… - Speech …, 2023 - Elsevier
In this paper, we present a novel method for phoneme-level prosody control of F0 and
duration using intuitive discrete labels. We propose an unsupervised prosodic clustering …

Isometric Neural Machine Translation using Phoneme Count Ratio Reward-based Reinforcement Learning

SR Mhaskar, NJ Shah, M Zaki, AP Gudmalwar… - arXiv preprint arXiv …, 2024 - arxiv.org
Traditional Automatic Video Dubbing (AVD) pipeline consists of three key modules, namely,
Automatic Speech Recognition (ASR), Neural Machine Translation (NMT), and Text-to …

Speech rhythm-based speaker embeddings extraction from phonemes and phoneme duration for multi-speaker speech synthesis

K Fujita, A Ando, Y Ijima - IEICE TRANSACTIONS on Information …, 2024 - search.ieice.org
This paper proposes a speech rhythm-based method for speaker embeddings to model
phoneme duration using a few utterances by the target speaker. Speech rhythm is one of the …

Analysis of Speech Temporal Dynamics in the Context of Speaker Verification and Voice Anonymization

N Tomashenko, E Vincent, M Tommasi - arXiv preprint arXiv:2412.17164, 2024 - arxiv.org
In this paper, we investigate the impact of speech temporal dynamics in application to
automatic speaker verification and speaker voice anonymization tasks. We propose several …

Incorporating Speaker's Speech Rate Features for Improved Voice Cloning

Q Zhe, I Katunobu - 2023 9th International Conference on …, 2023 - ieeexplore.ieee.org
We investigate a neural network-based text-to-speech (TTS) synthesis system that aims to
simulate the Mandarin voice of different speakers using short voice samples. Our system …

[HTML] Creating "Shido Twin" by Using Another Me Technology (NTT Digital Twin Computing Research Center, NTT Human Informatics Laboratories)

A Fukayama, R Ishii, A Morikawa, H Noto, S Eitoku… - rd.ntt
“Cho Kabuki 2022 Powered by NTT,” a kabuki play sponsored by Shochiku Co., Ltd., is the
first social implementation of Another Me, a technology for creating a human digital twin that …

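Speech pseudonymization considering prosodic features

伊藤葵, 伊藤克亘 - Proceedings of the 86th National Convention, 2024 - ipsj.ixsq.nii.ac.jp
Abstract: Protecting speaker privacy through speech pseudonymization makes it possible to make
effective use of information contained in the speech data itself (such as the speaker's intent) that cannot be read from a transcript. In this paper, …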

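Achieving natural voice cloning based on speech-rate modeling

秦哲, 伊藤克亘 - Proceedings of the 85th National Convention, 2023 - ipsj.ixsq.nii.ac.jp
Abstract: Voice cloning is a technology that extracts a speaker's characteristics to generate TTS
that speaks in that speaker's voice. In previous work on voice cloning, increasing the amount of input speech yields more natural …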