- Academic Search

Noise-robust zero-shot text-to-speech synthesis conditioned on self-supervised speech-representation model with adapters

K Fujita, H Sato, T Ashihara… - ICASSP 2024-2024 …, 2024 - ieeexplore.ieee.org

The zero-shot text-to-speech (TTS) method, based on speaker embeddings extracted from
reference speech using self-supervised learning (SSL) speech representations, can …

Cited by 11. All 3 versions.

[PDF] arxiv.org

Zero-shot text-to-speech synthesis conditioned using self-supervised speech representation model

K Fujita, T Ashihara, H Kanagawa… - … on Acoustics, Speech …, 2023 - ieeexplore.ieee.org

This paper proposes a zero-shot text-to-speech (TTS) conditioned by a self-supervised
speech-representation model acquired through self-supervised learning (SSL) …

Cited by 10. All 3 versions.

[PDF] arxiv.org

Controllable speech synthesis by learning discrete phoneme-level prosodic representations

N Ellinas, M Christidou, A Vioni, JS Sung… - Speech …, 2023 - Elsevier

In this paper, we present a novel method for phoneme-level prosody control of F0 and
duration using intuitive discrete labels. We propose an unsupervised prosodic clustering …

Cited by 5. All 4 versions.

[PDF] arxiv.org

Isometric Neural Machine Translation using Phoneme Count Ratio Reward-based Reinforcement Learning

SR Mhaskar, NJ Shah, M Zaki, AP Gudmalwar… - arXiv preprint arXiv …, 2024 - arxiv.org

Traditional Automatic Video Dubbing (AVD) pipeline consists of three key modules, namely,
Automatic Speech Recognition (ASR), Neural Machine Translation (NMT), and Text-to …

Cited by 1. All 3 versions.

[PDF] arxiv.org

Speech rhythm-based speaker embeddings extraction from phonemes and phoneme duration for multi-speaker speech synthesis

K Fujita, A Ando, Y Ijima - IEICE Transactions on Information …, 2024 - search.ieice.org

This paper proposes a speech rhythm-based method for speaker embeddings to model
phoneme duration using a few utterances by the target speaker. Speech rhythm is one of the …

Cited by 1. All 6 versions.

[PDF] arxiv.org

Analysis of Speech Temporal Dynamics in the Context of Speaker Verification and Voice Anonymization

N Tomashenko, E Vincent, M Tommasi - arXiv preprint arXiv:2412.17164, 2024 - arxiv.org

In this paper, we investigate the impact of speech temporal dynamics in application to
automatic speaker verification and speaker voice anonymization tasks. We propose several …

All 8 versions.

Incorporating Speaker's Speech Rate Features for Improved Voice Cloning

Q Zhe, I Katunobu - 2023 9th International Conference on …, 2023 - ieeexplore.ieee.org

We investigate a neural network-based text-to-speech (TTS) synthesis system that aims to
simulate the Mandarin voice of different speakers using short voice samples. Our system …

All 2 versions.

[HTML] rd.ntt

[HTML] Creating "Shido Twin" by Using Another Me Technology (NTT Digital Twin Computing Research Center, NTT Human Informatics Laboratories)

A Fukayama, R Ishii, A Morikawa, H Noto, S Eitoku… - rd.ntt

“Cho Kabuki 2022 Powered by NTT,” a kabuki play sponsored by Shochiku Co., Ltd., is the
first social implementation of Another Me, a technology for creating a human digital twin that …

[PDF] nii.ac.jp

Speech pseudonymization taking prosodic features into account

伊藤葵, 伊藤克亘 - Proceedings of the 86th IPSJ National Convention, 2024 - ipsj.ixsq.nii.ac.jp

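Abstract: By protecting speaker privacy through speech pseudonymization, information contained in the speech data itself that cannot be
read from a transcript (such as the speaker's intent) can be put to effective use. In this paper …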

All 3 versions.

[PDF] nii.ac.jp

Achieving natural voice cloning based on speech-rate modeling

秦哲, 伊藤克亘 - Proceedings of the 85th IPSJ National Convention, 2023 - ipsj.ixsq.nii.ac.jp

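Abstract: Voice cloning is a technique that extracts a speaker's characteristics in order to produce a TTS system that speaks in
that speaker's voice. In prior work on voice cloning, increasing the amount of input speech makes the result more natural …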

All 3 versions.

Saved in My library

Phoneme Duration Modeling Using Speech Rhythm-Based Speaker Embeddings for Multi-Speaker...
