YourTTS: Towards zero-shot multi-speaker TTS and zero-shot voice conversion for everyone

E Casanova, J Weber, CD Shulby… - International …, 2022 - proceedings.mlr.press
YourTTS brings the power of a multilingual approach to the task of zero-shot multi-speaker
TTS. Our method builds upon the VITS model and adds several novel modifications for zero …

Deep speaker embeddings for Speaker Verification: Review and experimental comparison

M Jakubec, R Jarina, E Lieskovska, P Kasak - Engineering Applications of …, 2024 - Elsevier
The construction of speaker-specific acoustic models for automatic speaker recognition is
almost exclusively based on deep neural network-based speaker embeddings. This work …

Audio-visual person-of-interest deepfake detection

D Cozzolino, A Pianese, M Nießner… - Proceedings of the …, 2023 - openaccess.thecvf.com
Face manipulation technology is advancing very rapidly, and new methods are being
proposed day by day. The aim of this work is to propose a deepfake detector that can cope …

HierSpeech: Bridging the gap between text and speech by hierarchical variational inference using self-supervised representations for speech synthesis

SH Lee, SB Kim, JH Lee, E Song… - Advances in Neural …, 2022 - proceedings.neurips.cc
This paper presents HierSpeech, a high-quality end-to-end text-to-speech (TTS) system
based on a hierarchical conditional variational autoencoder (VAE) utilizing self-supervised …

VoxSRC 2021: The third VoxCeleb speaker recognition challenge

A Brown, J Huh, JS Chung, A Nagrani… - arXiv preprint arXiv …, 2022 - arxiv.org
The third instalment of the VoxCeleb Speaker Recognition Challenge was held in
conjunction with Interspeech 2021. The aim of this challenge was to assess how well current …

Deepfake audio detection by speaker verification

A Pianese, D Cozzolino, G Poggi… - 2022 IEEE International …, 2022 - ieeexplore.ieee.org
Thanks to recent advances in deep learning, sophisticated generation tools exist, nowadays,
that produce extremely realistic synthetic speech. However, malicious uses of such tools are …

The ins and outs of speaker recognition: lessons from VoxSRC 2020

Y Kwon, HS Heo, BJ Lee… - ICASSP 2021-2021 IEEE …, 2021 - ieeexplore.ieee.org
The VoxCeleb Speaker Recognition Challenge (VoxSRC) at Interspeech 2020 offers a
challenging evaluation for speaker recognition systems, which includes celebrities playing …

ZMM-TTS: Zero-shot multilingual and multispeaker speech synthesis conditioned on self-supervised discrete speech representations

C Gong, X Wang, E Cooper, D Wells… - … on Audio, Speech …, 2024 - ieeexplore.ieee.org
Neural text-to-speech (TTS) has achieved human-like synthetic speech for single-speaker,
single-language synthesis. Multilingual TTS systems are limited to resource-rich languages …

Transfer learning framework for low-resource text-to-speech using a large-scale unlabeled speech corpus

M Kim, M Jeong, BJ Choi, S Ahn, JY Lee… - arXiv preprint arXiv …, 2022 - arxiv.org
Training a text-to-speech (TTS) model requires a large scale text labeled speech corpus,
which is troublesome to collect. In this paper, we propose a transfer learning framework for …

AntiFake: Using adversarial audio to prevent unauthorized speech synthesis

Z Yu, S Zhai, N Zhang - Proceedings of the 2023 ACM SIGSAC …, 2023 - dl.acm.org
The rapid development of deep neural networks and generative AI has catalyzed growth in
realistic speech synthesis. While this technology has great potential to improve lives, it also …